Abstract: A method for identifying virtual visual information in at least two images from a first sequence of successive images of a visual scene comprising real visual information and said virtual visual information comprises the steps of performing feature detection on at least one of said at least two images; determining the movement of the detected features between said at least two images, thereby obtaining a set of movements; identifying which movements of said set pertain to movements in a substantially vertical plane, thereby obtaining a set of vertical movements; and relating the features pertaining to said vertical movements to said virtual visual information in said at least two images, such as to identify the virtual visual information. Arrangements for performing embodiments of the method are disclosed as well.
METHOD AND ARRANGEMENT FOR IDENTIFYING VIRTUAL VISUAL
INFORMATION IN IMAGES
The present invention relates to a method and arrangement for
identifying virtual visual information in at least two images from a sequence of
successive images of a visual scene comprising real visual information and said
virtual visual information.
When capturing a real-world scene using one or more cameras, it is
desirable to only capture the scene objects that are in fact present, and not
presented there virtually e.g. by projection. An example may be a future video
conferencing system for enabling a video conference between several people,
physically located in several distinct meeting rooms. In such a system a virtual
environment in which all participants are placed may be represented by
projection on a screen or rendered onto one or more of the available
visualization devices present in the real meeting rooms. To capture the needed
information e.g. which persons are participating, their movements, their
expressions, etc, such as to enable the rendering of this virtual environment,
cameras are used which are placed in the different meeting rooms. However
these cameras not only track the real people and objects in the rooms, but also
the people and objects as virtually rendered e.g. on these large screens within
these same meeting rooms. While the real people need of course to be tracked
to enable a better videoconferencing experience, their projections should not, or
should at least be filtered out in a subsequent step.
Possible existing solutions to this problem make use of fixed
positioned visualization devices cooperating with calibrated cameras which can
result in simple rules in order to filter out the unwanted visual information. This
can be used for traditional screens, with fixed positions within the meeting
rooms.
A problem with this solution is that this only works for relatively static
scenes, whose composition is known in advance. This solution also requires
manual calibration steps, which present a drawback in these situations requiring
easy deployability. Another drawback relates to the fact that, irrespective of the
content, an area of the captured images, corresponding to the screen area of the
projected virtual content, will be filtered out. While this may be appropriate for
older types of screen, it may not be appropriate anymore for newer screen
technologies such as e.g. translucent screens that only become opaque at certain
areas when there is something that needs to be displayed e.g. in the event of
display of a cut-out video of a person talking. In this case the area that is
allocated as being 'virtual' for a certain camera is not so at all instances in time.
Moving cameras are furthermore difficult to support using this solution.
An object of embodiments of the present invention is therefore
to provide a method for identifying the virtual visual information within at least
two images of a sequence of successive images of a visual scene comprising real
visual information and said virtual visual information, but which does not present
the inherent drawbacks of the prior art methods.
According to embodiments of the invention this object is achieved by
the method comprising the steps of
- performing feature detection on at least one of said at least two
images,
- determining the movement of the detected features between said at
least two images, thereby obtaining a set of movements,
- identifying which movements of said set pertain to movements in a
substantially vertical plane, thereby identifying a set of vertical movements,
- relating the features pertaining to said vertical movements to said
virtual visual information in said at least two images, such as to identify the
virtual visual information.
In this way, detection of movements of features in a vertical plane will
be used to identify virtual content of the image parts associated with these
features. These features can be recognized objects, such as human beings, or a
table, or a wall, a screen, a chair, or parts thereof such as mouths, ears, eyes, etc.
These features can also be corners, or lines, or gradients, or more complex
features such as the ones provided by algorithms such as the well-known scale
invariant feature transform algorithm.
As the virtual screen information within the meeting rooms will
generally contain images of the meeting participants, which usually show some
movements, e.g. by speaking, writing, turning their heads etc, and as the
position of the screen can be considered as substantially vertical, detection of
movements lying in a vertical plane, hereafter denoted as vertical movements,
can be a simple way of identifying the virtual visual content on the images as the
real movements of the real, thus non-projected, people are generally
three-dimensional movements, thus not lying in a vertical plane. The thus identified
virtual visual information can then be further filtered out from the images in a
next image or video processing step.
In an embodiment of the method the vertical movements are
identified as movements of said set of movements which are related by a
homography to movements of a second set of movements pertaining to said
features, said second set of movements being obtained from at least two other
images from a second sequence of images, and pertaining to the same timing
instances as said at least two images of said first sequence of images.
As determining homographies between two sets of movements is a
rather straightforward and simple operation, these embodiments allow for an
easy detection of movements in a vertical plane. These movements generally
correspond to movements projected on vertical screens, which are thus
representative for movements of the virtual visual information.
The first set of movements is determined on the first video sequence,
while the second set of movements is either determined from a second sequence
of images of the same scene, taken by a second camera, or, alternatively, from a
predetermined sequence only containing the virtual information. This
predetermined sequence may e.g. correspond to the sequence to be projected
on the screen, and may be provided to the arrangement by means of a separate
video or TV channel.
By comparing the movements of the first sequence with those of the
second sequence, and identifying which ones are homographically related, it can
be deduced that the movements having a homographical relationship with
some movements of the second sequence are movements in a plane,
as this is a characteristic of homographical relationships. If it is known from
scene information that no other movements in a plane are present e.g. all
persons are just moving while yet still being seated around the table, it may be
concluded that the detected movements are those which correspond to the
movements on the screen, thus corresponding to the movements lying in a
vertical plane as no other movements in a plane will be present.
In case however people are also moving around the meeting room,
movements may also be detected on the horizontal plane of the floor. For these
situations an extra filtering step of filtering out the horizontal movements, or
alternatively, an extra selection step of selecting only the movements in a vertical
plane from all movements detected in a plane, may be appropriate.
Once the vertical movements are found, the respective image parts
pertaining to the corresponding features of these vertical movements may then
be identified as the virtual visual information.
It is to be remarked that verticality is to be determined relative to a
horizontal reference plane, which e.g. may correspond to the floor of the
meeting room or to the horizontal reference plane of the first camera. Tolerances
on the vertical angle, which is typically 90 degrees with respect to this reference
horizontal plane, are typically 10 degrees above and below these 90 degrees.
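The tolerance described above can be sketched as a small check. This is only an illustrative sketch: it assumes that the orientation of a candidate plane is already available as a normal vector expressed in a frame whose "up" axis is the normal of the horizontal reference plane, and the function name and parameters are hypothetical.

```python
import math

def is_substantially_vertical(normal, up=(0.0, 0.0, 1.0), tolerance_deg=10.0):
    """Return True if the plane with this normal is within tolerance of vertical.

    The angle between the plane normal and the 'up' axis equals the angle
    between the plane and the horizontal reference plane, so a perfectly
    vertical plane gives 90 degrees.
    """
    dot = sum(n * u for n, u in zip(normal, up))
    norm_n = math.sqrt(sum(n * n for n in normal))
    norm_u = math.sqrt(sum(u * u for u in up))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_n * norm_u)))
    tilt_deg = math.degrees(math.acos(cos_angle))
    return abs(tilt_deg - 90.0) <= tolerance_deg
```

With the default tolerance of 10 degrees this accepts planes tilted between 80 and 100 degrees with respect to the horizontal reference plane.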
The present invention relates as well to embodiments of an
arrangement for performing the present method embodiments, to a
computer program product incorporating code for performing the present
method, and to an image analyzer incorporating such an arrangement.
It is to be noticed that the term 'coupled', used in the claims, should
not be interpreted as being limitative to direct connections only. Thus, the scope
of the expression 'a device A coupled to a device B' should not be limited to
devices or systems wherein an output of device A is directly connected to an input
of device B. It means that there exists a path between an output of A and an
input of B which may be a path including other devices or means.
It is to be noticed that the term 'comprising', used in the claims,
should not be interpreted as being limitative to the means listed thereafter. Thus,
the scope of the expression 'a device comprising means A and B' should not be
limited to devices consisting only of components A and B. It means that with
respect to the present invention, the only relevant components of the device are
A and B.
The above and other objects and features of the invention will become
more apparent and the invention itself will be best understood by referring to the
following description of an embodiment taken in conjunction with the
accompanying drawings wherein:
Fig. 1 shows a high level schematic embodiment of a first variant of
the method,
Figs. 2a-b show more detailed implementations of module 200 of
Fig. 1,
Figs. 3-6 show more detailed implementations of other variants of the
method.
The description and drawings merely illustrate the principles of the
invention. It will thus be appreciated that those skilled in the art will be able to
devise various arrangements that, although not explicitly described or shown
herein, embody the principles of the invention and are included within its spirit
and scope. Furthermore, all examples recited herein are principally intended
expressly to be only for pedagogical purposes to aid the reader in understanding
the principles of the invention and the concepts contributed by the inventor(s) to
furthering the art, and are to be construed as being without limitation to such
specifically recited examples and conditions. Moreover, all statements herein
reciting principles, aspects, and embodiments of the invention, as well as specific
examples thereof, are intended to encompass equivalents thereof.
It should be appreciated by those skilled in the art that any block
diagrams herein represent conceptual views of illustrative circuitry embodying the
principles of the invention. Similarly, it will be appreciated that any flow charts,
flow diagrams, state transition diagrams, pseudo code, and the like represent
various processes which may be substantially represented in computer readable
medium and so executed by a computer or processor, whether or not such
computer or processor is explicitly shown.
Fig. 1 shows a high level schematic scheme of a first embodiment of
the method. On two images I0t0 and I0ti from a sequence of images movement
features are extracted. The sequence of images is provided or recorded by one
source, e.g. a standalone or built-in video camera, a webcam, etc., denoted source
0. The respective images are taken or selected from this sequence, in steps
denoted 100 and 101, at two instances in time, these timing instances being
denoted t0 and ti. Both instances in time are sufficiently separated from each
other in order to detect meaningful movement. This may comprise movement of
human beings, but also other movements of e.g. other items in the meeting
rooms. Typical values are between 0.1 and 2 seconds.
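As an illustration of steps 100 and 101, the selection of two images that are sufficiently separated in time could be sketched as follows. The sketch assumes that each image of the sequence carries a timestamp in seconds, in increasing order; the function name and the way timestamps are supplied are hypothetical.

```python
def select_frame_pair(timestamps, min_gap=0.1, max_gap=2.0):
    """Return indices (i, j) of the first two frames whose separation in time
    lies between min_gap and max_gap seconds, or None if no such pair exists.

    timestamps must be sorted in increasing order.
    """
    for i, t0 in enumerate(timestamps):
        for j in range(i + 1, len(timestamps)):
            gap = timestamps[j] - t0
            if gap > max_gap:
                break  # later frames are even further away from frame i
            if gap >= min_gap:
                return i, j
    return None
```

The default bounds reflect the typical values of 0.1 to 2 seconds mentioned above; they would normally be tuned to the scene and the camera frame rate.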
Movement feature extraction takes place in step 200. These movement
features can relate to movements of features, such as motion vectors themselves,
or can alternatively relate to the aggregate begin and endpoints of these motion
vectors pertaining to a single feature, thus more related to the features related to
movements themselves. Methods for determining these movements of features
are explained with reference to Fig. 2.
Once these movements of features are determined, it is to be checked
in step 300 whether these pertain to vertical movements, in this document thus
meaning movements in a vertical plane. A vertical plane is defined as relative to
a horizontal reference plane, within certain tolerances. This horizontal reference
plane may e.g. correspond to the floor of the meeting room, or to the horizontal
reference plane of the camera or source providing the first sequence of images.
Typical values for the angle of the vertical plane are 80 to 100 degrees with
respect to this reference horizontal plane.
How this determination of vertical movements is done will be explained with
reference to e.g. Fig. 3. Vertical movements are searched for because
the virtual information which is to be identified usually relates to
images of humans or their avatars as projected on a vertical screen. Thus
detecting vertical movements makes it possible to identify the projected
images/representations of the people in the room, which will then be identified
as virtual information.
Methods for determining whether the movements of features are lying
in a vertical plane will be described with reference to Figs 3-4.
Once the movements of features in a vertical plane are determined,
these features are to be identified and related back to their respective image
parts of the captured images of the source. This is done in steps 400 and 500.
These image parts will then accordingly be identified or marked as being virtual
information, which can be filtered out, if appropriate.
Figs. 2a-b show more detailed embodiments for extracting the
movements of features. In a first stage, in steps 201 and 202, features are detected and
extracted on the two images I0t0 and I0ti. Features can relate to objects, but also
to more abstract items such as corners, lines, gradients, or more complex
features such as the ones provided by algorithms such as the scale invariant
feature transform algorithm, abbreviated as SIFT. Feature extraction can be done
using standard methods such as a Canny edge corner detector or the previously
mentioned SIFT method. As both images I0t0 and I0ti are coming from a same
sequence provided by a single source recording a same scene, it is possible to
detect movements by identifying similar or matching features in both images. It
is however also possible (not shown on these figures) to only detect features on
one of the images, and then to determine the movement of these features by the
traditional way of determining the motion vectors for all pixels belonging to the
detected feature of this image, by conventional block matching techniques for
determining motion vectors between pixels or macroblocks.
In the embodiments depicted in Figs. 2a-b feature extraction is thus
performed on both images and the displacement between matched features then
provides the movement or motion vectors between the matched features. This
can be a single motion vector per feature, e.g. the displacement of the gravity
point of a matching object, or can alternatively be a group of motion vectors, for
identifying the displacement of the pixels forming the object. This can also be the
case for the alternative method wherein feature extraction is only performed on
one image, and the displacement of all pixels forming this feature is calculated.
Also in this case one single motion vector can be selected out of this group, for
representing the movement vector of the feature.
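The matching of features between the two images and the derivation of one motion vector per matched feature could be sketched as below. This is a simplified illustration only: it assumes that each feature is given as a (position, descriptor) pair, that matching is done by nearest descriptor distance, and that the names and the distance threshold are hypothetical; a real detector such as SIFT would supply much longer descriptors.

```python
import math

def match_and_displace(features_t0, features_t1, max_descriptor_distance=0.5):
    """Match features of the first image to those of the second image and
    return one (start_position, motion_vector) pair per matched feature."""
    movements = []
    for pos0, desc0 in features_t0:
        best_pos, best_dist = None, max_descriptor_distance
        for pos1, desc1 in features_t1:
            d = math.dist(desc0, desc1)  # Euclidean descriptor distance
            if d <= best_dist:
                best_pos, best_dist = pos1, d
        if best_pos is not None:
            vector = (best_pos[0] - pos0[0], best_pos[1] - pos0[1])
            movements.append((pos0, vector))
    return movements
```

Features without a sufficiently similar counterpart in the second image are simply left out of the resulting set of movements.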
In Figs. 2a-b feature matching and corresponding determination of
the movement of the feature between one image and the other is performed in
step 203, thus resulting in one or more motion vectors per matched feature. This
result is denoted movement vectors in Figs. 2a-b. In order to only select
meaningful movements an optional filtering step 204 can be present. This can
be used for e.g. filtering out small movements which can be e.g. attributed to
noise. This filtering step usually takes place by eliminating all detected
movements which lie below a certain threshold value, this threshold value
generally being related to the camera characteristics.
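A minimal sketch of the optional filtering step 204 is given below. It assumes that each movement is represented as a (start_position, motion_vector) pair in pixel units; the function name and the default threshold are hypothetical, since the appropriate threshold depends on the camera characteristics.

```python
import math

def filter_small_movements(movements, threshold_px=1.5):
    """Keep only movements whose motion vector length reaches the threshold."""
    return [(start, vector) for start, vector in movements
            if math.hypot(vector[0], vector[1]) >= threshold_px]
```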
The result of this optional filtering step is a set of motion vectors which can
be representative of meaningful movements, thus lying above a certain noise
threshold. These movement vectors can be provided as such, as is the case in
Fig. 2a, or, in an alternative embodiment as in Fig. 2b, it may be appropriate
to aggregate begin and end-points of the motion vectors, per feature.
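The aggregation of begin and end-points per feature, as in the variant of Fig. 2b, might look as follows. The data layout is an assumption made for illustration: motion vectors are grouped per feature identifier, each as a (start_position, motion_vector) pair.

```python
def aggregate_endpoints(vectors_per_feature):
    """For every feature, collect the begin points and the end points of all
    motion vectors belonging to that feature."""
    aggregated = {}
    for feature_id, vectors in vectors_per_feature.items():
        begins = [start for start, _ in vectors]
        ends = [(sx + dx, sy + dy) for (sx, sy), (dx, dy) in vectors]
        aggregated[feature_id] = {"begin": begins, "end": ends}
    return aggregated
```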
In a next stage, the thus detected movements of features, or
alternatively features related to movements of features, are then to undergo a
check for determining whether they pertain to movements in a vertical plane.
Fig. 3 shows a preferred embodiment for determining whether these
movements of features are lying in a vertical plane. In the embodiment of Fig. 3
this is done by means of identifying whether homographical relationships exist
between the identified movements of features, and a second set of movements of
these same features. This second set of movements can be determined in a
similar way, from a second sequence of images of the same scene, recorded by
a second camera or source. This embodiment is shown in Fig. 3, wherein this
second source is denoted source 1, and the images selected from that second
source are denoted I1t0 and I1ti. Images I0t0 and I1t0 are to be taken at the
same instance in time, denoted t0. The same holds for images I0ti and I1ti, the
timing instance here being denoted ti.
Alternatively this second sequence can be provided externally, e.g.
from a composing application, which is adapted to create the virtual sequence
for being projected on the vertical screen. This composing application may be
provided to the arrangement as the source providing the contents to be
displayed on the screen, and thus only contains the virtual information, e.g. a
virtual scene of all people meeting together in one large meeting room. From
this sequence only containing virtual information again images at instances t0
and ti are to be captured, upon which feature extraction and feature movement
determination operations are performed. Both identified sets of movements of
features are then submitted to a step of determining whether homographical
relationships exist between several movements of both sets. The presence of a
homographical relationship is indicative of belonging to a same plane. In this
respect several sets of movements, each respective set associated to a respective
plane will be obtained. Fig. 3 shows an example of how such homographical
relationships can be obtained, namely using the well-known RANSAC, being the
abbreviation of Random Sample Consensus, algorithm. However alternative
methods such as exhaustive searching can also be used.
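The search for a homographical relationship by means of RANSAC can be illustrated with the following sketch. Several simplifying assumptions are made here that are not taken from the description: each movement is reduced to point correspondences (a point of the feature in the first sequence paired with the same point in the second sequence), the homography is normalised so that its last entry equals 1, only a single plane is searched for, and all function names and default parameters are hypothetical.

```python
import math
import random

def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting (None if singular)."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_homography(pairs):
    """Fit a homography (9 values, row-major, last value 1) to four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in pairs:
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return None if h is None else h + [1.0]

def project(h, point):
    """Map a point through homography h; None if it maps to infinity."""
    x, y = point
    w = h[6] * x + h[7] * y + h[8]
    if abs(w) < 1e-12:
        return None
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)

def ransac_plane(pairs, iterations=200, tolerance=1.0, seed=0):
    """Return the indices of the largest set of pairs related by one homography."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        h = fit_homography(rng.sample(pairs, 4))  # minimal sample: 4 pairs
        if h is None:
            continue
        inliers = []
        for index, (p, q) in enumerate(pairs):
            mapped = project(h, p)
            if mapped is not None and math.dist(mapped, q) <= tolerance:
                inliers.append(index)
        if len(inliers) > len(best):
            best = inliers
    return best
```

To obtain several sets of movements, each associated with its own plane, one possible approach is to remove the inliers found and to run the same search again on the remaining pairs.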
The result of this step is thus one or more sets of movements, each set
pertaining to a movement in a plane. This may be followed by an optional
filtering or selection step of only selecting these sets of movements pertaining to
a vertical plane, especially for these situations where also movements in another
plane are to be expected. This may for instance be the case for people walking
in the room, which will also create movement on the horizontal floor.
In some embodiments the orientation of the plane relative to the
camera, which may be supposed to be horizontally positioned, thus
representing a reference horizontal plane, can be calculated from the
homography by means of homography decomposition methods which are
known to a person skilled in the art and are for instance disclosed in
http://hal.archives-ouvertes.fr/docs/00/! 7/47/39/PDF/RR-6303.pdf. These
techniques can then be used for selecting the vertical movements from the group
of all movements in a plane.
Upon determination of the vertical movements, the features to which
they relate are again determined, followed by their mapping onto the respective
parts in the images I0t0 and I0ti, which image parts are then to be identified as
pertaining to virtual information.
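The mapping of the features with vertical movements back onto image parts could be sketched as a simple mask. This is only one possible representation, assumed here for illustration: every feature position is marked together with a small square neighbourhood, and the function name and the radius are hypothetical.

```python
def mark_virtual_regions(width, height, virtual_points, radius=2):
    """Return a height x width boolean mask in which the neighbourhood of every
    feature position identified as virtual is set to True."""
    mask = [[False] * width for _ in range(height)]
    for x, y in virtual_points:
        cx, cy = int(round(x)), int(round(y))
        # Clip the square neighbourhood to the image borders.
        for row in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for col in range(max(0, cx - radius), min(width, cx + radius + 1)):
                mask[row][col] = True
    return mask
```

The marked image parts can then be filtered out in a subsequent image or video processing step, if appropriate.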
In case of an embodiment using a second camera or source recording
the same scene, the identified vertical movements may also be related back to
features and image parts in images I1t0 and I1ti.
Fig. 4 shows a similar embodiment as Fig. 3, but including an extra
step of aggregation with previous instances. This aggregation step uses features
determined in previous instances in time, which may be helpful during the
determination of the homographies.
Fig. 5 shows another embodiment, but wherein several instances in
time e.g. several frames of a video sequence, of both sources, are tracked for
finding matching features. A composite motion vector, resulting from
tracking individual movements of individual features, will then result for both
sequences. Homographical relationships will then be searched for the features
moving along the composite path. This has the advantage of having the
knowledge that features within the same movement path should be in the same
homography. This reduces the degrees of freedom of the problem, facilitating an
easier resolution of the features that are related by homographies.
Fig. 6 shows an example of how such a composite motion vector can be
used, by tracking the features along the movement path. This makes it possible to perform
intermediate filtering operations e.g. for movements which are too small.
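Tracking a feature along its movement path, with intermediate filtering of movements which are too small, may be sketched as follows. The sketch assumes that the positions of one feature in the successive frames are already known; the function name and the minimum step size are hypothetical.

```python
import math

def composite_motion(track, min_step=0.5):
    """Build the movement path of one feature from its positions in successive
    frames, skipping steps shorter than min_step, and return the path together
    with the composite motion vector from the first to the last path point."""
    path = [track[0]]
    for point in track[1:]:
        if math.dist(path[-1], point) >= min_step:
            path.append(point)
    total = (path[-1][0] - path[0][0], path[-1][1] - path[0][1])
    return path, total
```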
While the principles of the invention have been described above in
connection with specific apparatus, it is to be clearly understood that this
description is made only by way of example and not as a limitation on the scope
of the invention, as defined in the appended claims.
CLAIMS
1. Method for identifying virtual visual information in at least two
images from a first sequence of successive images of a visual scene comprising
real visual information and said virtual visual information, said method
comprising the steps of
- performing feature detection on at least one of said at least two
images,
- determining the movement of the detected features between said at
least two images, thereby obtaining a set of movements,
- identifying which movements of said set pertain to movements in a
substantially vertical plane, thereby obtaining a set of vertical movements,
- relating the features pertaining to said vertical movements to said
virtual visual information in said at least two images, such as to identify the
virtual visual information.
2. Method according to claim 1, wherein vertical movements are
identified as movements of said set of movements which are related by a
homography to movements of a second set of movements pertaining to said
features, said second set of movements being obtained from at least two other
images from a second sequence of images, and pertaining to the same timing
instances as said at least two images of said first sequence of images.
3. Method according to claim 2 wherein said second sequence of
images is provided by a second camera recording said same visual scene.
4. Method according to claim 2 wherein said at least two images of
said second sequence of images comprise only said virtual information.
5. Method according to claim 2 further comprising a step of selecting
movements related by a homography within a vertical plane.
6. Method according to claim 1 further comprising a step of
selecting said at least two images from said first sequence on the basis of a
separation in time from each other such as to enable movement determination of
said features.
7. Method according to claim 1 wherein said substantially vertical
plane has a tilting angle between 80 and 100 degrees with respect to a
horizontal reference plane of said scene.
8. Arrangement for identifying virtual visual information in at least two
images from a first sequence of successive images of a visual scene comprising
real visual information and said virtual visual information, said arrangement
being adapted to receive said first sequence of successive images and to
- perform feature detection on at least one of said at least two images,
- determine the movement of the detected features between said at
least two images, thereby obtaining a set of movements,
- identify which movements of said set pertain to movements in a
substantially vertical plane, thereby obtaining a set of vertical movements,
- relate the features pertaining to said vertical movements to said
virtual visual information in said at least two images, such as to identify the
virtual visual information.
9. Arrangement according to claim 8, being further adapted to
identify vertical movements as movements of said set related by a homography
to movements of a second set of movements pertaining to said features, whereby
said arrangement is further adapted to obtain said second set of movements
from at least two other images from a second sequence provided to said
arrangement, and pertaining to the same timing instances as said at least two
images of said first sequence.
10. Arrangement according to claim 9 being further adapted to
receive said second sequence of images from a second camera simultaneously
recording said same visual scene as a first camera providing said first sequence
of images to said arrangement.
11. Arrangement according to claim 9 wherein said second sequence
of images only comprises said virtual information such that said arrangement is
adapted to receive said second sequence of images from a video source
registered with said arrangement as only providing said virtual information.
12. Arrangement according to claim 9 further being adapted to select
movements related by a homography within a vertical plane.
13. Arrangement according to claim 8 further being adapted to select
said at least two images from said first sequence such that said at least two
images are separated in time from each other such as to enable movement
determination of said features.
14. Arrangement according to claim 8 wherein said substantially
vertical plane has a tilting angle between 80 and 100 degrees with respect
to a horizontal reference plane of said scene.
15. Computer program comprising software adapted to perform any
of the steps as set out in any of the previous claims 1-7 when executed on a
data-processing apparatus.
| # | Name | Date |
|---|---|---|
| 1 | 2300-DELNP-2013.pdf | 2013-03-20 |
| 2 | 2300-delnp-2013-Form-3-(20-06-2013).pdf | 2013-06-20 |
| 3 | 2300-delnp-2013-Correspondence-Others-(20-06-2013).pdf | 2013-06-20 |
| 4 | 2300-delnp-2013-GPA.pdf | 2013-08-20 |
| 5 | 2300-delnp-2013-Form-5.pdf | 2013-08-20 |
| 6 | 2300-delnp-2013-Form-3.pdf | 2013-08-20 |
| 7 | 2300-delnp-2013-Form-2.pdf | 2013-08-20 |
| 8 | 2300-delnp-2013-Form-18.pdf | 2013-08-20 |
| 9 | 2300-delnp-2013-Form-1.pdf | 2013-08-20 |
| 10 | 2300-delnp-2013-Correspondence-Others.pdf | 2013-08-20 |
| 11 | 2300-delnp-2013-Claims.pdf | 2013-08-20 |
| 12 | 2300-delnp-2013-Form-3-(23-09-2013).pdf | 2013-09-23 |
| 13 | 2300-delnp-2013-Correspondence Others-(23-09-2013).pdf | 2013-09-23 |
| 14 | 2300-delnp-2013-Form-3-(25-02-2014).pdf | 2014-02-25 |
| 15 | 2300-delnp-2013-Correspondence-Others-(25-02-2014).pdf | 2014-02-25 |
| 16 | 2300-delnp-2013-Form-3-(08-04-2015).pdf | 2015-04-08 |
| 17 | 2300-delnp-2013-Correspondence Others-(08-04-2015).pdf | 2015-04-08 |
| 18 | 2300-DELNP-2013-FER.pdf | 2018-12-27 |
| 19 | 2300-DELNP-2013-AbandonedLetter.pdf | 2019-10-15 |
| 1 | searchstrategy_26-12-2018.pdf | |