"A Method For Optimizing Audio Visual Data In A Virtual Multi Conference Environment And A System Thereof"



Abstract

The invention provides a method for optimizing audio visual data received from an endpoint in a virtual multi conferencing arrangement. The method includes identifying an input stream transmitted by an endpoint, splitting the input stream into an audio stream and/or a video stream, processing the audio stream to obtain a mixed audio stream, and designating a video stream as the active video stream. The method also includes collating the mixed audio stream with the active video stream to form a composite stream for transmission to a plurality of endpoints across the virtual multi conferencing arrangement. A system for integrating the audio visual data and optimizing it for seamless transmission through a virtual multi conferencing environment is also provided.

Information

Application ID 996/MUM/2011
Invention Field COMMUNICATION
Date of Application 2011-03-30
Publication Number 05/2013

Applicants

Name Address Country Nationality
GREAT SOFTWARE LABORATORY PVT. LTD #8, 74-75, ANAND PARK - AUNDH, PUNE - 411 007, INDIA India India

Inventors

Name Address Country Nationality
ATUL NARKHEDE #8, 74-75, ANAND PARK - AUNDH, PUNE - 411 007, INDIA India India
AVIJIT SEN MAJUMDAR #8, 74-75, ANAND PARK - AUNDH, PUNE - 411 007, INDIA India India

Specification

FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENTS RULES, 2003
COMPLETE SPECIFICATION
[See section 10]
A METHOD FOR OPTIMISING AUDIO-VISUAL DATA IN A VIRTUAL MULTI CONFERENCE ENVIRONMENT AND A SYSTEM THEREOF
GREAT SOFTWARE LABORATORY, A COMPANY ESTABLISHED UNDER THE COMPANIES ACT, 1956, WHOSE ADDRESS IS # 8, 74-75, ANAND PARK - AUNDH, PUNE - 411 007, INDIA
AN INDIAN NATIONAL
THE FOLLOWING SPECIFICATION PARTICULARLY DESCRIBES THE NATURE OF THIS INVENTION AND MANNER IN WHICH IT IS TO BE PERFORMED.

A METHOD FOR OPTIMISING AUDIO-VISUAL DATA IN A VIRTUAL MULTI CONFERENCE ENVIRONMENT AND A SYSTEM THEREOF
FIELD OF THE INVENTION
The present invention generally relates to multi conferencing environments. More particularly, embodiments of the invention relate to optimization methods for integrating audio visual data in a multi conferencing environment.
BACKGROUND
Multipoint Conferencing Units, referred to herein as MCUs, generally use specialized hardware because the heavy audio mixing and video compositing involved places a large load on the Central Processing Unit (CPU). Rich streaming multimedia based collaboration and communication tools are now widely available for online meeting, training and discussion purposes. With the proliferation of internet connectivity, more and more people are using these online tools to meet their collaboration needs. There are primarily three ways in which the infrastructure of a multimedia collaboration platform can work. In one form of implementation, organizations set up MCU hardware appliances, mostly with embedded ASICs, with supported endpoints across multiple locations to support multimedia sessions.
In another form of implementation, service providers put up a hosted data center and distribute software endpoints running on internet connected computing or mobile devices. In order to support a large number of multimedia sessions in the data center, service providers use specialized hardware, high power CPUs, GPUs, etc., to provide hardware acceleration for CPU intensive media operations. Alternatively, a peer to peer (P2P) architecture can be employed, in which media from all the endpoints can potentially be available at all the endpoints at the expense of excess bandwidth consumption; here CPU time is borrowed from each endpoint to perform the decoding, mixing and compositing. The various forms of implementation described herein above display a dependency on the CPU for enabling seamless integration and transmission of audio-video data. In a virtual environment, however, the above described methods cannot be easily deployed, as it is not possible to add custom hardware to the virtual cloud infrastructure of a cloud-service provider. Hence there is a need for a method that can operate in a general purpose virtual machine based environment, one which uses purely software techniques and eliminates the need for hardware acceleration for the mixing and transmission of audio-video data. This enables the use of standard virtual machines on the cloud, lowering the cost of the service as well as permitting easy scalability to handle more multimedia traffic.
SUMMARY
One aspect of the invention provides a method for optimizing audio visual data received from an endpoint in a virtual multi conferencing arrangement. The method includes identifying an input stream transmitted by an endpoint, splitting the input stream into an audio stream and/or a video stream, processing the audio stream to obtain a mixed audio stream and designating a video stream as the active video stream. The method also includes collating the mixed audio stream with the active video stream to form a composite stream for transmission to a plurality of endpoints across the virtual multi conferencing arrangement.
Another aspect of the invention provides a system for optimizing the audio visual data from an endpoint and transmitting an optimized and time synchronized composite audio and/or video stream to a plurality of endpoints. A stream processor located in the virtual multi conferencing system generates the optimized composite data. The stream processor includes a congestion handler, which streamlines the received audio visual data. A splitter coupled to the congestion handler splits the received audio visual data into an audio stream and a video stream. An audio mixer coupled to the splitter receives and mixes the split audio stream. A video analyzer coupled to the splitter detects the split video stream to designate an active video stream. The stream processor also includes a stream compositor for receiving the mixed audio stream from the audio mixer and the active video stream designated by the video analyzer, and for collating the mixed audio stream with the active video stream to form a composite stream.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to various embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIG. 1 shows the various components of the system that employs routines for optimization of multiplex data, according to an embodiment of the invention.
FIG. 2 illustrates the stream compositor in a VMCU (Virtual Multi-Conference Unit), according to an embodiment of the invention.
FIG. 3 illustrates the components of a stream compositor and their interoperability, according to an embodiment of the invention.
FIG. 4 illustrates stream compositing timelines, according to an embodiment of the invention.
FIG. 5 illustrates a flow chart of stream compositing method, according to an embodiment of the invention.
FIG. 6 illustrates CPU optimization routine, according to an embodiment of the invention.
DETAILED DESCRIPTION OF INVENTION
Various embodiments of the invention provide a method for integrating audio visual data in a virtual multi conferencing environment. One embodiment of the invention provides an optimization method for integrating audio visual data in a virtual MCU (VMCU). FIG. 1 shows the various components of the system that employs routines for optimization of multiplex data, according to an embodiment of the invention. As in other multimedia collaboration platforms, streams 102 originate from connected endpoints 101, 109. Alternatively, the streams can also originate from the VMCU server 113. Subsequent to their point of origin they are typically delivered over the public internet. Each endpoint 101, 109 has the capability to capture audio and video, packetize and timestamp them within a stream context and then send them to the VMCU server 113. Similarly, endpoints 101, 109 are configured to accept streams 107, 108 from the VMCU server 113 using internet connectivity and then either play (for audio) or render (for video) them using the endpoint's multimedia capability. The VMCU server 113 controls the receipt and transmission of streams through a controller 118. Further, the VMCU server includes a stream compositor 115. The stream compositor 115 comprises a plurality of stream processors 116 for processing the streams received by the VMCU server 113 from the endpoint 109. The stream compositor 115 also includes an audio mixer 117 for mixing the audio received from the endpoint. Further components of the stream compositor 115, along with their connections and functions, shall be described in detail herein below.
FIG. 2 illustrates the stream compositor of a VMCU, according to an embodiment of the invention. The stream compositor 201 comprises a plurality of stream processors SP1, SP2, SP3. Each of the stream processors SP1, SP2 and SP3 is configured to receive audio visual data from at least one endpoint (not shown). The input audio visual data is converted to a mixed audio stream 203 through an audio mixer 202 coupled to the stream compositor 201. Further, the stream processor also generates an active video stream 204. The stream compositor 201 generates a composite stream 205 from the mixed audio 203 and the active video 204.
FIG. 3 illustrates the components of at least one Stream Processor (SP) of the VMCU, according to an embodiment of the invention. In any given instance, a VMCU can have a plurality of stream processors. Each of the stream processors employed in the VMCU includes a congestion handler 302, a stream splitter 303, an audio mixer 305 and a stream compositor 306. In one embodiment of the invention, SP1 is a stream processor in the VMCU receiving at least one input stream 301 from an endpoint. The input stream 301 is received by the congestion handler 302 located within SP1. The congestion handler 302 is configured for receiving the inbound stream 301, detecting the load of the inbound stream and dropping stream packets arriving late at the congestion handler 302. In one embodiment of the invention, the late arrival of stream packets could be due to network delay. A splitter 303 is coupled to the congestion handler 302. The splitter 303 receives the decongested and pruned stream from the congestion handler 302. The splitter 303 is configured for separating the incoming stream packet into at least one audio component 311, referred to herein as A(t), and at least one video component 313, referred to herein as V(t). A(t) is forwarded to an audio mixer, whereas V(t) may not take part in further stream processing unless it is chosen as the active video stream of the session.
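By way of illustration only (this sketch is not part of the original specification), the congestion-handling and splitting stage could be expressed in Python roughly as follows; the packet fields, class names and the 200 ms drop threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Packet:
    kind: str       # "audio" or "video"
    ts: int         # sender timestamp, in milliseconds
    payload: bytes  # encoded media; the payload is never decoded here

class CongestionHandler:
    """Drops stream packets arriving late, e.g. due to network delay."""

    def __init__(self, max_delay_ms: int = 200):  # assumed threshold
        self.max_delay_ms = max_delay_ms

    def accept(self, pkt: Packet, arrival_ts: int) -> Optional[Packet]:
        # Late packets are pruned to keep the interaction real time.
        if arrival_ts - pkt.ts > self.max_delay_ms:
            return None
        return pkt

def split(pkt: Packet) -> Tuple[Optional[Packet], Optional[Packet]]:
    """Separates a stream packet into its audio component A(t) and its
    video component V(t); exactly one side of the pair is populated."""
    return (pkt, None) if pkt.kind == "audio" else (None, pkt)

# Example: a packet stamped at t = 1000 ms arriving 50 ms late passes
# through; the same packet arriving 300 ms late would be dropped.
handler = CongestionHandler()
accepted = handler.accept(Packet("audio", 1000, b"\x00"), arrival_ts=1050)
if accepted is not None:
    a, v = split(accepted)  # a is A(t); v is None for an audio packet
```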
A second stream processor SP2 in the VMCU receives an input stream 308, which is processed in a manner identical to the one described herein for SP1. The input stream received at SP2 is split into an audio stream A(t') and a video stream V(t'). In one embodiment of the invention, V(t') is chosen as the active stream. The designated active video stream V(t') from SP2, along with the mixed audio A(t) from SP1, is routed to a stream compositor located on the stream processor SP1. The stream compositor component of SP1 is configured to collate the output mixed audio stream A(T), received from the audio mixer of SP1, with the active video V(t'), received from SP2, without any decoding-encoding operation of the video codec. Further to the collation of the audio stream A(t) and the corresponding video stream V(t'), the stream compositor is also configured to derive a composite output stream (A(T), V(T)). The content of V(t') is essentially the same as that of V(T); only the timestamps are aligned to the time base T.
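Because no decode or encode takes place, the collation amounts to timestamp arithmetic on packet headers. A minimal sketch of such realignment, under the same assumed packet model as above (the class name and method are hypothetical): the offset between the source time base t' and the composite time base T is fixed when the first packet of a stream is seen, so later inter-packet gaps are preserved exactly while the encoded payload passes through untouched.

```python
class TimeBaseAligner:
    """Maps source timestamps t' onto the composite time base T."""

    def __init__(self):
        self.offset = None  # fixed when the first packet arrives

    def align(self, src_ts: int, composite_now: int) -> int:
        if self.offset is None:
            # Pin the source timeline to "now" on the composite timeline.
            self.offset = composite_now - src_ts
        return src_ts + self.offset

# Example: two video packets stamped t' = 190 and t' = 590 at the source.
aligner = TimeBaseAligner()
print(aligner.align(190, composite_now=240))  # -> 240, i.e. T+240
print(aligner.align(590, composite_now=280))  # -> 640, the 400 ms gap kept
```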
FIG. 4 illustrates the stream compositing timelines, according to an example of the invention. More particularly, FIG. 4 explains the timestamp realignment achieved at the stream compositor, according to the example of the invention. Any time synchronized multimedia stream handled by the system can be represented by the notation S(t) = (A(t), V(t)), where A and V stand for audio and video respectively. By nature, audio packets are generated uniformly after every configured time interval. In one embodiment of the invention, the configured time interval includes, but is not limited to, intervals of 20 milliseconds (ms), 40 ms and 80 ms. The video packets, however, are not equidistant on the time scale, and the generation of video depends on at least one of the factors mentioned herein below, or a combination thereof. The factors determining the generation of video include motion in the picture frame, change in the picture and the codec determined key frame interval. The stream compositor receives at least one mixed audio stream and one active video stream as input. In an instance of the example of the invention, audio packets are available at every 40 ms interval within the mixed audio stream. Unlike audio, video packets are not uniformly distributed in time, and the stream compositor is configured to handle the inter packet time difference within the video stream while performing realignment of the video stream with the mixed audio stream at the stream processor. The process of compositing the mixed audio stream with the active video stream comprises the following steps:
Initiation step: The stream compositing is initiated by assigning a value for the time of receipt of the first stream from the audio mixer of a stream processor. The time can be set to any value, and the value set marks the beginning of the input of at least one mixed audio stream and a video stream for the initiation of stream compositing. In an example of the invention, let T be the time at the initiation of the stream compositing, at which the stream compositor receives at least one mixed audio stream from the audio mixer of the stream processor. The stream compositor then detects the presence of active video at the assigned time T. If no active video packet is detected at the stream compositor at a given instant T, the stream compositor transmits only the mixed audio received at the stream compositor.
Appending step: The stream compositor, after receiving the first set of mixed audio stream and/or video stream, initializes the received stream as the first stream and transmits the same to the endpoint. The stream compositor is configured to check the input streams received at the stream processor at regular intervals of time. The intervals of time can be predefined. In one example of the invention, the intervals of time set can be multiples of the initially set time value. In an embodiment of the invention, the time interval for checking the subsequent streams of mixed audio and/or video arriving at the stream compositor is set at T + x, wherein x is a predefined value. At a time T+40, the value of which is predefined at the stream compositor, the stream compositor checks for the next set of mixed audio and/or video to be received. The mixed audio is appended to the first received mixed audio, whereas the video packet received at the stream compositor at T+40 is timestamped at T+40 with the corresponding audio packet and transmitted to the endpoint. Subsequent to every receipt, analysis and transmission of composited mixed audio and video, the stream compositor is configured to estimate the time of arrival of the next input stream at the stream compositor. The stream compositor estimates the time of arrival of the next input stream and accordingly sets the subsequent time interval for appending the input streams and transmitting a composited stream to the endpoint. In an example of the invention, the stream compositor estimates the time of arrival of the next input stream for stream compositing at 190 ms and sets a time interval greater than T+230, at which the timestamp of the incoming video stream is to be matched with the corresponding mixed audio and composited accordingly for uninterrupted transmission. In effect, the video packet is buffered and at T+80 only audio packets are sent. The buffered video packet finally gets delivered within the composite stream at time T+240. The new video packet, which has arrived at T+240, is scheduled for delivery at T+640 based on the time difference (t'+590) - (t'+190) = 400.
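To make the arithmetic of this example concrete, the following small sketch (illustrative only; the helper name and the 240 ms first delivery are taken from the example above, not from any general rule) reproduces the schedule: the buffered packet stamped t'+190 is emitted at T+240, and the next packet, stamped t'+590, is scheduled 400 ms later at T+640.

```python
def schedule_video(base_T: int, first_delivery: int, src_stamps: list) -> list:
    """Returns delivery times on the composite timeline T, preserving the
    inter-packet gaps of the source timeline t'."""
    deliveries = [base_T + first_delivery]
    for prev, cur in zip(src_stamps, src_stamps[1:]):
        # e.g. (t'+590) - (t'+190) = 400 ms between deliveries
        deliveries.append(deliveries[-1] + (cur - prev))
    return deliveries

print(schedule_video(base_T=0, first_delivery=240, src_stamps=[190, 590]))
# -> [240, 640]
```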

FIG. 5 illustrates the process of stream compositing achieved in a VMCU, according to an embodiment of the invention. The primary driver of the process is the audio mixer output, which is typically available at every 40 ms interval, whereas active video packets are buffered and then inserted within the composite stream. The video timeline V(t) is initialized with the current composite stream time T when the first active video packet becomes available. For each audio packet the SC increments the composite timeline and uses it for audio timestamping. Additionally, it checks against the video timeline to determine whether there is any video packet available for delivery; if so, it timestamps the video packet with the composite timeline. It then looks for the next video packet available in the buffer and, if one is present, increments the video timeline accordingly.
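A compact sketch of this loop, continuing the assumptions of the earlier fragments (hypothetical names; a 40 ms audio cadence; the video buffer holds packets already realigned to due times on T, e.g. by the TimeBaseAligner above):

```python
from collections import deque

TICK_MS = 40  # assumed audio mixer output cadence

def composite_loop(mixed_audio, video_buffer: deque, T: int = 0):
    """Yields the composite stream: the mixed audio output drives the
    composite timeline T, and a buffered video packet is inserted as
    soon as its due time on T has been reached, as in FIG. 5."""
    for audio_payload in mixed_audio:
        T += TICK_MS                          # one audio packet per tick
        yield ("audio", T, audio_payload)     # audio stamped with T
        while video_buffer and video_buffer[0][0] <= T:
            due, video_payload = video_buffer.popleft()
            yield ("video", T, video_payload) # due video stamped with T

# Example: three audio ticks and one video packet due at T = 80.
buf = deque([(80, b"keyframe")])
for item in composite_loop([b"a1", b"a2", b"a3"], buf):
    print(item)
# ('audio', 40, b'a1'), ('audio', 80, b'a2'),
# ('video', 80, b'keyframe'), ('audio', 120, b'a3')
```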
The stream compositor is also configured for handling non-synchronization of the audio stream being received at the stream compositor. In the case of an active video switch, the source timestamps of the video stream change completely, so to keep the alignment right the stream compositor removes the existing active video stream state before handling the new timeline. In the case of network congestion, the delayed audio and video packets are deliberately dropped to keep the real time nature of the interaction intact. This may result in the amount of video dropped being higher, in time measure, than the amount of audio dropped. In such conditions the stream compositor can apply at least one of the following correction mechanisms: (a) the stream compositor may send signals to forcefully drop buffered audio packets if they have not already been mixed; (b) the stream compositor may apply a time shift to the video packets arriving after the congestion window, before inserting them within the composite stream, to ensure that the amount of video is in synchronization with the amount of audio in time measure, as predefined by the stream compositor.
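The two corrections could be sketched as follows, under the same assumed packet model (the function, its arguments and the policy of applying (a) before (b) are illustrative choices, not taken from the specification):

```python
def correct_after_congestion(unmixed_audio: list, video_packets: list,
                             excess_video_ms: int, audio_pkt_ms: int = 40):
    """Re-balances the timelines after a congestion window in which more
    video than audio (in time measure) was dropped.  video_packets holds
    (due_ts, payload) pairs arriving after the window."""
    if excess_video_ms <= 0:
        return video_packets
    # (a) force-drop buffered audio packets that have not been mixed yet.
    droppable = min(len(unmixed_audio), excess_video_ms // audio_pkt_ms)
    del unmixed_audio[:droppable]
    excess_video_ms -= droppable * audio_pkt_ms
    # (b) time-shift the remaining imbalance onto post-congestion video.
    return [(ts - excess_video_ms, p) for ts, p in video_packets]

# Example: 120 ms more video than audio was dropped and two unmixed
# audio packets exist: both are dropped (80 ms), video shifts by 40 ms.
audio = [b"u1", b"u2"]
print(correct_after_congestion(audio, [(400, b"v")], excess_video_ms=120))
# -> [(360, b'v')]; audio is now []
```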
INDUSTRIAL APPLICATION: Conference Stream Compositor
To accomplish further optimization, a dedicated SC is kept for handling the outbound composite streams for muted participants within a session. A muted participant essentially does not have an inbound stream of any significance, so sharing the same outbound stream from a single SC makes sense to minimize CPU consumption. This kind of dedicated SC is called a Conference Stream Compositor in VMCU terminology. In a webinar kind of interaction, most of the participants are muted, since there is mostly one, and at most 2-3, active speakers at any time. This separate, pre-computed stream for muted participants can be reused for most of the participants without any additional computational overhead.
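A minimal sketch of that reuse (the class and method names are hypothetical): the composite packet is computed once per tick by the shared compositor and merely fanned out to every muted participant, so the per-participant cost is reduced to a send.

```python
class Endpoint:
    """Stand-in for a connected, muted participant."""
    def __init__(self, name: str):
        self.name = name

    def send(self, packet: bytes):
        print(f"{self.name} <- {packet!r}")

class ConferenceStreamCompositor:
    """One shared SC whose single output is reused for all muted
    participants instead of being recomputed once per connection."""
    def __init__(self, muted_endpoints):
        self.muted_endpoints = list(muted_endpoints)

    def deliver(self, composite_packet: bytes):
        for endpoint in self.muted_endpoints:   # cheap fan-out, no re-mix
            endpoint.send(composite_packet)

csc = ConferenceStreamCompositor(Endpoint(f"viewer{i}") for i in range(3))
csc.deliver(b"composite@T+40")
```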
Scaling by CPU Optimization
Figure 6 compares the units of generally available MCUs and those of the VMCU to clearly indicate the saving of CPU resources. For any general MCU, the stream compositing process involves both audio and video decode and encode for each connection. Typically, audio decode/encode can consume 4% of the CPU of a standard cloud based VM instance, whereas video decode/encode can take anywhere from 25% to 40% of the available CPU resource. That means effectively only 2-3 connections can be supported by such a process running on a single virtual machine instance. This very low number leads to an almost non scalable situation for a software only MCU solution. The VMCU innovation, however, solves this problem by completely dropping the video decode/encode units, avoiding the most CPU heavy operation. The re-timestamping of video does not consume any significant CPU resource. So, going by the same numbers, a single VMCU process on a standard virtual machine instance can support 25 connections, which is a 10 fold increase and makes the software only MCU feasible for real world scaling. Moreover, in a cloud environment VM instances are readily available, so easy horizontal scaling is possible by adding more VM instances running the VMCU process on demand.
The invention provides a method for integrating audio visual data received from an endpoint in a virtual multi conferencing environment to achieve a composite stream. The method is optimized for operating on a large number of endpoints in a virtual environment, eliminating the dependency on specific hardware and significantly reducing CPU usage.
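The capacity figures above reduce to simple budget arithmetic, sketched below with the quoted percentages (illustrative only; real costs vary with codec, resolution and VM type):

```python
def max_connections(cpu_budget: float = 100.0, audio_pct: float = 4.0,
                    video_pct: float = 40.0, transcode_video: bool = True) -> int:
    """Connections one VM-hosted process can serve, given per-connection
    CPU costs for audio decode/encode and, optionally, video decode/encode."""
    per_connection = audio_pct + (video_pct if transcode_video else 0.0)
    return int(cpu_budget // per_connection)

print(max_connections(transcode_video=True))   # -> 2   (2-3 with lighter video)
print(max_connections(transcode_video=False))  # -> 25  (re-timestamping only)
```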
The foregoing description of the invention has been set forth merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.

WE CLAIM:
1. A method for optimizing audio visual data received from at least one endpoint in a virtual multi-conferencing environment, the method comprising the steps of:
identifying at least one input stream transmitted by at least one end point in the multi conferencing arrangement;
splitting the input stream into at least one audio stream and/or at least one video stream;
processing the audio stream to obtain a mixed audio stream;
designating at least one video stream as active video stream; and
collating the mixed audio stream with the active video to form a composite stream; wherein the composite stream is transmitted to a plurality of endpoints across the virtual multi-conferencing arrangement.
2. The method according to claim 1, wherein the active video stream is a non encoded video stream.
3. The method according to claim 1, wherein the step of collation includes adding a timestamp to the active video stream.
4. The method according to claim 1, further wherein the step of collation includes appending the timestamped active video stream to the mixed audio stream to obtain a composite stream.
5. A virtual multi conferencing system for receiving audio visual data from at least one endpoint and transmitting an optimized and time synchronized composite audio and/or video stream to a plurality of endpoints, characterized in that at least one stream processor located in the virtual multi conferencing system generates the optimized composite data.
6. The stream processor according to claim 5, wherein the stream processor comprises:
a congestion handler configured for streamlining the received audio visual data;
a splitter operably coupled to the congestion handler and configured for splitting the received audio visual data into at least one audio stream and/or at least one video stream;
an audio mixer coupled to the splitter and configured for receiving and mixing the split audio stream;
a video analyser coupled to the splitter and configured for detecting the split video stream to designate at least one active video stream; and
a stream compositor configured for receiving the mixed audio stream from the audio mixer and the active video stream designated by the video analyser; wherein the mixed audio stream is collated with the active video stream to form a composite stream.
7. The stream processor according to claim 6, wherein the origin of active video stream is distinct from the origin of the mixed audio.
8. The system according to claim 5, wherein the endpoint is at least one user participating in the multi conferencing.
9. The system according to claim 5, wherein the stream processor is located in a cloud environment.

10. A method and a system for integrating audio visual data in a virtual multi conferencing environment as described in the specification and as illustrated in the accompanying drawings.

Documents

Name Date
996-MUM-2011-Retyped Pages under Rule 14(1) (MANDATORY) [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-OTHERS [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-FER_SER_REPLY [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-DRAWING [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-COMPLETE SPECIFICATION [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-CLAIMS [01-08-2018(online)].pdf 2018-08-01
996-MUM-2011-2. Marked Copy under Rule 14(2) (MANDATORY) [01-08-2018(online)].pdf 2018-08-01
ABSTRACT1.jpg 2018-08-11
996-mum-2011-form 5(30-3-2011).pdf 2018-08-11
996-MUM-2011-FORM 5(12-4-2011).pdf 2018-08-11
996-mum-2011-form 3(30-3-2011).pdf 2018-08-11
996-MUM-2011-FORM 26(12-4-2011).pdf 2018-08-11
996-mum-2011-form 2(title page)-(30-3-2011).pdf 2018-08-11
996-MUM-2011-FORM 18(15-5-2012).pdf 2018-08-11
996-MUM-2011-FORM 1(12-4-2011).pdf 2018-08-11
996-mum-2011-form 1(30-3-2011).pdf 2018-08-11
996-mum-2011-form 2(30-3-2011).pdf 2018-08-11
996-MUM-2011-FER.pdf 2018-08-11
996-mum-2011-drawing(30-3-2011).pdf 2018-08-11
996-mum-2011-correspondence(30-3-2011).pdf 2018-08-11
996-MUM-2011-CORRESPONDENCE(15-5-2012).pdf 2018-08-11
996-mum-2011-description(complete)-(30-3-2011).pdf 2018-08-11
996-MUM-2011-CORRESPONDENCE(12-4-2011).pdf 2018-08-11
996-mum-2011-abstract(30-3-2011).pdf 2018-08-11
996-mum-2011-claims(30-3-2011).pdf 2018-08-11
996-MUM-2011-IntimationOfGrant01-07-2019.pdf 2019-07-01
996-MUM-2011-PatentCertificate01-07-2019.pdf 2019-07-01
996-MUM-2011-RELEVANT DOCUMENTS [30-03-2020(online)].pdf 2020-03-30
