BACKGROUND
[0001] Multimedia conference calls typically involve communicating voice, video, and/or data information between multiple endpoints. With the proliferation of data networks, multimedia conferencing is migrating from traditional circuit-switched networks to packet networks. To establish a multimedia conference call over a packet network, a conferencing server typically operates to coordinate and manage the conference call. The conferencing server receives a video stream from a sending participant and multicasts the video stream to other participants in the conference call.
[0002] One problem associated with communicating multimedia information such as digital video for a video conference call is that digital video (sometimes combined with embedded digital audio) often consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15, 30 or even 60 frames per second (frame/s). Each frame can include hundreds of thousands of pixels. Each pixel or pel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits, for example. Thus a bit rate or number of bits per second of a typical raw digital video sequence can be on the order of 5 million bits per second (bit/s) or more. Most media processing devices and communication networks lack the resources to process raw digital video. For this reason, media communication systems use source compression (also called coding or encoding) to reduce the bit rate of digital video. Decompression (or decoding) reverses compression.
[0003] Typically there are design tradeoffs in selecting a particular type of video compression for a given processing device and/or communication network. For example, compression can be lossless, where the quality of the video remains high at the cost of a higher bit rate, or lossy, where the quality of the video suffers but decreases in bit rate are more dramatic. Most system designs make some compromises between quality and bit rate based on a given set of design constraints and performance requirements. Consequently, a given video compression technique is typically not suitable for different types of media processing devices and/or communication networks. This may be particularly problematic when one or more receiving devices utilize multiple display frames, windows or other objects to display video information for different participants in a multimedia conference call. This is further exacerbated when different participants appear in different display windows to accommodate different sets of speakers.
SUMMARY
[0004] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0005] Various embodiments are generally directed to digital encoding, decoding and processing of digital media content, such as video, images, pictures, and so forth. In some embodiments, the digital encoding, decoding and processing of digital media content may be based on the Society of Motion Picture and Television Engineers (SMPTE) standard 421M ("VC-1") video codec series of standards and variants. More particularly, some embodiments are directed to multiple resolution video encoding and decoding techniques and how such techniques are enabled in the VC-1 bitstream without breaking backward compatibility. In one embodiment, for example, an apparatus may include a video encoder arranged to compress or encode digital video information into an augmented SMPTE VC-1 video stream or bitstream. The video encoder may encode the digital video information in the form of multiple layers, such as a base layer and one or more spatial and/or temporal enhancement layers. The base layer may offer a defined minimum degree of spatial resolution and a base level of temporal resolution. One or more enhancement layers may include encoded video information that may be used to increase the base level of spatial resolution and/or the base level of temporal resolution for the video information encoded into the base layer.
[0006] In various embodiments, a video decoder may selectively decode video information from the base layer and one or more enhancement layers to playback or reproduce the video information at a desired level of quality. Likewise, an Audio Video Multipoint Control Unit (AVMCU) may select to forward video information from the base layer and one or more enhancement layers to a conference participant based on information such as the network bandwidth currently available and the receiver's decoding capability.
[0007] In some embodiments in particular, a video decoder may be arranged to selectively decode video information from the base layer and one or more enhancement layers of the video stream in order to playback or reproduce the video information at varying levels of video resolution and quality for a visual composition used for a conference call. A visual composition typically includes multiple display objects, such as display windows, each displaying video information for different participants in a conference call. In one embodiment, for example, a client such as a receiving client terminal may include a processing system, memory and a display. The processing system may include a processor arranged to allocate a display object bit rate for multiple display objects for a visual composition for a conference call. The allocations may be made where a total display object bit rate for all display objects is equal to or less than a total input bit rate for a client, such as a client terminal. The client terminal may then send a subscription request for different video layers, each with different levels of spatial resolution, temporal resolution and quality, for two or more display objects based on the allocations to the conferencing server or sending client terminals. In this manner, the client terminal may make efficient use of its input bit rate budget. Alternatively, the client terminal may receive scalable video streams, and the video decoder may decode the appropriate video information from the different video layers of the video streams. In this manner, the client terminal may make efficient use of its computational resources. A rendering module (e.g., a display chipset) may be arranged to render said decoded video information in each display frame to create a visual composition for a conference call on said display.
[0008] Various embodiments may also be directed to adaptive scheduling techniques for a conferencing server or AVMCU.
An apparatus may include a receiver arranged to receive encoded video information in multiple video streams, each having different video layers including a base layer having a first level of spatial resolution and a first level of temporal resolution, and an enhancement layer increasing the first level of spatial resolution or the first level of temporal resolution. The apparatus may further include an adaptive scheduling module coupled to the receiver. The adaptive scheduling module may be arranged to transmit the different video layers at different times to a receiving client terminal in response to changes in a dominant or active speaker in a conference call. Other embodiments are described and claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 illustrates an embodiment for a multimedia conference system.
[0010] FIG. 2 illustrates an embodiment for a visual composition.
[0011] FIG. 3 illustrates an embodiment for a computing environment.
[0012] FIG. 4 illustrates an embodiment for a client terminal.
[0013] FIG. 5 illustrates an embodiment for a logic flow.
[0014] FIG. 6 illustrates an embodiment for a video capture and playback system.
[0015] FIG. 7 illustrates an embodiment for a general video encoder system.
[0016] FIG. 8 illustrates an embodiment for a general video decoder system.
[0017] FIG. 9 illustrates an embodiment for a video layer hierarchy.
[0018] FIG. 10 illustrates a first diagram for an adaptive scheduling technique.
[0019] FIG. 11 illustrates a second diagram for an adaptive scheduling technique.
[0020] FIG. 12 illustrates a third diagram for an adaptive scheduling technique.
DETAILED DESCRIPTION
[0021] Various embodiments may be directed to managing visual compositions for a multimedia conference call. In a multiparty video conference, each receiving client terminal receives a video stream from each of the other client terminals participating in a conference call, as well as emitting a video stream of its own. A receiving client terminal may arrange the multiple video streams from the other receiving client terminals on a display screen in the form of a visual composition. A visual composition renders or displays video streams from all or a subset of the participants as a mosaic on a display device for a given client terminal. For example, a visual composition may have a top display object to display video information for a current active speaker, and a panoramic view of the other participants may be displayed by a smaller set of display objects positioned beneath the top display object.
[0022] Each visual composition has different communication requirements. For example, smaller display objects and picture-in-picture displays may have lower spatial resolution requirements than larger display objects. Similarly, video information for the less active participants may have lower temporal resolution requirements than the videos of the more active participants. Lower spatial and/or temporal resolutions generally have lower bit rates for a given picture quality, often measured in terms of signal-to-noise ratio (SNR) or other metric. For a given spatio-temporal resolution, lower picture quality typically has a lower bit rate, while higher picture quality typically has a higher bit rate. Some visual compositions may have lower picture quality requirements for some or all of the participants.
[0023] Each client terminal typically has an overall input bit rate budget, or constraint, as well as an overall output bit rate budget. Consequently, one design goal is to efficiently utilize the input bit rate and output bit rate budgets.
Accordingly, various embodiments may implement a scalable video representation to improve utilization and efficiency for the overall input bit rate budget and/or output bit rate budget for a given client terminal to render or display a visual composition for a multimedia conference call. The availability of multiple spatial resolutions, temporal resolutions, and quality levels for each video stream allows a client terminal to make efficient use of its input bit rate budget for any given composition, by selectively receiving and/or decoding only the video information needed for the visual composition.
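The budget-constrained layer selection described above can be sketched in code. This is a minimal illustrative sketch only: the layer names, the kbit/s figures, and the greedy highest-priority-first policy are assumptions for illustration, not values or algorithms specified by this document.

```python
# Hypothetical sketch: choose a video layer for each display object so that
# the total subscribed bit rate stays within the client's input bit rate
# budget. Layer names and bit rates are illustrative assumptions.

# Available layers per stream, ordered from the base layer upward.
LAYERS = [
    ("base", 64),                    # minimum spatial/temporal resolution
    ("base+spatial", 192),           # adds a spatial enhancement layer
    ("base+spatial+temporal", 384),  # adds a temporal enhancement layer too
]

def allocate_layers(object_priorities, total_budget_kbps):
    """Greedily upgrade the highest-priority display objects first.

    object_priorities: one priority per display object (higher = more
    important, e.g. the active-speaker window). Returns one chosen layer
    name per display object, keeping the total within the budget.
    """
    # Start every display object at the base layer.
    choices = [0] * len(object_priorities)
    spent = sum(LAYERS[0][1] for _ in choices)
    if spent > total_budget_kbps:
        raise ValueError("budget cannot cover base layers for all objects")
    # Visit objects from most to least important; upgrade while budget allows.
    order = sorted(range(len(choices)),
                   key=lambda i: object_priorities[i], reverse=True)
    for i in order:
        while choices[i] + 1 < len(LAYERS):
            extra = LAYERS[choices[i] + 1][1] - LAYERS[choices[i]][1]
            if spent + extra > total_budget_kbps:
                break
            choices[i] += 1
            spent += extra
    return [LAYERS[c][0] for c in choices]

# Active speaker (priority 2) plus three panoramic thumbnails (priority 1),
# with a 700 kbit/s input budget: only the speaker window gets upgraded.
print(allocate_layers([2, 1, 1, 1], 700))
# → ['base+spatial+temporal', 'base', 'base', 'base']
```

A real client would also react to changing conditions (active speaker, window size, channel capacity) by re-running the allocation and re-subscribing.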
[0024] In various embodiments, a visual composition module may be implemented at a client terminal, a conferencing server, or any other device used in a conference call. The visual composition module may receive all or a portion of the scalable video stream and perform scaled decoding and visual composition display operations accordingly. In various embodiments, the visual composition module may receive a total input bit rate for multiple display objects of a given client terminal. Once the visual composition module receives the total input bit rate budget for the client terminal, the visual composition module may dynamically allocate a display object bit rate to each display object used for a visual composition at the client terminal. The visual composition module may allocate a display object bit rate to a given display object based on any number of factors as described below. In some embodiments, for example, the visual composition module may allocate display object bit rates based on a display object size, a display object location, and an instantaneous channel capacity for a given communications link or media channel.
[0025] During allocation operations, the visual composition module limits display object bit rate allocations to a total display object bit rate for all display objects that is equal to or less than the total input bit rate for the client terminal. The visual composition module may dynamically vary display object bit rate allocations based on changing conditions, such as changes in active speaker, changes in display object size, changes in an amount of motion for video information in a given display object, changes in status (paused video or streaming video), and so forth.
The visual composition module may output the display object bit rate allocations to a scalable video decoder capable of decoding scaled video information from the scaled video encoder. The scalable video decoder may receive the display object bit rate allocations from the visual composition module, and initiate scalable decoding operations to decode video information from the different video layers for each display object in accordance with its display object bit rate allocation. For a given set of video information and display object, the scalable video decoder may decode varying levels of spatial resolution, temporal resolution and quality. Alternatively, the visual composition module may
send a subscription message to the conferencing server requesting different video layers with the desired level of resolution and quality for each display object in the visual composition. In this manner, a scalable video encoder/decoder and/or visual composition module may improve efficient use of input bit rates for a given client terminal when rendering a visual composition with multiple display objects corresponding to multiple participants in a multimedia conference call.
[0026] Various embodiments may also be directed to adaptive scheduling techniques for a multimedia conference call. When a new dominant or active speaker starts talking from a given sending client terminal, a conferencing server or AVMCU may send a key frame request for a new video key frame so any receiving client terminals can start rendering a display object with the new dominant speaker. A key frame, however, is relatively large and therefore takes a greater amount of time to transmit relative to other video frames. As a result, video latency is higher and it takes several seconds before the participant can see the new dominant speaker.
[0027] Various embodiments may solve these and other problems using an adaptive scheduling module. The adaptive scheduling module may be arranged to allow adaptive scheduling of the transmission of the video layers in time on behalf of another device, such as a sending client terminal. As a result, response time may be improved when a dominant or active speaker starts talking and sending his/her video. The lower video layers are transmitted first and additional layers are gradually transmitted to improve the video quality over time. In this manner, a visual composition may be rendered which smoothly transitions from one spatial or temporal resolution to a finer one when a new dominant speaker begins speaking, thereby activating a switch in display objects to show video information for the new dominant speaker.
Adaptively scheduling the transmission of video layers may reduce flicker, blanking, and other side effects introduced by the transition between dominant speakers and corresponding display objects in the visual composition.
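The layer staggering just described can be sketched as a simple schedule: send the small base layer immediately on a speaker change so video appears quickly, then start each enhancement layer after a delay. The one-second stagger and the layer names here are illustrative assumptions, not values from this document.

```python
# Hypothetical sketch of adaptive layer scheduling: on a dominant-speaker
# change, the base layer is transmitted first and enhancement layers are
# staggered in time to refine quality gradually.

def schedule_layers(speaker_change_time, layers, stagger_s=1.0):
    """Return (transmit_time, layer) pairs, base layer first.

    layers: ordered list, lowest layer first, e.g.
            ["base", "temporal-enh", "spatial-enh"].
    stagger_s: assumed gap between successive layer start times.
    """
    return [(speaker_change_time + i * stagger_s, layer)
            for i, layer in enumerate(layers)]

# A speaker change at t=10s yields a gradual quality ramp-up.
for t, layer in schedule_layers(10.0, ["base", "temporal-enh", "spatial-enh"]):
    print(f"t={t:.1f}s  start sending {layer}")
```

In a real AVMCU the stagger would likely track measured channel capacity and receiver capability rather than a fixed constant.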
Multimedia Conferencing System
[0028] FIG. 1 illustrates a block diagram for a multimedia conferencing system 100. Multimedia conferencing system 100 may represent a general system architecture suitable for implementing various embodiments. Multimedia conferencing system 100 may comprise multiple elements. An element may comprise any physical or logical structure arranged to perform certain operations. Each element may be implemented as hardware, software, or any combination thereof, as desired for a given set of design parameters or performance constraints. Examples of hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), memory units, logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. Examples of software may include any software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, interfaces, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Although multimedia conferencing system 100 as shown in FIG. 1 has a limited number of elements in a certain topology, it may be appreciated that multimedia conferencing system 100 may include more or less elements in alternate topologies as desired for a given implementation. The embodiments are not limited in this context.
[0029] In various embodiments, multimedia conferencing system 100 may be arranged to communicate, manage or process different types of information, such as media information and control information. Examples of media information may generally include any data representing content meant for a user, such as voice information, video information, audio information, image information, textual information, numerical information, alphanumeric symbols, graphics, and so forth. Control information may refer to any data representing commands, instructions or control words meant for an automated system. For example, control information may be used to route media information through a system, to establish a connection between devices, instruct a device to process the media information in a predetermined manner, and so forth. It is noted that while some embodiments may be described specifically in the context of selectively removing video frames from video information to reduce video bit rates, various embodiments encompass the use of any type of desired media information, such as pictures, images, data, voice, music or any combination thereof.
[0030] In various embodiments, multimedia conferencing system 100 may include a conferencing server 102. Conferencing server 102 may comprise any logical or physical entity that is arranged to manage or control a multimedia conference call between client terminals 106-1-m, where m represents the number of terminals in the conference. In various embodiments, conferencing server 102 may comprise, or be implemented as, a processing or computing device, such as a computer, a server, a router, a switch, a bridge, and so forth. A specific implementation for conferencing server 102 may vary depending upon a set of communication protocols or standards to be used for conferencing server 102. In one example, conferencing server 102 may be implemented in accordance with the International Telecommunication Union (ITU) H.323 series of standards and/or variants. The H.323 standard defines a multipoint control unit (MCU) to coordinate conference call operations. In particular, the MCU includes a multipoint controller (MC) that handles H.245 signaling, and one or more multipoint processors (MP) to mix and process the data streams. In another example, conferencing server 102 may be implemented in accordance with the Internet Engineering Task Force (IETF) Multiparty Multimedia Session Control (MMUSIC) Working Group Session Initiation Protocol (SIP) series of standards and/or variants. SIP is a proposed standard for initiating, modifying, and terminating an interactive user session that involves multimedia elements such as video, voice, instant messaging, online games, and virtual reality. Both the H.323 and SIP standards are essentially signaling protocols for Voice over Internet Protocol (VoIP) or Voice Over Packet (VOP) multimedia conference call operations. It may be appreciated that other signaling protocols may be implemented for conferencing server 102, however, and still fall within the scope of the embodiments. The embodiments are not limited in this context.
[0031] In various embodiments, multimedia conferencing system 100 may include one or more client terminals 106-1-m to connect to conferencing server 102 over one or more communications links 108-1-n, where m and n represent positive integers that do not necessarily need to match.
For example, a client application may host several client terminals each representing a separate conference at the same time. Similarly, a client application may receive multiple media streams. For example, video streams from all or a subset of the participants may be displayed as a mosaic on the participant's display, with a top window with video for the current active speaker, and a panoramic view of the other participants in other windows. Client terminals 106-1-m may comprise any logical or physical entity that is arranged to participate or engage in a multimedia conference call managed by conferencing server 102. Client terminals 106-1-m may be implemented as any device that includes, in its most basic form, a processing system including a processor and memory (e.g., memory units 110-1-p), one or more multimedia input/output (I/O) components, and a wireless and/or wired network connection. Examples of multimedia I/O components may include audio I/O components (e.g., microphones, speakers), video I/O components (e.g., video camera, display), tactile (I/O) components (e.g., vibrators), user data (I/O) components (e.g., keyboard, thumb board, keypad, touch screen), and so forth. Examples of client terminals 106-1-m may include a telephone, a VoIP or VOP telephone, a packet telephone designed to operate on a Packet Switched Telephone Network (PSTN), an Internet telephone, a video telephone, a cellular telephone, a personal digital assistant (PDA), a combination cellular telephone and PDA, a mobile computing device, a smart phone, a one-way pager, a two-way pager, a messaging device, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a handheld computer, a network appliance, and so forth. The embodiments are not limited in this context.
[0032] Depending on a mode of operation, client terminals 106-1-m may be referred to as sending client terminals or receiving client terminals. For example, a given client terminal 106-1-m may be referred to as a sending client terminal when operating to send a video stream to conferencing server 102. In another example, a given client terminal 106-1-m may be referred to as a receiving client terminal when operating to receive a video stream from conferencing server 102, such as a video stream from a sending client terminal, for example. In the various embodiments described below, client terminal 106-1 is described as a sending client terminal, while client terminals 106-2-m are described as receiving client terminals, by way of example only. Any of client terminals 106-1-m may operate as a sending or receiving client terminal throughout the course of a conference call, and frequently shift between modes at various points in the conference call. The embodiments are not limited in this respect.
[0033] In various embodiments, multimedia conferencing system 100 may comprise, or form part of, a wired communications system, a wireless communications system, or a combination of both. For example, multimedia conferencing system 100 may include one or more elements arranged to communicate information over one or more types of wired communications links. Examples of a wired communications link may include, without limitation, a wire, cable, bus, printed circuit board (PCB), Ethernet connection, peer-to-peer (P2P) connection, backplane, switch fabric, semiconductor material, twisted-pair wire, co-axial cable, fiber-optic connection, and so forth. Multimedia conferencing system 100 also may include one or more elements arranged to communicate information over one or more types of wireless communications links. Examples of a wireless communications link may include, without limitation, a radio channel, infrared channel, radio-frequency (RF) channel, Wireless Fidelity (WiFi) channel, a portion of the RF spectrum, and/or one or more licensed or license-free frequency bands.
[0034] Multimedia conferencing system 100 also may be arranged to operate in accordance with various standards and/or protocols for media processing. Examples of media processing standards include, without limitation, the Society of Motion Picture and Television Engineers (SMPTE) 421M ("VC-1") series of standards and variants, VC-1 implemented as the MICROSOFT(R) WINDOWS(R) MEDIA VIDEO version 9 (WMV-9) series of standards and variants, the Digital Video Broadcasting Terrestrial (DVB-T) broadcasting standard, the ITU/IEC H.263 standard, Video Coding for Low Bit rate Communication, ITU-T Recommendation H.263v3, published November 2000, and/or the ITU/IEC H.264 standard, Video Coding for Very Low Bit rate Communication, ITU-T Recommendation H.264, published May 2003, Motion Picture Experts Group (MPEG) standards (e.g., MPEG-1, MPEG-2, MPEG-4), and/or High performance radio Local Area Network (HiperLAN) standards. Examples of media processing protocols include, without limitation, Session Description Protocol (SDP), Real Time Streaming Protocol (RTSP), Real-time Transport Protocol (RTP), Synchronized Multimedia Integration Language (SMIL) protocol, and/or Internet Streaming Media Alliance (ISMA) protocol. The embodiments are not limited in this context.
[0035] In one embodiment, for example, conferencing server 102 and client terminals 106-1-m of multimedia conferencing system 100 may be implemented as part of an H.323 system operating in accordance with one or more of the H.323 series of standards and/or variants. H.323 is an ITU standard that provides specification for computers, equipment, and services for multimedia communication over networks that do not provide a guaranteed quality of service. H.323 computers and equipment can carry real-time video, audio, and data, or any combination of these elements. This standard is based on the IETF RTP and RTCP protocols, with additional protocols for call signaling, and data and audiovisual communications. H.323 defines how audio and video information is formatted and packaged for transmission over the network. Standard audio and video coders/decoders (codecs) encode and decode input/output from audio and video sources for communication between nodes. A codec converts audio or video signals between analog and digital forms. In addition, H.323 specifies T.120 services for data communications and conferencing within and next to an H.323 session. The T.120 support services mean that data handling can occur either in conjunction with H.323 audio and video, or separately, as desired for a given implementation.
[0036] In accordance with a typical H.323 system, conferencing server 102 may be implemented as an MCU coupled to an H.323 gateway, an H.323 gatekeeper, one or more H.323 terminals 106-1-m, and a plurality of other devices such as personal computers, servers and other network devices (e.g., over a local area network). The H.323 devices may be implemented in compliance with the H.323 series of standards or variants. H.323 client terminals 106-1-m are each considered "endpoints" as may be further discussed below. The H.323 endpoints support H.245 control signaling for negotiation of media channel usage, Q.931 (H.225.0) for call signaling and call setup, H.225.0 Registration, Admission, and Status (RAS), and RTP/RTCP for sequencing audio and video packets. The H.323 endpoints may further implement various audio and video codecs, T.120 data conferencing protocols and certain MCU capabilities. Although some embodiments may be described in the context of an H.323 system by way of example only, it may be appreciated that multimedia conferencing system 100 may also be implemented in accordance with one or more of the IETF SIP series of standards and/or variants, as well as other multimedia signaling standards, and still fall within the scope of the embodiments. The embodiments are not limited in this context.
[0037] In general operation, multimedia conference system 100 may be used for multimedia conference calls. Multimedia conference calls typically involve communicating voice, video, and/or data information between multiple endpoints. For example, a public or private packet network may be used for audio conferencing calls, video conferencing calls, audio/video conferencing calls, collaborative document sharing and editing, and so forth. The packet network may also be connected to the PSTN via one or more suitable VoIP gateways arranged to convert between circuit-switched information and packet information. To establish a multimedia conference call over a packet network, each client terminal 106-1-m may connect to conferencing server 102 using various types of wired or wireless communications links 108-1-n operating at varying connection speeds or bandwidths, such as a lower bandwidth PSTN telephone connection, a medium bandwidth DSL modem connection or cable modem connection, and a higher bandwidth intranet connection over a local area network (LAN), for example.
[0038] In a multiparty video conference, each receiving client terminal 106-2-m receives a video stream from each of the other client terminals participating in a conference call, as well as emitting a video stream of its own. A receiving client terminal 106-2-m may arrange the multiple video streams from the other receiving client terminals 106-2-m on a display screen in the form of a visual composition. This may be accomplished, for example, using a visual composition module 110-1-p implemented as part of client terminals 106-1-m, respectively. A representative example of a visual composition may be described with reference to FIG. 2.
[0039] FIG. 2 illustrates an exemplary embodiment of a visual composition. FIG. 2 illustrates a visual composition 200 having video streams from all or a subset of the participants displayed as a mosaic on a display for a given client terminal 106-1-m. As shown in FIG. 2, a top display object 202 may comprise a display window arranged to display video information for a current active speaker, and a panoramic view of the other participants may be displayed by a smaller set of display objects 204-1-v positioned beneath the top display object. As the active speaker changes to one of the participants displayed in one of the smaller set of display objects 204-1-v, the video information for the active speaker from one of the smaller set of display objects 204-1-v may be displayed in the top display object 202, and vice-versa. It may be appreciated that visual composition 200 is only one example of a visual composition, and other visual compositions may be used with a different number of display objects and different sizes of display objects as desired for a given implementation or particular conference call. For example, the display objects 202, 204 may be implemented as "head-and-shoulder" cutouts (e.g., with or without any background), transparent objects that can overlay other objects, rectangular regions in perspective, and so forth. The embodiments are not limited in this context.
[0040] As shown in FIG. 2, visual composition 200 may include a main window containing the currently active speaker, multiple smaller windows of the other participants, and perhaps other elements such as a small picture-in-picture or semi-transparent overlay of a recently active speaker within the main window. Furthermore, the visual composition may be dynamic. Since the active speaker may change, other participants in the conference call may rotate through the main window and picture-in-picture. In some cases, not all participants may be visible all the time. The set of visible participants may change in time.
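The speaker-change behavior for the composition of FIG. 2 amounts to a swap between the large top window and the thumbnail showing the new speaker. A minimal sketch, with hypothetical participant names and a swap policy assumed for illustration:

```python
# Illustrative sketch only: when the active speaker changes, swap the stream
# shown in the large top display object with the thumbnail that was showing
# the new speaker, so the old speaker drops into the vacated thumbnail.

def swap_active_speaker(top, thumbnails, new_speaker):
    """top: participant currently in the large window.
    thumbnails: participants in the smaller panoramic windows.
    Returns the updated (top, thumbnails) after the speaker change."""
    if new_speaker == top or new_speaker not in thumbnails:
        return top, thumbnails          # nothing to do
    idx = thumbnails.index(new_speaker)
    thumbnails = list(thumbnails)       # avoid mutating the caller's list
    thumbnails[idx] = top               # old speaker takes the thumbnail
    return new_speaker, thumbnails      # new speaker takes the top window

print(swap_active_speaker("Alice", ["Bob", "Carol", "Dave"], "Carol"))
# → ('Carol', ['Bob', 'Alice', 'Dave'])
```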
[0041] In some embodiments, a visual composition may involve more than one conference. A participant may desire to have each conference call arranged appropriately, according to their relationships to each other and their relative importance. In principle, these conferences could be completely independent of each other. In some cases, however, they would be sub-conferences of a main conference. For example, a secondary conference may be a side chat with another participant in the primary conference.
[0042] Each client terminal 106-1-m may choose to construct its own, unique visual composition. Typically there is special treatment for rendering video information for a user of a client terminal 106-1-m as displayed on the client terminal 106-1-m, such as leaving it out of the composition entirely or putting it in a special location.
[0043] Each visual composition has different communication requirements. For example, the smaller display objects and picture-in-picture displays may have lower spatial resolution requirements than the larger display objects. Similarly, video information for the less active participants may have lower temporal resolution requirements than the videos of the more active participants.

[0044] Spatial resolution may refer generally to a measure of accuracy with respect to the details of the space being measured. In the context of digital video, spatial resolution may be measured or expressed as a number of pixels in a frame, picture or image. For example, a digital image size of 640 x 480 pixels equals 307,200 individual pixels. In general, images having higher spatial resolution are composed with a greater number of pixels than those of lower spatial resolution. Spatial resolution may affect, among other things, image quality for a video frame, picture, or image.
[0045] Temporal resolution may generally refer to the accuracy of a particular measurement with respect to time. In the context of digital video, temporal resolution may be measured or expressed as a frame rate, or a number of frames of video information captured per second, such as 15 frame/s, 30 frame/s, 60 frame/s, and so forth. In general, a higher temporal resolution refers to a greater number of frames/s than those of lower temporal resolution. Temporal resolution may affect, among other things, motion rendition for a sequence of video images or frames. A video stream or bitstream may refer to a continuous sequence of segments (e.g., bits or bytes) representing audio and/or video information.
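The spatial and temporal resolution figures above combine into a raw bit rate by straightforward arithmetic: pixels per frame, times bits per pixel, times frames per second. A minimal worked example, using the 24 bits/pixel figure from the BACKGROUND section (all values here are illustrative):

```python
# Raw (uncompressed) video bit rate from spatial and temporal resolution:
# pixels/frame x bits/pixel x frames/second. Values are illustrative only.

def raw_bit_rate(width, height, bits_per_pixel, frame_rate):
    """Bits per second for uncompressed video at the given resolutions."""
    return width * height * bits_per_pixel * frame_rate

# 640 x 480 spatial resolution = 307,200 pixels per frame.
assert 640 * 480 == 307_200

# Doubling the temporal resolution doubles the raw bit rate.
for fps in (15, 30, 60):
    print(f"{fps} frame/s -> {raw_bit_rate(640, 480, 24, fps):,} bit/s")
```

At 30 frame/s this comes to over 200 million bit/s, which makes concrete why the compression and scalability techniques described here matter.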
[0046] Lower spatial and/or temporal resolutions generally have lower bit rates for a given picture quality, often measured in terms of a signal-to-noise ratio (SNR) or other metric. For a given spatio-temporal resolution, lower picture quality typically has a lower bit rate, while higher picture quality typically has a higher bit rate. Some visual compositions may have lower picture quality requirements for some or all of the participants.
[0047] Each client terminal 106-1-m typically has an overall input bit rate budget, or constraint, as well as an overall output bit rate budget. Consequently, one design goal is to efficiently utilize the input bit rate and output bit rate budgets. Accordingly, various embodiments may implement a scalable video encoder 104 to improve utilization and efficiency for the overall input bit rate budget and/or output bit rate budget for a given client terminal 106-1-m to render or display a visual composition for a multimedia conference call. The availability of multiple spatial resolutions, temporal resolutions, and quality levels for each video stream allows a client terminal 106-1-m to make efficient use of its input bit rate budget for any given composition, by selectively receiving and/or decoding only the video information needed for the visual composition.
[0048] In various embodiments, scalable video encoder 104 may be implemented to operate using one or more scalable coding and decoding techniques, sometimes referred to as embedded coding or layered coding. Scalable coding is an efficient way for a transmitter such as conferencing server 102 to produce multiple spatial resolutions, temporal resolutions, and quality levels, and to send these multiple levels while making efficient use of the overall output bit rate budget. In contrast, traditionally, multiple versions of the same video are produced as independent encodings, which are all sent in parallel, a technique sometimes referred to as "simulcast." Simulcast techniques typically make inefficient use of the overall input bit rate and/or output bit rate budgets. Multimedia conferencing system 100 in general, and scalable video encoders 104 and visual composition modules 110-1-p in particular, may be described with reference to FIG. 3.
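The budget difference between simulcast and layered coding can be illustrated with a small numeric sketch. The per-level rates and the 15% layering overhead below are invented numbers for illustration, not figures from the embodiments:

```python
# Hedged sketch of why layered (scalable) coding can use an output bit rate
# budget more efficiently than simulcast. Rates and overhead are hypothetical.

def simulcast_total(rates_kbps):
    # Simulcast: every resolution level is an independent encoding,
    # and all encodings are sent in parallel.
    return sum(rates_kbps.values())

def layered_total(rates_kbps, overhead=0.15):
    # Layered coding: the levels are nested, so the whole layer stack costs
    # roughly the highest level's rate plus an assumed layering overhead.
    return max(rates_kbps.values()) * (1 + overhead)

rates = {"low": 100, "medium": 400, "high": 1500}  # kbit/s, hypothetical
print(simulcast_total(rates), layered_total(rates))  # 2000 vs. 1725.0
```

With these assumed numbers, offering three levels costs 2000 kbit/s by simulcast but only about 1725 kbit/s with nested layers, which is the efficiency argument the paragraph above makes.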
[0049] FIG. 3 illustrates a block diagram of computing environment 300. Computing environment 300 may be implemented as a device, or part of a device, such as conferencing server 102 and/or client terminals 106-1-m. In some embodiments, computing environment 300 may be implemented to execute software 310. Examples of software 310 may include scalable video encoder 104 and/or visual composition modules 110-1-p. For example, when computing environment 300 is implemented as part of conferencing server 102, software programs 310 may include scalable video encoder 104 and/or visual composition modules 110-1-p and accompanying data. In another example, when computing environment 300 is implemented as part of a client terminal 106-1-m, software programs 310 may include scalable video encoder 104 and/or visual composition modules 110-1-p and accompanying data. In yet another example, when computing environment 300 is implemented as part of conferencing server 102 and/or client terminal 106-1-m, software programs 310 may include an operating system or other system software typically implemented for an electrical, electronic, and/or electro-mechanical device. Although some embodiments may be described with operations for scalable video encoder 104 and/or visual composition modules 110-1-p implemented as software stored and executed by computing environment 300, it may be appreciated that the operations for software modules 104, 110 may be implemented using dedicated hardware, software or any combination thereof. The embodiments are not limited in this context.
[0050] In its most basic configuration, computing environment 300 typically includes a processing system 308 that comprises at least one processing unit 302 and memory 304. Processing unit 302 may be any type of processor capable of executing software, such as a general-purpose processor, a dedicated processor, a media processor, a controller, a microcontroller, an embedded processor, a digital signal processor (DSP), and so forth. Memory 304 may be implemented using any machine-readable or computer-readable media capable of storing data, including both volatile and non-volatile memory. For example, memory 304 may include read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, or any other type of media suitable for storing information.

[0051] As shown in FIG. 3, memory 304 may store various software programs 310, such as scalable video encoder 104, visual composition module 110, and accompanying data. In some cases, such as for scalable video encoder 104, software programs 310 may have to be duplicated in the memory if it is designed to handle one media stream at a time. Likewise, processor 302 and scalable video encoder 104 may be duplicated several times if the host system is a multi-core microprocessor-based computing platform. Memory 304 may also store other software programs to implement different aspects of conferencing server 102, such as various types of operating system software, application programs, video codecs, audio codecs, call control software, gatekeeper software, multipoint controllers, multipoint processors, and so forth. Alternatively such operations may be implemented in the form of dedicated hardware (e.g., DSP, ASIC, FPGA, and so forth) or a combination of hardware, firmware and/or software as desired for a given implementation. The embodiments are not limited in this context.
[0052] Computing environment 300 may also have additional features and/or functionality beyond configuration 308. For example, computing environment 300 may include storage 312, which may comprise various types of removable or non-removable storage units. Storage 312 may be implemented using any of the various types of machine-readable or computer-readable media as previously described. Computing environment 300 may also have one or more input devices 314 such as a keyboard, mouse, pen, voice input device, touch input device, and so forth. One or more output devices 316 such as a display device, speakers, printer, and so forth may also be included in computing environment 300 as well.
[0053] Computing environment 300 may further include one or more communications connections 318 that allow computing environment 300 to communicate with other devices via communication links 108-1-n. Communications connections 318 may include various types of standard communication elements, such as one or more communications interfaces, network interfaces, network interface cards (NIC), radios, wireless transmitters/receivers (transceivers), wired and/or wireless communication media, physical connectors, and so forth. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes both wired communications media and wireless communications media, as previously described. The terms machine-readable media and computer-readable media as used herein are meant to include both storage media and communications media.

[0054] In various embodiments, computing environment 300 may be implemented as some or all of client terminals 106-1-m. In particular, computing environment 300 may be implemented with software programs 310 to include one or more visual composition modules 110-1-p. In a multiparty video conference, each
receiving client terminal 106-1-m receives a video stream from each of the other client terminals participating in a conference call, as well as emitting a video stream of its own. For a given client terminal 106-1-m, visual composition modules 110-1-p may arrange the multiple video streams from the other client terminals 106-1-m in a visual composition on a display screen, such as visual composition 200, for example.

[0055] In various embodiments, each visual composition has different communication requirements. For example, the smaller display objects and picture-in-picture displays may have lower spatial resolution requirements than the larger display objects. Similarly, video information for the less active participants may have lower temporal resolution requirements than the videos of the more active participants. Lower spatial and/or temporal resolutions generally have lower bit rates for a given picture quality, often measured in terms of an SNR or other metric. For a given spatio-temporal resolution, lower picture quality typically has a lower bit rate, while higher picture quality typically has a higher bit rate. Some visual compositions may have lower picture quality requirements for some or all of the participants.

[0056] Each client terminal 106-1-m typically has an overall input bit rate budget, or constraint, as well as an overall output bit rate budget. Consequently, one design goal is to efficiently utilize the input bit rate and output bit rate budgets. Various embodiments may implement a scalable video encoder 104 at conferencing server 102 or various client terminals 106-1-m to improve utilization and efficiency for the overall input bit rate budget and/or output bit rate budget for a given client terminal 106-1-m to render or display a visual composition for a multimedia conference call.
The availability of multiple spatial resolutions, temporal resolutions, and quality levels for each video stream allows a client terminal 106-1-m to make efficient use of its input bit rate budget for any given visual composition, by selectively receiving and/or decoding only the video information needed for the various display objects within the visual composition. Client terminals 106-1-m and corresponding visual composition modules 110-1-p may be further described with reference to FIG. 4.

[0057] FIG. 4 illustrates an embodiment for a representative client terminal. FIG. 4 provides a more detailed block diagram of a client terminal representative of any one of client terminals 106-1-m. As shown in FIG. 4, a client terminal 106 may comprise a wired or wireless parser 404-1-p arranged to receive as input one or more encoded video streams 402-1-o. In one embodiment, for example, encoded video streams 402-1-o may be generated by scalable video encoder 104 implemented as part of conferencing server 102. Encoded video streams 402-1-o may include video
information encoded with different video layers. Each video layer may have different levels of spatial resolution, temporal resolution and quality. Scalable video encoder 104 may multiplex the various video layers into encoded video streams 402-1-o, and transmit video streams 402-1-o over communication link 108 via one or more communication connections 318. Scalable video encoder 104 and encoded video streams 402-1-o may be described in more detail with reference to FIGS. 6-9 further below.
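The idea of multiplexing several video layers into one encoded stream can be sketched as tagging each encoded chunk with its layer so a parser can later route or drop layers. The framing below is invented for illustration and does not reflect any particular bitstream syntax from the embodiments:

```python
# Toy sketch of multiplexing encoded video layers into a single stream and
# parsing layers back out. The chunk framing is a hypothetical illustration.

def multiplex(layers):
    """layers: dict layer_id -> list of encoded chunks (bytes), all the same
    length. Returns an interleaved list of (layer_id, chunk) records."""
    stream = []
    # Round-robin interleave so every layer appears throughout the stream.
    for group in zip(*layers.values()):
        stream.extend(zip(layers.keys(), group))
    return stream

def demultiplex(stream, keep_layers):
    """Parser side: keep only the chunks for the layers a decoder needs."""
    return [(lid, chunk) for lid, chunk in stream if lid in keep_layers]
```

A receiver that only needs the base spatial resolution can pass just the base layer id to `demultiplex` and discard the enhancement chunks, which is the selective-decoding behavior described for parsers 404-1-p and decoders 406-1-q below.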
[0058] In various embodiments, encoded video streams 402-1-o may be received by one or more parsers 404-1-p. Parsers 404-1-p may output the received video streams 402-1-o to one or more scalable video decoders 406-1-q each communicatively coupled to parsers 404-1-p. Parsers 404-1-p may also output received video streams 402-1-o and/or scalability indicators to visual composition module 110, also communicatively coupled to parsers 404-1-p.

[0059] In various embodiments, visual composition module 110 may receive video streams 402-1-o and/or scalability indicators from parsers 404-1-p. In both cases, visual composition module 110 may use video streams 402-1-o or the scalability indicators to determine whether video streams 402-1-o contain different video layers with different levels of spatial resolution, temporal resolution, and/or quality. If examination of video streams 402-1-o or a value for the scalability indicator indicates an unscalable video stream, then video decoders 406-1-q may perform decoding and visual composition display operations as normal. If visual composition module 110 determines that video streams 402-1-o are scalable video streams, then video decoders 406-1-q may perform scaled decoding accordingly. In the latter case, how many and which spatial or temporal scales in the video bitstream are decoded is determined by visual composition module 110. In each case, the same visual composition module 110 may be used to coordinate the allocation of spatial and temporal resolution across all composition windows such that the overall input bit rate budget or another constraint such as decoding performance is not exceeded.
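The decision in paragraph [0059] amounts to a simple branch on the scalability indicator: unscalable streams are decoded normally, while scalable streams are trimmed to only the layers the composition needs. The class, field names, and layer representation below are hypothetical stand-ins for elements 110, 404, and 406, not a defined interface:

```python
# Sketch of the scalability decision: decode everything for an unscalable
# stream, or only selected layers for a scalable one. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class VideoStream:
    stream_id: int
    scalable: bool    # scalability indicator from the parser (404-1-p)
    layers: list      # available (spatial, temporal) layer pairs, if any

def select_decoding(stream, wanted_spatial, wanted_temporal):
    """Return what the scalable video decoder (406-1-q) should decode."""
    if not stream.scalable:
        return "decode-all"  # unscalable stream: normal decoding
    # Scalable stream: decode only the layers this display object needs.
    return [(s, t) for (s, t) in stream.layers
            if s <= wanted_spatial and t <= wanted_temporal]
```

The visual composition module would call something like `select_decoding` once per display object, choosing `wanted_spatial` and `wanted_temporal` from the object's size and activity so the combined selections stay within the input bit rate budget.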
[0060] In one embodiment, for example, visual composition module 110 may receive a total input bit rate for multiple display objects of client terminal 106. The total input bit rate value may be received statically from memory 304, or dynamically from rendering modules 408-1-r, a communications interface, a transceiver, an operating system, and so forth. The total input bit rate value may provide an indication as to the total input bit rate budget for client terminal 106. The total input bit rate budget may vary in accordance with a number of factors, such as an instantaneous bandwidth of communication link 108, processing speed of processing unit 302, memory size for memory 304, memory bandwidth (e.g., memory bus speeds, access times, and so forth), user selected quality and resolution criteria, display object size, display object location, amount of motion within a video frame sequence for a display object, coding bit rates, graphics bus speeds, and so forth. Furthermore, the total input bit rate budget may vary over time in response to changing conditions for one or more of these factors.
[0061] Once visual composition module 110 receives the total input bit rate budget for client terminal 106, visual composition module 110 may allocate a display object bit rate to each display object used for a visual composition at client terminal 106. Visual composition module 110 may allocate a display object bit rate to a given display object based on any number of factors as previously described with reference to a total input bit rate budget. In some embodiments, for example, visual composition module 110 may allocate display object bit rates based on a display object size, a display object location, and an instantaneous channel capacity for communications link 108.
[0062] During allocation operations, visual composition module 110 limits display object bit rate allocations to a total display object bit rate for all display objects that is equal to or less than the total input bit rate for client terminal 106. Visual composition module 110 may dynamically vary display object bit rate allocations based on changing conditions, such as changes in active speaker, changes in display object size, changes in an amount of motion for video information in a given display object, changes in status (paused video or streaming video), and so forth. Visual composition module 110 may output the display object bit rate allocations to scalable video decoders 406-1-q communicatively coupled to visual composition module 110.
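A minimal sketch of the allocation in paragraphs [0061] and [0062]: give each display object a bit rate proportional to its on-screen area, so the allocations sum to at most the total input bit rate budget. Weighting purely by area is an assumption for illustration; the text lists several other factors (object location, motion, instantaneous channel capacity, and so forth):

```python
# Hypothetical display object bit rate allocation: proportional to pixel
# area, constrained so the total never exceeds the input bit rate budget.

def allocate_bit_rates(total_budget, object_areas):
    """object_areas maps display object id -> pixel area.
    Returns id -> allocated bit rate, summing to at most total_budget."""
    total_area = sum(object_areas.values())
    return {oid: total_budget * area / total_area
            for oid, area in object_areas.items()}
```

Because the allocation is a pure function of the budget and the current areas, it can simply be re-run whenever conditions change (a new active speaker, a resized display object, a shift in channel capacity), matching the dynamic re-allocation described above.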
[0063] In various embodiments, client terminal 106 may include a set of scalable video decoders 406-1-q