Abstract: The disclosure relates generally to methods and systems to synergize the context of an end-user with the quality-of-experience (QoE) of a live video feed. Conventional techniques mostly lack synergy between the encoder/decoder and the underlying protocol, and they mostly employ motion vector-based encoding, making them unsuitable for latency-critical applications. In most telepresence/teleoperation based live video streaming applications, the user's attention at any instant is focused on a certain region of the streamed video and not on the entire frame. Yet, under challenged network conditions, most existing techniques degrade quality evenly throughout the frame without giving importance to the end-user's instantaneous region of interest. Existing mechanisms that try to alleviate this problem using foveated rendering are computationally expensive and hence cannot be deployed easily on real-life robotics platforms. The present disclosure attempts to alleviate all the above challenges through an end-user foveation-centric spatio-temporal bitrate adaptation scheme tightly entangled with the underlying protocol to achieve dynamic foveated rendering. [To be published with FIG. 7B]
FORM 2
THE PATENTS ACT, 1970
(39 of 1970)
&
THE PATENT RULES, 2003
COMPLETE SPECIFICATION
(See Section 10 and Rule 13)
Title of invention:
METHODS AND SYSTEMS TO SYNERGIZE CONTEXT OF END-USER WITH QUALITY-OF-EXPERIENCE OF LIVE VIDEO FEED
Applicant
Tata Consultancy Services Limited
A company Incorporated in India under the Companies Act, 1956
Having address:
Nirmal Building, 9th floor,
Nariman point, Mumbai 400021,
Maharashtra, India
Preamble to the description:
The following specification particularly describes the invention and the manner in which it is to be performed:
TECHNICAL FIELD
[001]
The disclosure herein generally relates to tele-robotics, and, more particularly, to methods and systems to synergize context of end-user with quality-of-experience of live video feed.
BACKGROUND
[002]
Quality-of-experience (QoE)-centric streaming of visual feedback is a key factor for Internet-facing delay-sensitive interactive applications such as mobile telerobotics. The sanity of such visual feeds is extremely important for meaningful command execution by the operator to a remote robot. Conventional techniques for live video streaming for real-time applications lack synergy between encoding and the underlying protocol. Additionally, they employ a motion vector induced group of pictures (GoP) based encoding that severely degrades the quality evenly throughout the frame under challenged network conditions. Also, GoP based encoding schemes are not suitable for latency-critical applications. It is observed that at a time, the end-user's attention is limited to a specific region of interest in the current frame. This region of interest keeps shifting dynamically based on scene changes and user context. Therefore, to improve end-user experience, there have been some works in the domain of foveated rendering in the context of VR applications. They mimic the human visual system to render frames with high quality in the foveal (region of interest) region and comparatively lower quality in peripheral regions. This helps reduce the overall bitrate of the encoded video and allocate encoding resources judiciously. But they have not been applied in the concerned real-time delay-sensitive physical applications. Also, they use additional sensors in the headgear to track head movement and gaze and cannot be democratized for general consoles. Also, most of the foveated rendering techniques attempt at improving the quality of the foveal region by encoding the foveal region at a higher resolution than the peripheral regions. But such methods usually make the system computationally expensive and hence are not easily deployable on real-life robotic platforms.
SUMMARY
[003]
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems.
[004]
In an aspect, a processor-implemented method to synergize context of end-user with quality-of-experience of live video feed is provided. The method includes the steps of: defining, for a current video frame of a live video stream transmitted by a transmitter using an acquisition device, (i) a foveal region as a circle with a dynamically determined radius and a dynamically determined center, and (ii) a peripheral region as a region of a frame surrounding the foveal region; receiving at the transmitter, (i) an instantaneous feedback of a video quality, and (ii) 2-dimensional (2-D) eye-gaze coordinates of an end-user from the receiver, on a periodic timer expiry; employing a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user; performing at the transmitter, a foveation-centric spatial compression for both the current basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region; packetizing the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique; transmitting the one or more packets, to the receiver, over the network communication channel at a predefined frame rate based on a chosen encoding scheme determined based on the instantaneous value of the error estimates contained in the instantaneous feedback; receiving the one or more packets, at the receiver; reconstructing the one or more frames from the one or more packets, using the payload specific header; and estimating an error rate of the current video frame, using the payload specific header and a number of packets received.
[005]
In another aspect, a system to synergize context of end-user with quality-of-experience of live video feed is provided. The system includes: a memory storing instructions; one or more input/output (I/O) interfaces; an acquisition device; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: define, for a current video frame of a live video stream transmitted by a transmitter using an acquisition device, (i) a foveal region as a circle with a dynamically determined radius and a dynamically determined center, and (ii) a peripheral region as a region of a frame surrounding the foveal region; receive at the transmitter, (i) an instantaneous feedback of a video quality, and (ii) 2-dimensional (2-D) eye-gaze coordinates of an end-user from the receiver, on a periodic timer expiry; employ a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user; perform at the transmitter, a foveation-centric spatial compression for both basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region; packetize the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique; transmit the one or more packets, to the receiver, over the network communication channel at a predefined frame rate based on a chosen encoding scheme determined based on the instantaneous value of the error estimates contained in the instantaneous feedback; receive the one or more packets, at the receiver; reconstruct the one or more frames from the one or more packets, using the payload specific header; and estimate an error rate of the current video frame, using the payload specific header and a number of packets received.
[006]
In yet another aspect, there is provided a computer program product comprising a non-transitory computer readable medium having a computer readable program embodied therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: define, for a current video frame of a live video stream transmitted by a transmitter using an acquisition device, (i) a foveal region as a circle with a dynamically determined radius and a dynamically determined center, and (ii) a peripheral region as a region of a frame surrounding the foveal region; receive at the transmitter, (i) an instantaneous feedback of a video quality, and (ii) 2-dimensional (2-D) eye-gaze coordinates of an end-user from the receiver, on a periodic timer expiry; employ a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user; perform at the transmitter, a foveation-centric spatial compression for both basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region; packetize the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique; transmit the one or more packets, to the receiver, over the network communication channel at a predefined frame rate based on a chosen encoding scheme determined based on the instantaneous value of the error estimates contained in the instantaneous feedback; receive the one or more packets, at the receiver; reconstruct the one or more frames from the one or more packets, using the payload specific header; and estimate an error rate of the current video frame, using the payload specific header and a number of packets received.
[007]
In an embodiment, the dynamic foveated rendering is employed for rendering the one or more successive video frames of the live video stream, by: instantaneously retrieving the 2-D eye-gaze coordinates of the end-user with respect to the entire receiver side screen, using an online eye-tracking module on a periodic basis; transforming the 2-D eye-gaze coordinates to corresponding co-ordinates with respect to a viewing canvas window; invalidating the 2-D eye-gaze co-ordinates if they go beyond the viewing canvas window; computing various error estimates for the current video frame based on the error incurred for the current and past video frames; and transmitting (i) the computed error estimates of the current video frame as the instantaneous feedback of the video quality along with (ii) the 2-D eye-gaze co-ordinates, on the periodic timer expiry.
[008]
In an embodiment, the center is dynamically determined based on the 2-D eye gaze coordinates of the end-user at the receiver; and the radius is dynamically determined based on the instantaneous feedback of the video quality received on the periodic timer expiry from the receiver.
BRIEF DESCRIPTION OF THE DRAWINGS
[009]
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
[010]
FIG. 1 illustrates an exemplary application scenario for methods and systems of the present disclosure.
[011]
FIG. 2 is an exemplary block diagram of a system to synergize context of end-user with quality-of-experience of live video feed, in accordance with some embodiments of the present disclosure.
[012]
FIGS. 3A through 3B illustrate exemplary flow diagrams of a processor-implemented method illustrating the protocol semantics, transmitter and receiver side operations to achieve dynamic foveated rendering, in accordance with some embodiments of the present disclosure.
[013]
FIG. 4 illustrates an exemplary view of a foveation-based realization of intra-frame quality variation, in accordance with some embodiments of the present disclosure.
[014]
FIG. 5 shows a dynamic foveation-centric spatio-temporal encoding performed at the transmitter, and the various input parameters supplied to the spatial encoder of both basic and delta encoded frames to achieve dynamic foveation-centric spatial encoding, in accordance with some embodiments of the present disclosure.
[015]
FIG. 6 is an exemplary flow diagram showing steps illustrating how the instantaneous 2D end-user eye-gaze co-ordinates are retrieved from an online tracker and subsequent post processing of eye-gaze co-ordinates to achieve dynamic foveated rendering, in accordance with some embodiments of the present disclosure.
[016]
FIG. 7A shows continuous capturing of the instantaneous eye-gaze location of the end-user using an eye-tracker module and how the dynamic foveated rendering mechanism is reflected on the live video streamed over the viewing canvas window, in accordance with some embodiments of the present disclosure.
[017]
FIG. 7B shows pixel transformation of the captured eye-gaze location of the end-user to corresponding pixel co-ordinates with respect to the viewing canvas window in the dynamic foveated rendering mechanism, in accordance with some embodiments of the present disclosure.
[018]
FIG. 7C is a flowchart depicting a process performed at the receiver to retrieve the instantaneous eye-gaze co-ordinates of the end-user from an online eye-tracker module running in parallel on a periodic basis, followed by post processing of the retrieved eye-gaze co-ordinates and subsequent transmission of these in the ACK message to the transmitter.
[019]
FIGS. 8A-8E show timing diagrams in different situations comprising: (a) Tx gets an ACK with periodic feedback from Rx signifying no loss at Rx and invalid 2D eye-gaze co-ordinates, resulting in sending the next frame with improved bitrate with the instantaneous foveal region defined with a center set as the center of the current frame since invalid 2D eye-gaze co-ordinates are received, (b) Tx gets an ACK with periodic feedback from Rx signifying loss at Rx above threshold and instantaneous 2D eye-gaze co-ordinates, resulting in sending the next frame with reduced bitrate with the current foveal region defined using the received eye-gaze co-ordinates, (c) ACK lost for a full frame making Tx send the next frame as a full frame with the foveal region defined with a center set as the center of the current frame since current 2D eye-gaze co-ordinates are not received, (d) ACK received before periodic timer expiry for a frame sent at the previous expiration of the periodic timer with error at Rx within threshold and instantaneous 2D eye-gaze co-ordinates of the end-user, whereupon the transmitter sends the next frame as a delta frame with the foveal region defined using a center as per the received 2D eye-gaze co-ordinates, (e) ACK belonging to a frame sent at periodic timer expiry is lost, so Tx sends the next frame with the 'periodic timer status' flag set with the foveal region defined with a center set as the center of the current frame since current 2D eye-gaze co-ordinates are not received, in accordance with some embodiments of the present disclosure.
[020]
FIG. 9A is a graph showing a performance comparison between the present disclosure and WebRTC on full referential visual metrics for last mile channel degradations.
[021]
FIG. 9B shows a setup for a long-haul telerobotic experiment, in accordance with some embodiments of the present disclosure.
[022]
FIG. 9C is a graph showing a performance comparison of the present disclosure with WebRTC on a full-referential quality metric for the long-haul experimental setup shown in FIG. 9B.
[023]
FIG. 9D is a graph showing comparative bandwidth consumption at Tx & Rx for live streaming via relay servers at Mumbai, Ohio & Tokyo, in accordance with some embodiments of the present disclosure.
[024]
FIG. 9E is a graph showing an MOS scores comparison of the present disclosure with WebRTC for a typical teleoperation scenario for the long haul experimental setup shown in FIG. 9B for streaming via relay servers located in Mumbai, Tokyo, and Ohio, in accordance with some embodiments of the present disclosure.
[025]
FIG. 9F shows a setup for a latency measurement, in accordance with some embodiments of the present disclosure.
[026]
FIG. 9G is a graph showing a latency comparison of the present disclosure with WebRTC for the experimental setup in FIG. 9B for streaming via relay servers in Ohio, Tokyo, and Mumbai, in accordance with some embodiments of the present disclosure.
[027]
FIG. 9H shows a setup for a live experiment with the dynamic foveation, in accordance with some embodiments of the present disclosure.
[028]
FIG. 9I is a graph showing a comparative result of MOS scores of fixed and dynamic foveated rendering, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[029]
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
[030]
In a typical telerobotics application, a human operator remotely controls a mobile robot over the public Internet. The operator gets the live context of the remote environment through the live video feed from the robot camera to the operator console and sends control commands to the robot based on contextual inferences. The end-user quality-of-experience (QoE) of such a visual feed, in turn, impacts the QoE of the entire control operation. It is very sensitive to poor visual quality, overshoot in end-to-end motion-to-photon or scene-to-screen delay, and freezing. Such occurrences drastically reduce the confidence of the operator in the sanity of the inferred remote context and adversely impact the overall actuation decisions.
[031]
Additionally, it is observed that in real-life telerobotics applications, the attention of the end-user is focused on a portion of the scene and not the entire scene. This becomes the instantaneous region of interest of the end-user. This region of interest (foveal region) usually shifts dynamically throughout the frame based on scene changes and the current user's attention. Therefore, transmitting video in real time without compromising the quality of the foveal region, while meeting the instantaneous channel QoS requirements, becomes an essential requirement for the successful working of any real-time telepresence/teleoperation system.
[032]
In recent times, foveated rendering is being tried in the context of virtual reality (VR) glass applications to enable bit-rate reduction aligned with the user's observation. But those techniques have not been applied in the concerned real-time delay sensitive physical applications. Also, they use additional sensors in the headgear to track head movement and gaze and cannot be democratized for general consoles.
[033]
The present disclosure attempts to solve the above discussed challenges in state-of-the-art techniques. The key contributions of the methods and systems of the present disclosure are:
Spatiotemporal encoding and QoE-centric bit-rate adaptation: A human-foveation centric spatial bitrate adaptation is introduced on top of intelligent temporal encoding using background subtraction on simple MJPEG. This makes it simple, agile, robust, adaptive, end-user QoE aware yet bandwidth efficient.
Democratized gaze-tracking for QoE adaptation with real-time user-context: Finally, this is a first-of-its-kind practical system which integrates a simple RGB camera-based gaze tracking module at the operator which synchronously influences the transmitter and the protocol by supplying the instantaneous user context to the encoder state machine.
[034]
A video quality-of-experience (QoE) for latency-sensitive interactive applications is modelled as in equation (1) as disclosed by applicant’s prior patent applications:
$E = f(V_q, P_F, \sigma, \delta, d)$    (1)
wherein, $E$ is an observed QoE in a given time span $\tau$, $V_q$ is an average visual quality of the frames across $\tau$, $P_F$ is a probability of freezing, $\sigma$ is a variance in the distance between successive frames, $\delta$ is a gap between the achieved framerate and the desired rate for the present context, and $d$ is a perceived delay with respect to a local actuation.
[035]
Apart from $V_q$, the rest are dependent variables influenced by the ratio of the present bitrate ($b$) to the channel capacity, and the present frame rate ($R$). In order to maintain $E$, $b$ cannot be reduced drastically as that would adversely affect $V_q$. So, a reference for $V_q$ is maintained as a constraint and, to practically create a tunable mechanism, equation (1) is redefined as equation (2):
$E' = f(V_q, R)$    (2)
where $V_q$ is expressed as in equation (3).
[036]
In the present disclosure, the foveal region is defined as a circle with a dynamically determined radius and a dynamically determined center. The center of the foveal region is dynamically determined using external triggers reflecting the remote operator's eye-gaze. The foveal radius is dynamically adapted in response to instantaneous channel conditions. This adapts the area of the foveal region. Thus, under degrading and improving channel conditions the area of the foveal region is reduced and increased respectively by dynamic adaptation of the foveal radius. The increase and decrease in the area of the foveal region are referred to as foveal expansion and shrinking respectively. Together they are referred to as Foveal Breathing.
$V_q = \{V_q^{fov} \cup V_q^{per}\}$    (3)
wherein, $V_q^{fov}$ is a quality within the foveal region and $V_q^{per}$ is a quality at the peripheral region beyond the foveal region. A ratio $G = A/A'$ is defined, where $A$ is an area under the foveal region and $A'$ is an area under the peripheral region. To maintain QoE, the whole tunable mechanism is to be constrained by $V_q^{fov}$. $G$ is a foveal breathing parameter which controls foveal breathing. Thus, equation (2) is further modified as equation (4):
$E'' = f(V_q^{fov}, V_q^{per}, G, R)$    (4)
[037]
Thus, equation (4) is equipped with four handles for the tuning. Let $\rho_k \;\forall k \in \{0, \ldots, 3\}$ denote the priority of operation for each of the components on the right hand side of equation (4) respectively. At any instant, based on the present condition, the process should identify the set of indices $i$ such that $\rho_i > \rho_j$, where $i, j \subseteq \{0, \ldots, 3\}$, and the parameters corresponding to the indices in $i$ are tuned. Under a degrading channel condition, the normally desired ordering of $\rho_k$ would be as in equation (5):
$\rho_1 > \rho_2 > \rho_3 > \rho_0$    (5)
[038]
Thus, initially the desired bitrate is to be achieved by tuning the quality (in terms of quantization) in the peripheral region. Once a limit of matrix sparsity is reached, foveal breathing follows by reducing $G$ to achieve the desired bitrate. Then, the frame rate is reduced extrinsically to a certain threshold. Finally, if all the previous tunings do not yield the desired bitrate, then the quality of the present foveal region needs to be reduced. In the ideal case, the reduction of the quality of the foveal region is not exercised. The tuning may also lead to improving all the factors when the channel condition improves and, in an ideal situation, $V_q^{fov} = V_q^{per}$.
In the best possible scenario, when $V_q^{per} \to V_q^{fov}$, $G \to \infty$. In the worst scenario, when $V_q^{fov} \to V_q^{per}$, $G \to 0$.
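By way of illustration only, and not as a description of the claimed encoder, the following Python sketch shows one possible realization of the priority ordering of equation (5): peripheral quality, foveal radius, frame rate, and foveal quality are tuned in that order under degradation. The function name, dictionary fields, initial values, and step sizes are hypothetical.

def tune_for_degradation(state, limits):
    """Apply one tuning step in the priority order rho_1 > rho_2 > rho_3 > rho_0."""
    # rho_1: degrade peripheral quality first, via the scaling factor S
    if state["S"] < limits["S_max"]:
        state["S"] = min(state["S"] + limits["S_step"], limits["S_max"])
    # rho_2: foveal breathing -- shrink the foveal radius percentage mu
    elif state["mu"] > limits["mu_min"]:
        state["mu"] = max(state["mu"] - limits["mu_step"], limits["mu_min"])
    # rho_3: reduce the extrinsic frame rate down to a threshold
    elif state["fps"] > limits["fps_min"]:
        state["fps"] = max(state["fps"] - 1, limits["fps_min"])
    # rho_0: last resort -- reduce the quality of the foveal region itself
    elif state["q_fov"] > limits["q_fov_min"]:
        state["q_fov"] -= 1
    return state

# Illustrative usage with assumed values
state = {"S": 50, "mu": 50, "fps": 15, "q_fov": 9}
limits = {"S_max": 90, "S_step": 10, "mu_min": 15, "mu_step": 5,
          "fps_min": 5, "q_fov_min": 3}
state = tune_for_degradation(state, limits)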
[039]
FIG. 1 illustrates an exemplary application scenario for the methods and systems of the present disclosure. In the exemplary application scenario of FIG. 1, a transmitter Tx 120, a receiver Rx 130, and a communication network 140 having the end-to-end transmission channel are present. The transmitter Tx 120 transmits the video obtained through a video producing unit (not shown in FIG. 1) through the communication network 140 and the receiver Rx 130 receives the video via a video consuming unit (not shown in FIG. 1). The video producing unit may be a video source unit, or a video acquisition unit such as a camera, video sensor, and so on. The video consuming unit may be the end device where the video is being viewed or displayed. In a typical scenario, the mobile robot may act as the transmitter Tx 120 and the human operator may act as the receiver Rx 130.
[040]
Referring now to the drawings, and more particularly to FIG. 2 through FIG. 9I, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary systems and/or methods.
[041]
FIG. 2 is an exemplary block diagram of a system 200 to synergize context of end-user with quality-of-experience of live video feed, in accordance with some embodiments of the present disclosure. In an embodiment, the system 200 includes or is otherwise in communication with one or more hardware processors 204, communication interface device(s) or input/output (I/O) interface(s) 206, and one or more data storage devices or memory 202 operatively coupled to the one or more hardware processors 204. The one or more hardware processors 204, the memory 202, and the I/O interface(s) 206 may be coupled to a system bus 208 or a similar mechanism.
[042]
The I/O interface(s) 206 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface(s) 206 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, a plurality of sensor devices, a printer and the like. Further, the I/O interface(s) 206 may enable the system 200 to communicate with other devices, such as web servers and external databases.
[043]
The I/O interface(s) 206 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the I/O interface(s) 206 may include one or more ports for connecting a number of computing systems with one another or to another server computer. Further, the I/O interface(s) 206 may include one or more ports for connecting a number of devices to one another or to another server.
[044]
The one or more hardware processors 204 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the one or more hardware processors 204 are configured to fetch and execute computer-readable instructions stored in the memory 202. In the context of the present disclosure, the expressions 'processors' and 'hardware processors' may be used interchangeably. In an embodiment, the system 200 can be implemented in a variety of computing systems, such as laptop computers, portable computers, notebooks, hand-held devices, workstations, mainframe computers, servers, a network cloud and the like.
[045]
The memory 202 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, the memory 202 includes a plurality of modules 202a and a repository 202b for storing data processed, received, and generated by one or more of the plurality of modules 202a. The plurality of modules 202a may include routines, programs, objects, components, data structures, and so on, which perform particular tasks or implement particular abstract data types.
[046]
The plurality of modules 202a may include programs or computer-readable instructions or coded instructions that supplement applications or functions performed by the system 200. The plurality of modules 202a may also be used as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the plurality of modules 202a can be used by hardware, by computer-readable instructions executed by the one or more hardware processors 204, or by a combination thereof. In an embodiment, the plurality of modules 202a can include various sub-modules (not shown in FIG. 2). Further, the memory 202 may include information pertaining to input(s)/output(s) of each step performed by the processor(s) 204 of the system 200 and methods of the present disclosure.
[047]
The repository 202b may include a database or a data engine. Further, the repository 202b, amongst other things, may serve as a database or include a plurality of databases for storing the data that is processed, received, or generated as a result of the execution of the plurality of modules 202a. Although the repository 202b is shown internal to the system 200, it will be noted that, in alternate embodiments, the repository 202b can also be implemented external to the system 200, where the repository 202b may be stored within an external database (not shown in FIG. 2) communicatively coupled to the system 200. The data contained within such an external database may be periodically updated. For example, data may be added into the external database and/or existing data may be modified and/or non-useful data may be deleted from the external database. In one example, the data may be stored in an external system, such as a Lightweight Directory Access Protocol (LDAP) directory and a Relational Database Management System (RDBMS). In another embodiment, the data stored in the repository 202b may be distributed between the system 200 and the external database.
[048]
Referring to FIGS. 3A through 3B, components and functionalities of the system 200 are described in accordance with an example embodiment of the present disclosure. For example, FIGS. 3A through 3B illustrate exemplary flow diagrams of a processor-implemented method 300 illustrating the protocol semantics, transmitter and receiver side operations to achieve dynamic foveated rendering, in accordance with some embodiments of the present disclosure. Although steps of the method 300 including process steps, method steps, techniques or the like may be described in a sequential order, such processes, methods, and techniques may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any practical order. Further, some steps may be performed simultaneously, or some steps may be performed alone or independently.
[049]
At step 302 of the method 300, the one or more hardware processors 204 of the system 200 are configured to define, for a current video frame of a live video stream transmitted by a transmitter using an acquisition device 208 (not shown in FIG. 2), (i) a foveal region as a circle with a dynamically determined radius and a center, and (ii) a peripheral region as a region of a frame surrounding the foveal region.
[050]
The foveal region is defined as a circle around an arbitrary pixel on the frame as the center, with the adaptive radius $\hat{R}$ for a given percentage $\mu$ such that:
$\hat{R} = \max(ImageHeight, ImageWidth) \cdot \mu / 100$    (6)
wherein $\mu$ is the foveal radius percentage factor.
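A minimal Python sketch of equation (6) is given below for illustration; the circular-mask helper is an assumption of how a foveal region could be derived from the radius and center, and is not part of the claimed encoder.

import numpy as np

def foveal_radius(image_height, image_width, mu):
    """Adaptive radius per equation (6): R = max(H, W) * mu / 100."""
    return int(max(image_height, image_width) * mu / 100.0)

def foveal_mask(image_height, image_width, center, mu):
    """Boolean mask: True inside the foveal circle, False in the peripheral
    region (illustrative helper only)."""
    radius = foveal_radius(image_height, image_width, mu)
    cx, cy = center
    yy, xx = np.ogrid[:image_height, :image_width]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2

# Example: VGA frame, mu = 50%, foveal center at the frame center
mask = foveal_mask(480, 640, center=(320, 240), mu=50)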
[051]
At step 304 of the method 300, the one or more hardware processors 204 of the system 200 are configured to receive at the transmitter (i) an instantaneous feedback of a video quality from the receiver Rx, upon an expiry of a periodic timer present at the receiver Rx. In addition to the instantaneous feedback of the video quality, (ii) 2-dimensional (2-D) eye-gaze coordinates of the end-user are also received from the receiver on the periodic timer expiry.
[052]
In an embodiment, the transmitter Tx in the present disclosure maintains the periodic timer as present in ARV of the applicant. On expiry of the periodic timer the transmitter Tx transmits the first packet of the present frame in flight in CON (Confirmable) mode, but with no retransmission. The tuning decision in optimizing the QoE depends on the periodic feedback(s). At each playout interval $t$, the receiver Rx determines the total number of expected packets for each frame by parsing the offset field of the first packet containing the position indicator for the last packet in the frame. The receiver Rx uses this information to compute an Instantaneous Error Rate $I_t$ using the following equation (7):
$I_t := \frac{N_{lost}}{N_{total}} \times 100$    (7)
[053]
Using $I_t$, the receiver Rx computes a Cumulative Error Rate $Cmk_t := I_t + Cmk_{t-1}$ which indicates the accumulated error between two instances of receiving CON packets. Additionally, the receiver Rx maintains a log of $I_t$ at time $t$ marking the end of the present play-out interval. Whenever the receiver Rx receives a CON packet on periodic timer expiry, it computes a Predominant Error Rate $P_t := \mathrm{Mode}(I_t, I_{t-1}, \ldots, I_{t-k})$, where $P_t$ indicates the most frequent error rate within the interval $t$ and $t-k$, and $t-k$ is the time when the receiver Rx last received a CON packet. The receiver Rx piggybacks $Cmk_t$ and $P_t$ with the ACK of the CON packet. The instantaneous feedback of the video quality refers to the computed values of the instantaneous cumulative and predominant error rates that are sent at periodic timer expiry.
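The following Python sketch illustrates one possible receiver-side bookkeeping of the instantaneous, cumulative, and predominant error rates described above; the class name, history length, and the resetting of the accumulator after each periodic feedback are assumptions made for illustration.

from collections import deque
from statistics import mode

class ErrorEstimator:
    """Illustrative bookkeeping for I_t (eq. 7), Cmk_t and P_t."""

    def __init__(self, history=32):
        self.cumulative = 0.0                  # Cmk_t between CON packets
        self.history = deque(maxlen=history)   # log of I_t values

    def on_frame(self, n_lost, n_total):
        # Equation (7): instantaneous error rate for the current frame
        inst = 100.0 * n_lost / n_total if n_total else 0.0
        self.cumulative += inst                # Cmk_t := I_t + Cmk_{t-1}
        self.history.append(inst)
        return inst

    def on_periodic_timer(self):
        # P_t: the most frequent error rate since the last reliable packet
        predominant = mode(self.history) if self.history else 0.0
        feedback = (self.cumulative, predominant)
        self.cumulative = 0.0                  # accumulate afresh (assumption)
        self.history.clear()
        return feedback                        # piggybacked with the ACK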
[054]
At step 306 of the method 300, the one or more hardware processors 204 of the system 200 are configured to employ a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user.
[055]
The center of the foveal region is dynamically determined using external triggers reflecting the eye gaze coordinates of the remote operator or the end-user. This is achieved using the instantaneous 2D eye gaze co-ordinates of the end-user obtained from the periodic feedback from Rx. Additionally, the instantaneous radius of the foveal region is computed based on the current value of $\mu$. The instantaneous center and radius hence determined are used to define the current foveal region of the frame. The part of the image beyond the foveal region is called the peripheral region. FIG. 4 illustrates an exemplary view of a foveation-based realization of intra-frame quality variation, in accordance with some embodiments of the present disclosure.
[056]
FIG. 5 shows a dynamic foveation-centric spatio-temporal encoding performed at the transmitter, and the various input parameters supplied to the spatial encoder of both basic and delta encoded frames to achieve dynamic foveation-centric spatial encoding, in accordance with some embodiments of the present disclosure. As shown in FIG. 5, both the basic as well as the delta frame encoders perform foveation-centric spatial encoding of basic and delta frames respectively, considering the instantaneous foveal region determined using the dynamically adaptive radius and the dynamically determined center.
[057]
FIG. 6 is an exemplary flow diagram showing steps illustrating how the instantaneous 2D end-user eye-gaze co-ordinates are retrieved from an online tracker and the subsequent post processing of eye-gaze co-ordinates to achieve dynamic foveated rendering. As shown in FIG. 6, employing the dynamic foveated rendering for rendering the one or more successive video frames of the live video stream to achieve QoE is explained through steps 306a to 306d.
[058]
At step 306a, an online eye-tracker module is run in parallel to the receiver Rx. The eye-tracker module continuously retrieves the instantaneous eye-gaze location of the end-user on the screen. The receiver Rx periodically invokes the eye-tracker module to retrieve the instantaneous 2D eye-gaze co-ordinates. FIG. 7A shows continuous capturing of the instantaneous eye-gaze location of the end-user using an eye-tracker module and how the dynamic foveated rendering mechanism is reflected on the live video streamed over the viewing canvas window, in accordance with some embodiments of the present disclosure.
[059]
As shown in FIG. 7A, the 2-D eye-gaze coordinates hence obtained are captured with respect to the entire receiver side screen. But the live streamed video on the receiver side screen is limited to the viewing canvas window. Hence, at step 306b, the 2-D eye-gaze coordinates captured at step 306a are transformed to corresponding co-ordinates with respect to a viewing canvas window. FIG. 7B shows pixel transformation of the captured eye-gaze location of the end-user to corresponding pixel co-ordinates with respect to the viewing canvas window in the dynamic foveated rendering mechanism, in accordance with some embodiments of the present disclosure.
[060]
Next, at step 306c, the 2-D eye-gaze co-ordinates are invalidated if they go beyond the viewing canvas window. That means only the 2-D eye-gaze co-ordinates that are within the viewing canvas window are considered as valid co-ordinates at the transformation step 306b. Finally, at step 306d, the error estimates of the current video frame are transmitted as the instantaneous feedback of the video quality along with the 2-D eye-gaze co-ordinates that are obtained at the transformation step 306b, on the periodic timer expiry.
[061]
FIG. 7C is a flowchart depicting a process performed at the receiver to retrieve the instantaneous eye-gaze co-ordinates of the end-user from an online eye-tracker module running in parallel on a periodic basis, followed by post processing of the retrieved eye-gaze co-ordinates and subsequent transmission of these in the ACK message to the transmitter. As shown in FIG. 7C, the receiver Rx periodically calls the GazOB module to retrieve the instantaneous eye-gaze coordinates (location) of the end-user with respect to the entire receiver side screen and transforms the coordinates with respect to the receiver side screen to corresponding co-ordinates with respect to the viewing canvas window. The transformed 2D eye-gaze co-ordinates are then transmitted, along with the other computed error estimates, piggybacked with the ACK.
[062]
Dynamic determination of the foveal center: High end-user QoE is ensured by providing, at higher quality, the part of the content in which the user is currently interested. The region of highest instantaneous visual attention on the screen is determined based on tracking the eye movements of the end-user. The present disclosure employs a module called Gaze Observer (GazOB) which is integrated with the encoder and protocol states. The module begins by recording the gaze data using an online eye-tracker. The present disclosure can employ any real-time online eye-tracker to achieve the same purpose. The employed eye-tracker module in GazOB requires an initial training phase to calibrate the head and eye movements of the end-user. To determine end-user gaze locations on the screen, the currently employed eye-tracker in GazOB divides the entire screen into a grid of 4 x 5 cells. It determines the grid cell with the highest visual attention of the end-user. It then returns the pixel co-ordinates of the center of the estimated grid cell with respect to the entire receiver side screen.
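Purely for illustration, the following Python helper shows how the center pixel of a cell in a 4 x 5 grid may be computed with respect to the receiver side screen; the function and its arguments are hypothetical and do not describe the internals of the employed eye-tracker.

def grid_cell_center(row, col, screen_w, screen_h, rows=4, cols=5):
    """Pixel co-ordinates of the centre of one cell when the screen is
    divided into a rows x cols grid (4 x 5 in the tracker described above)."""
    cell_w = screen_w / cols
    cell_h = screen_h / rows
    return (int((col + 0.5) * cell_w), int((row + 0.5) * cell_h))

# Example: cell at row 1, column 2 on a 1920 x 1080 screen
print(grid_cell_center(1, 2, 1920, 1080))   # -> (960, 405)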
[063]
The 2D eye-gaze co-ordinates obtained from the GazOB module are obtained with respect to the entire receiver side screen. The pixel co-ordinates on the receiver side screen, obtained from the GazOB module, have to be transformed into corresponding co-ordinate values on the canvas window, which is the viewing window for the end-user. Let the width and height of the entire receiver side screen be represented as $S_w$ and $S_h$ respectively. Let $C_w$ and $C_h$ refer to the width and height of the canvas window. The co-ordinates of the center of the foveal region obtained from GazOB are represented as $(U_x, U_y)$. If $(U_x, U_y)$ lies within the canvas window, then they are transformed into co-ordinates $(U_x^t, U_y^t)$ as per the following set of equations. FIG. 7B depicts the pixel transformation to transform the 2-D eye-gaze coordinates to corresponding co-ordinates with respect to a viewing canvas window, in accordance with some embodiments of the present disclosure.
$U_x^t = U_x - \frac{S_w - C_w}{2}$

$U_y^t = U_y - \frac{S_h - C_h}{2}$
[064]
Before starting the receiver Rx, the currently employed eye-tracking module needs an initial calibration phase to match the eye pattern of the end-user. After this, the receiver Rx is started with the GazOB module running in parallel in the background. The receiver Rx runs two threads of execution simultaneously. One thread corresponds to the reception of packets and decoding of received packets for frame rendering. The other thread corresponds to the periodic invocation of the GazOB module by the receiver Rx to obtain the coordinates of the center of the instantaneous foveal region. The periodicity of the call to the GazOB module depends on the timeout value of the inherent periodic timer. For instance, if the periodic timer expires after every 1 sec, the time difference between two consecutive calls to the GazOB module should be less than 1 sec so that the receiver Rx is always equipped with the instantaneous eye-gaze locations whenever it must send periodic feedback to the transmitter Tx. The receiver piggybacks $(U_x^t, U_y^t)$ along with the other error estimates with the ACK of the reliable packet. If $(U_x, U_y)$ goes beyond the canvas window, then the receiver Rx invalidates the eye-gaze co-ordinates so obtained by passing the value (0,0) in the $(U_x^t, U_y^t)$ field, as depicted by the following set of equations:
$(U_x^t, U_y^t) = \begin{cases} (0, 0), & \text{if } (U_x, U_y) \text{ is beyond the canvas boundary} \\ (U_x^t, U_y^t), & \text{otherwise} \end{cases}$
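A short Python sketch of the above transformation and invalidation is given below, assuming (as the transformation equations imply) that the viewing canvas window is centered on the receiver side screen; the function name is illustrative.

def transform_gaze(ux, uy, screen_w, screen_h, canvas_w, canvas_h):
    """Map screen-space gaze (Ux, Uy) to canvas-space (Uxt, Uyt); return
    (0, 0) when the gaze falls outside the viewing canvas window."""
    x_off = (screen_w - canvas_w) / 2.0
    y_off = (screen_h - canvas_h) / 2.0
    uxt, uyt = ux - x_off, uy - y_off
    if 0 <= uxt <= canvas_w and 0 <= uyt <= canvas_h:
        return (int(uxt), int(uyt))
    return (0, 0)   # invalidated: beyond the canvas boundary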
[065]
At the transmitter side Tx, the received instantaneous user eye-gaze location is analyzed. If the co-ordinates are (0,0), the transmitter Tx understands that the eye-gaze location went beyond the canvas and hence sets the center of the current foveal region as the center of the video frame. Otherwise, it uses the same co-ordinates obtained in the periodic feedback as the center of the current foveal circle, as per the following equation:
$fov_{center} = \begin{cases} \left(\frac{C_w}{2}, \frac{C_h}{2}\right), & \text{if } (U_x^t, U_y^t) = (0, 0) \\ (U_x^t, U_y^t), & \text{otherwise} \end{cases}$
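A corresponding transmitter-side Python sketch of the foveal-center selection, illustrative only, with the frame-center fallback taken as the canvas center:

def foveal_center(uxt, uyt, canvas_w, canvas_h):
    """Fall back to the frame centre when the received co-ordinates were
    invalidated as (0, 0); otherwise use the received gaze co-ordinates."""
    if (uxt, uyt) == (0, 0):
        return (canvas_w // 2, canvas_h // 2)
    return (uxt, uyt)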
[066]
At step 308 of the method 300, the one or more hardware processors 204 of the system 200 are configured to perform at the transmitter, a foveation-centric spatial compression for both basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region.
[067]
The temporal compression is achieved using background subtraction to generate delta frames. The spatial compression is achieved considering the instantaneous foveal region, determined by the instantaneous value of the radius and the center set as per the received 2D eye-gaze co-ordinates, by adaptively switching between the foveal and the peripheral phases. In an embodiment, the spatio-temporal compression mechanism is employed as disclosed in the applicant's earlier inventions.
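As a non-limiting illustration of background-subtraction based temporal compression (the actual mechanism is as disclosed in the applicant's earlier inventions), the following Python/OpenCV sketch zeroes out pixels that have changed little since the reference basic frame; the threshold value and the gray-scale change detection are assumptions.

import cv2
import numpy as np

def delta_frame(current, reference, threshold=15):
    """Keep only the regions that changed with respect to the reference
    (basic) frame; unchanged pixels are zeroed out in the delta frame."""
    diff = cv2.absdiff(current, reference)
    changed = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY) > threshold
    delta = np.zeros_like(current)
    delta[changed] = current[changed]
    return delta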
[068]
The spatial encoding scheme is inherited from previous disclosures. The spatial encoding is achieved by adaptively switching between peripheral and foveal phases. In the peripheral phase, only the quality of the peripheral region is adapted using nuances of the quantization mechanism in JPEG encoding with the help of a dynamically determined scaling factor $S$. Whereas in the foveal phase, the area of the foveal region is increased or decreased by dynamically adapting the foveal radius. This mechanism is referred to as foveal breathing and determines the instantaneous foveal radius that defines the current foveal region. The bitrate adaptation in each phase and the adaptive switching between the two phases are accomplished based on the instantaneous feedback of video quality in the periodic feedback.
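For illustration only, the following Python sketch shows one simplified two-level interpretation of the foveal/peripheral quality split driven by the maximum quality factor and the scaling factor S; the exact quantization nuances are as per the previous disclosures and are not reproduced here.

def spatial_quality(dist_from_center, fov_radius, q_max, scaling_s):
    """Quality factor for a block at a given distance from the foveal
    centre: full quality inside the foveal circle, quality scaled down by
    S (in percent) in the peripheral region. A simplified illustration."""
    if dist_from_center <= fov_radius:
        return q_max
    return max(1, int(q_max * (100 - scaling_s) / 100.0))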
[069]
The temporal encoder in Tx generates Delta and Basic frames. The spatial encoders for Basic frames and Delta frames are referred to as the Basic Frame Encoder and the Delta Frame Encoder respectively. The encoded frame is packetized and transmitted over the communication channel. Several input parameters are supplied to the spatial encoder to achieve QoE aware bitrate adaptation: the maximum quality factor ($Q_{max}$) that defines the quality of the foveal region; the scaling factor ($S$) that is dynamically determined based on the instantaneous feedback of video quality and is used to determine the quality of the peripheral region; the coordinates of the center of the instantaneous foveal region ($fov_{center} = (U_x^t, U_y^t)$); and the radius of the instantaneous foveal region ($fov_{rad}$) determined by the foveal breathing mechanism, which together define the current foveal region.
[070]
At step 310 of the method 300, the one or more hardware processors 204 of the system 200 are configured to packetize the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique.
[071]
The packetization technique converts the one or more encoded successive video frames of the live video stream into one or more packets to facilitate reconstruction and error concealment of the video frames at the receiver Rx.
[072]
The packetization happens such that an integral number of MCUs are placed in a single packet with the necessary padding bits, as MCUs are not byte aligned. The MCU payload in each packet is preceded by a payload specific header. The present disclosure follows the protocol semantics and state machine of CoAP. In an embodiment, the reliable semantics is achieved through the CON (Confirmable) mode and best-effort by combining the NON (Non-Confirmable) mode with the No-Response option. The CON mode transmissions are carried out in non-blocking mode as per the previous disclosure of the applicant. The acknowledgement message from the receiver carries the error estimates along with the instantaneous 2D eye-gaze co-ordinates of the end-user. The previously disclosed protocol header option named NRTx is used to manage the maximum number of retransmissions allowed per packet. A simple yet effective retransmission timeout adaptation based on channel conditions, disclosed in previous work, is also employed.
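The following Python sketch is a hypothetical illustration of placing an integral number of MCUs per packet behind a payload specific header; the header layout assumed here (frame id, frame type, offset of the first MCU in the packet, offset of the last MCU of the frame) is for illustration only and does not reproduce the actual header defined in the earlier disclosures.

import struct

HEADER_FMT = "!IBHH"   # assumed layout: frame id, frame type, first offset, last offset

def packetize(frame_id, frame_type, mcu_blobs, mcus_per_packet, last_offset):
    """Pack an integral number of already padded MCU byte strings into each
    packet, each preceded by the (assumed) payload specific header."""
    packets = []
    for first in range(0, len(mcu_blobs), mcus_per_packet):
        header = struct.pack(HEADER_FMT, frame_id, frame_type, first, last_offset)
        packets.append(header + b"".join(mcu_blobs[first:first + mcus_per_packet]))
    return packets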
[073]
At step 312 of the method 300, the one or more hardware processors 204 of the system 200 are configured to transmit the one or more packets obtained at step 310 of the method 300, to the receiver Rx, over the network communication channel at a predefined frame rate determined based on a chosen encoding scheme, which in turn is determined based on the instantaneous value of the error estimates contained in the instantaneous feedback.
[074]
At step 314 of the method 300, the one or more hardware processors 204 of the system 200 are configured to receive the one or more packets sent by the transmitter Tx at step 312 of the method 300, at the receiver Rx.
[075]
The receiver side reconstruction, correction, reassembly of received video frames and jitter buffer adaptation are achieved using the mechanisms disclosed in previous patents. The receiver Rx performs reassembly of received packets depending on whether the received frame is a Full or Delta encoded frame, using the metadata information contained in the payload specific header of each packet. To tackle packet loss, Rx is also equipped with a frugal yet efficient loss handling mechanism for both basic and delta frames. In addition, to achieve smooth rendering of the live video, the receiver Rx employs a Kalman Filter based jitter buffer adaptation mechanism as per the previous disclosure of the applicant.
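Since the disclosure reuses a Kalman Filter based jitter buffer adaptation from earlier work, the following minimal scalar Kalman filter over observed inter-arrival jitter is shown only as an illustrative sketch; the random-walk model and the noise constants are assumptions and not part of the disclosed mechanism.

class JitterKalman:
    """Minimal scalar Kalman filter that smooths the observed inter-arrival
    jitter; the smoothed estimate could then drive the playout buffer depth."""

    def __init__(self, q=1e-3, r=0.05):
        self.estimate, self.p = 0.0, 1.0   # state and its variance
        self.q, self.r = q, r              # process / measurement noise (assumed)

    def update(self, observed_jitter_ms):
        self.p += self.q                   # predict step (random-walk model)
        k = self.p / (self.p + self.r)     # Kalman gain
        self.estimate += k * (observed_jitter_ms - self.estimate)
        self.p *= (1.0 - k)
        return self.estimate               # smoothed jitter estimate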
[076]
At step 316 of the method 300, the one or more hardware processors 204 of the system 200 are configured to reconstruct the one or more frames from the one or more packets received at step 314 of the method 300, using the payload specific header. The reconstruction of the one or more frames is performed depending on the encoding scheme employed at the transmitter Tx for the live video streaming.
[077]
At step 318 of the method 300, the one or more hardware processors 204 of the system 200 are configured to estimate an error rate of the current video frame, using the payload specific header and the number of packets received. The semantics of the protocol headers allow the receiver Rx to determine the ratio of packets lost in a frame ($N_{lost}$) against the packets expected ($N_{total}$). The receiver Rx computes the instantaneous error rate $E_t$:
$E_t = \frac{N_{lost}}{N_{total}} \times 100$
[078]
Using $E_t$, the receiver Rx calculates two error estimates. One is the Cumulative Error Rate ($Cmk_t$), which represents the accumulated error between the reception of two consecutive reliable packets and is computed as:
$Cmk_t = E_t + Cmk_{t-1}$
[079]
The receiver Rx also maintains a log of $E_t$ at each time interval $t$. Whenever the receiver Rx receives a reliable packet on periodic timer expiry, it computes the Predominant Error Rate $P_t$ as:
$P_t = \mathrm{Mode}(E_t, E_{t-1}, \ldots, E_{t-k})$
[080]
FIGS. 8A-8E show timing diagrams in different situations comprising: (a) Tx gets an ACK with periodic feedback from Rx signifying no loss at Rx and invalid 2D eye-gaze co-ordinates, resulting in sending the next frame with improved bitrate with the instantaneous foveal region defined with a center set as the center of the current frame since invalid 2D eye-gaze co-ordinates are received, (b) Tx gets an ACK with periodic feedback from Rx signifying loss at Rx above threshold and instantaneous 2D eye-gaze co-ordinates, resulting in sending the next frame with reduced bitrate with the current foveal region defined using the received eye-gaze co-ordinates, (c) ACK lost for a full frame making Tx send the next frame as a full frame with the foveal region defined with a center set as the center of the current frame since current 2D eye-gaze co-ordinates are not received, (d) ACK received before periodic timer expiry for a frame sent at the previous expiration of the periodic timer with error at Rx within threshold and instantaneous 2D eye-gaze co-ordinates of the end-user, whereupon the transmitter sends the next frame as a delta frame with the foveal region defined using a center as per the received 2D eye-gaze co-ordinates, (e) ACK belonging to a frame sent at periodic timer expiry is lost, so Tx sends the next frame with the 'periodic timer status' flag set with the foveal region defined with a center set as the center of the current frame since current 2D eye-gaze co-ordinates are not received, in accordance with some embodiments of the present disclosure.
Example Scenario:
[081]
The methods and systems of the present disclosure were implemented in C++ using the OpenCV and Boost libraries. The methods and systems captured the raw frames, and the entire encoding happens in the own software without using any special hardware accelerator or encoding in camera firmware. The system was built on Ubuntu 20.04 on a standard Intel Core i5 machine. The transmitter side was ported to an R-Pi3 which is housed in a telerobotic car designed for remote teleoperation. The present disclosure was designed to live stream both stored videos and a live camera feed. A parallel WebRTC implementation built on JS was created with a media channel for video streaming and a data channel for exchanging kinematic controls and feedback(s).
[082]
To realize dynamic shifting of the foveal region, the dynamic foveated rendering was implemented in Python. The WebRTC system was also designed to transmit both stored video and a live camera feed. The present disclosure was benchmarked against WebRTC. For the present disclosure, the maximum desired framerate for Delta frames was set as 15 fps, the frame rate for Basic frames drops to 5 fps, initially μ was set at 50% with a lowest limit of 15%, S was set at 50%, and the desired quality factor was set at 9. The default video resolution was set at 640 × 480 (VGA).
[083]
Standard video sequences have been used for stored video transmission. Initially, a full referential quality comparison (structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Video Quality Metric (VQM), Video Multimethod Assessment Fusion (VMAF)) was performed for stored video sequences under last mile channel degradation. To ensure a wide spectrum of different test cases comprising static FoV, dynamic FoV, high motion, low motion, etc., the Akiyo, Hall, Foreman and Tennis sequences respectively were chosen and all rescaled to 640 × 480 resolution. For testing the impact of last-mile impairments, both Tx and Rx were kept in the same WiFi network and the access point was moved 'far from and near to' the test setup in a U-shaped trajectory.
[084]
To enable a full referential comparison for objective quality, a stream recording mechanism was created in the receiver and transmitter pages in the WebRTC system. For WebRTC the samples were all WebM encoded. FIG. 9A is a graph showing a performance comparison between the present disclosure and WebRTC on full referential visual metrics for last mile channel degradations. In the case of WebRTC, the Rx side rendering starts to degrade much earlier as the Received Signal Strength Indicator (RSSI) starts to dip. In the interest of maintaining the bitrate the encoder compresses the video heavily, and at times the whole resolution of the video was reduced with the video freezing for several seconds and taking quite some time to recover despite recovery in RSSI, making it not so suitable for delay sensitive applications. The packet flow starts from a peak and dips as it approaches the lossy zone and, in some cases, there was practically silence. This behavior was also observed in the browser log of the packet loss reports and the selective ACKs reported from RTCP. What was more interesting was that, in some cases, though the video freezes, the Tx still keeps on pumping the data. These are cases where the GOP has gone out of sync due to loss of I-frames while, unaware of the application going out of sync, the transport was keeping the flow alive until feedback was received from RTCP. This was attributed to the GOP based encoded stream, as loss of an I-frame causes the entire GOP to be dropped at the receiver. But the present disclosure continued its decent performance and tried to regain lost frames through its zero overhead error concealment described earlier. There was a momentary freeze around the deep degradation of RSSI. But due to its agile frame-by-frame operation, it regained quickly as soon as RSSI started to rise just above -70 dB.
[085]
Then the system was deployed over a long-haul P2P setting. The transmitter on the Pi-car was put in Kolkata, India. The operator console was in Bangalore, India. Both units were put in private networks behind restrictive NATs which do not allow hole punching. This ensured that WebRTC would always have to route through a TURN server. A relay service was also created, collocated with the TURN server, for establishing NAT-independent P2P for the present disclosure. The TURN and the relay servers were replicated in three different AWS instances in Mumbai (India), Tokyo (Japan), and Ohio (US-east). This way, the performance was tracked under communication over Internet backbones running through different parts of the world. The experiment was conducted in a real teleoperation scenario.
[086]
FIG. 9B shows a setup for the long-haul telerobotic experiment for dynamic foveated rendering, in accordance with some embodiments of the present disclosure. As shown in FIG. 9B, a person in Kolkata threw a ball on the floor in a given trajectory, and the person in Bangalore had to track the ball by moving the Pi-car remotely. While the WebRTC system was equipped with a data channel for this purpose, a special control console was created for operating while observing feeds using the present disclosure. The control commands were also relayed through the same relay server. The experiment also involved 20 users aged between 25 and 45 years. Each user was told to do the 'ball-tracking' exercise 15 times in each sitting. Out of the 15 times, the traffic was routed through Mumbai, Tokyo and Ohio for 5 times each. The experiment was repeated for the same subjects over a span of 5 days at different times of the day (morning, afternoon, evening). Each time the stream was recorded for full referential measures of the videos and the operators in Bangalore were told to mark the experience on a scale of 5. Wireshark was used to measure the live bandwidth consumption for each experiment.
[087]
FIG. 9C is a graph showing a performance comparison of the present disclosure with WebRTC on a full-referential quality metric for the long-haul experimental setup shown in FIG. 9B. FIG. 9E is a graph showing a comparison of MOS scores for the present disclosure with WebRTC for a typical teleoperation scenario in the long-haul experimental setup shown in FIG. 9B, for streaming via relay servers located in Mumbai, Tokyo, and Ohio, in accordance with some embodiments of the present disclosure. The MOS ratings additionally accounted for the ease of operation while relying only on the video feed from Kolkata. This was followed by a comparative study
between the average scene-to-screen latency figures of the present disclosure and WebRTC. To measure the scene-to-screen latency, two smartphones in Bangalore and Kolkata were time-synchronized with millisecond-resolution clocks. The view of the clock was streamed from Kolkata. In Bangalore, the mobile clock was set next to the console and the reception was recorded, showing the time both on the screen and on the clock. FIG. 9F shows a setup for the latency measurement, in accordance with some embodiments of the present disclosure.
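By way of illustration, the following is a minimal sketch of how the scene-to-screen latency could be derived from such recordings, assuming the millisecond time visible on the streamed Kolkata clock and the synchronized Bangalore clock reading at render time have been extracted for each sample; the sample values and names are hypothetical.

```python
# Illustrative sketch only: scene-to-screen latency from synchronized clocks.
# Each sample pairs the millisecond time visible on the streamed Kolkata clock
# with the synchronized Bangalore clock reading at the moment of rendering.
# The sample values below are hypothetical placeholders.
from statistics import mean, pstdev

samples_ms = [
    # (time shown inside the rendered frame, local synchronized time at render)
    (1_000, 1_182),
    (2_000, 2_176),
    (3_000, 3_205),
]

latencies = [local - shown for shown, local in samples_ms]
print(f"average scene-to-screen latency: {mean(latencies):.1f} ms "
      f"(std dev {pstdev(latencies):.1f} ms)")
```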
[088]
FIG. 9G is a graph showing a latency comparison of the present disclosure with WebRTC for the experimental setup in FIG. 9F, for streaming via relay servers in Ohio, Tokyo, and Mumbai, in accordance with some embodiments of the present disclosure. The average latency was observed for the three different routes over the span of the experiment. As expected, the largest latency variation was found for Ohio, followed by Tokyo and then Mumbai. Mumbai was the least, as both peers were in India, while Ohio was the farthest. Although there were problems in synchronized operation due to the inherent photonic delay in the network, the regular freezing made the operation quite unrealizable in the case of WebRTC. Several cycles of the ball-throwing exercise were missed at Bangalore due to video freezes. This observation was very frequent for Ohio-routed traffic, as WebRTC was unable to adapt to the latency variation, leading to frequent freezes. At times the reception quality also deteriorated severely, making any kind of teleoperation impossible. Such problems were much less frequent in the case of the present disclosure, where quality degradation happened quite gracefully. Also, since the present disclosure does not perform any chroma subsampling, the original scene colour was preserved. There was a momentary reduction in reception rate when the end-to-end latency overshot, but packet losses caused by latency variation were concealed to the satisfaction of the users. As in the previous experiment, the performance of the present disclosure improved almost in tandem with the recovery of the network, whereas WebRTC took quite a long time to recover from on-screen freezing despite the network recovering.
[089]
Along with upholding the end-user QoE, the present disclosure is also bandwidth efficient. FIG. 9D is a graph showing a comparative bandwidth consumption at Tx and Rx for live streaming via relay servers at Mumbai, Ohio, and Tokyo, in accordance with some embodiments of the present disclosure. Intriguingly, bandwidth consumption consistently reduces from Mumbai-routed traffic to Ohio-routed traffic. The reason is that the transmitter reduces the transmission rate in sync with degrading channel conditions, and for Mumbai the degradation is the least of the three. This observation was made both for WebRTC and the present disclosure. However, the present disclosure performs much better than WebRTC in bandwidth consumption when routing via Mumbai and Tokyo. For the Ohio route, WebRTC consumes extremely low bandwidth because it pauses transmission for several seconds, leading to long freezes in the streamed video and deteriorating the end-user experience. Hence, only when routing via Ohio did WebRTC consume less bandwidth than the present disclosure, at the expense of the quality of the streamed video. The bandwidth consumed at the Rx to send feedback to the Tx, in contrast, increases in the reverse order, with Mumbai being the least and Ohio the highest. Even then, the present disclosure was shown to utilize much less channel bandwidth than WebRTC.
[090]
Next, the present disclosure was modified to support both fixed as well as dynamic foveated rendering. For fixed foveated rendering, the center of the foveal region was set as the center of the current video frame for the entire video stream, while the rest of the mechanisms remained the same as for dynamic foveated rendering in the present disclosure. The performance comparison between fixed and dynamic foveal regions was done under the long-haul experimental setup for streaming via the AWS instance hosted in Ohio, following the architecture described earlier. The experiment was done only for Ohio, as that route exhibited the most dynamic variations, and therefore the end-user QoE improvement attainable by dynamic foveated rendering over fixed foveated rendering would be most pronounced.
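For illustration only, the following is a minimal sketch of the single point of difference between the two modes being compared, namely the selection of the foveal center. The function name and types are assumptions and not part of the disclosed implementation.

```python
# Illustrative sketch only: selecting the foveal centre. In fixed foveated
# rendering the centre stays at the frame centre for the whole stream; in
# dynamic foveated rendering it follows the operator's latest valid gaze
# coordinates. Names and types are assumptions, not the disclosed interface.
from typing import Optional, Tuple

def foveal_center(frame_w: int, frame_h: int,
                  gaze_xy: Optional[Tuple[int, int]],
                  dynamic: bool) -> Tuple[int, int]:
    if dynamic and gaze_xy is not None:
        # Dynamic mode: the latest gaze point (already mapped to the viewing
        # canvas and validated) becomes the foveal centre.
        return gaze_xy
    # Fixed mode, or no valid gaze yet: fall back to the frame centre.
    return frame_w // 2, frame_h // 2
```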
[091]
FIG. 9I is a graph showing a comparative result of MOS scores for fixed and dynamic foveated rendering, in accordance with some embodiments of the present disclosure. The experimental setup for the same is illustrated in FIG. 9H. The demonstrator, located in Kolkata, was demonstrating a plan on a board, and the robo-car camera was streaming those visuals. The observer was supposed to gaze at that part of the board in the video feed where the demonstrator was pointing. The visuals in that part should be of better quality to maintain legibility for the observer despite channel degradation. The operator's camera was also switched on so that the GazOB module was able to track the gaze of the operator. Note that the operator-side camera in this case does not stream video. However, even if the video of the operator needs to be transmitted on the reverse path, GazOB does not latch onto the camera, and both gaze detection and video streaming can be done simultaneously. For fixed foveated rendering, in contrast, no eye-gaze tracking through the operator's camera using GazOB was performed; hence the foveal region was not dynamically shifted based on the operator's eye-gaze. The performance was evaluated using subjective measures involving 20 users. For each user, the experiment was first performed with fixed foveated rendering for 5 minutes, followed by the same streaming with GazOB turned on, without changing anything else. The users were asked to rate their QoE for both scenarios on a scale of 1 to 5 (with 1 being the lowest and 5 being the highest). The average MOS shows that the present disclosure with GazOB improves the end-user experience.
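By way of illustration, the following is a minimal sketch of the receiver-side gaze handling referred to above: the operator's screen-space gaze point is mapped into viewing-canvas coordinates, invalidated if it falls outside the canvas, and sent back together with the current error estimate on timer expiry. The GazOB interface shown (gaze_tracker.latest_gaze()) and the helper names are hypothetical placeholders.

```python
# Illustrative sketch only: receiver-side gaze handling. The operator's
# screen-space gaze point is mapped into viewing-canvas coordinates,
# invalidated if it falls outside the canvas, and sent back together with the
# current error estimate on periodic timer expiry. gaze_tracker.latest_gaze()
# and send_feedback() are hypothetical placeholders for the GazOB and
# feedback-channel interfaces.
from typing import Optional, Tuple

def canvas_gaze(screen_xy: Tuple[int, int],
                canvas_origin: Tuple[int, int],
                canvas_size: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    x = screen_xy[0] - canvas_origin[0]
    y = screen_xy[1] - canvas_origin[1]
    if 0 <= x < canvas_size[0] and 0 <= y < canvas_size[1]:
        return (x, y)   # gaze lies inside the viewing canvas
    return None         # outside the canvas: treat as invalid

def on_feedback_timer(gaze_tracker, canvas_origin, canvas_size,
                      error_estimate, send_feedback):
    gaze = canvas_gaze(gaze_tracker.latest_gaze(), canvas_origin, canvas_size)
    send_feedback({"error_estimate": error_estimate, "gaze": gaze})
```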
[092]
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
[093]
The embodiments of the present disclosure herein address the unresolved problem of synergizing context of the end-user with quality-of-experience of live video feed. An end-to-end system is implemented by the present disclosure for a QoE-centric efficient live streaming mechanism where the transmission logic has an implicit consideration for end-user context. The present disclosure shows how an intelligent frame-by-frame encoding mechanism, tightly coupled with suitable intelligent application protocol semantics, can provide much better performance in real-life interactive applications like telerobotics.
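Purely as an illustrative aid, the following is a highly simplified sketch of such a per-frame transmit loop in which the received feedback adapts the foveal region and the encoding scheme before each frame is foveation-compressed, packetized, and transmitted. The callables encode_foveated and packetize, the camera, feedback and transport objects, and the radius/scheme adaptation rule are all assumptions for illustration, not the scheme disclosed herein.

```python
# Illustrative sketch only: a highly simplified per-frame transmit loop in
# which feedback from the receiver (error estimate plus gaze coordinates)
# adapts the foveal region and the encoding scheme before each frame is
# foveation-compressed, packetized and sent. camera, feedback_channel,
# transport, encode_foveated and packetize are hypothetical placeholders; the
# radius/scheme adaptation rule below is an assumption for illustration only.
import time

def transmit_loop(camera, feedback_channel, transport,
                  encode_foveated, packetize, frame_interval_s=1 / 20):
    radius, center = 120, (320, 240)       # example initial foveal region
    scheme = "basic"
    while True:
        fb = feedback_channel.poll()       # latest (error_estimate, gaze_xy) or None
        if fb is not None:
            error_estimate, gaze_xy = fb
            center = gaze_xy               # foveal centre follows operator gaze
            # Shrink the foveal radius as the estimated error grows and fall
            # back to a basic (non-delta) frame when the error is high.
            radius = max(60, int(240 * (1.0 - min(error_estimate, 1.0))))
            scheme = "basic" if error_estimate > 0.5 else "delta"
        frame = camera.read()
        encoded = encode_foveated(frame, center, radius, scheme)
        for packet in packetize(encoded):
            transport.send(packet)
        time.sleep(frame_interval_s)       # pace transmission at the frame rate
```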
[094]
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
[095]
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
[096]
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[097]
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium”
should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD-ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[098]
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
We Claim:
1. A processor-implemented method (300), comprising the steps of:
defining for a current video frame of a live video stream transmitted by a transmitter using an acquisition device, via one or more hardware processors, (i) a foveal region as a circle with a dynamically determined radius and a dynamically determined center, and (ii) a peripheral region as a region of a frame surrounding the foveal region (302);
receiving at the transmitter, via the one or more hardware processors, (i) an instantaneous feedback of a video quality, and (ii) 2-dimensional (2-D) eye-gaze coordinates of an end-user from a receiver, on a periodic timer expiry (304);
employing, via the one or more hardware processors, a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user (306);
performing at the transmitter, via the one or more hardware processors, a foveation-centric spatial compression for both basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region (308);
packetizing, via the one or more hardware processors, the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique (310); and
transmitting, via the one or more hardware processors, the one or more packets, to the receiver, over the network communication channel at a predefined frame rate based on a chosen encoding scheme determined based on an instantaneous value of the error estimates contained in the instantaneous feedback (312).
2. The method as claimed in claim 1, comprising:
receiving, via the one or more hardware processors, the one or more packets, at the receiver (314);
reconstructing, via the one or more hardware processors, the one or more frames from the one or more packets, using the payload specific header (316); and
estimating, via the one or more hardware processors, an error rate of the current video frame, using the payload specific header and a number of packets received (318).
3. The method as claimed in claim 1, wherein the dynamic foveated rendering is
employed for rendering the one or more successive video frames of the live
video stream, by:
instantaneously retrieving the 2-D eye-gaze coordinates of the end-user with respect to entire receiver side screen, using an online eye-tracking module on a periodic basis (306a);
transforming the 2-D eye-gaze coordinates to corresponding co-ordinates with respect to a viewing canvas window (306b);
invalidating the 2-D eye-gaze co-ordinates if they go beyond the viewing canvas window (306c) and computing various error estimates for the current video frame based on errors incurred for current and past video frames (306c); and
transmitting (i) the computed error estimates of the current video frame as the instantaneous feedback of the video quality along with (ii) the 2-D eye-gaze co-ordinates, on the periodic timer expiry (306d).
4. The method as claimed in claim 1, wherein:
the center is dynamically determined based on the 2-D eye gaze coordinates of the end-user at the receiver; and
the radius is dynamically determined based on the instantaneous feedback of the video quality received on the periodic timer expiry.
5. A system (200) comprising:
a memory (202) storing instructions;
one or more input/output (I/O) interfaces (206);
an acquisition device (208); and
one or more hardware processors (204) coupled to the memory (202) via the
one or more I/O interfaces (206), wherein the one or more hardware
processors (204) are configured by the instructions to:
define for a current video frame of a live video stream transmitted by a transmitter using an acquisition device, (i) a foveal region as a circle with a dynamically determined radius and a dynamically determined center, and (ii) a peripheral region as a region of a frame surrounding the foveal region;
receive at the transmitter, (i) an instantaneous feedback of a video quality, and (ii) 2-dimensional (2-D) eye-gaze coordinates of an end-user from a receiver, on a periodic timer expiry;
employ a dynamic foveated rendering for rendering the one or more successive video frames of the live video stream, at the transmitter, by dynamically adapting the center of the foveal region based on the 2-D eye-gaze coordinates of the end-user;
perform at the transmitter, a foveation-centric spatial compression for both basic as well as Delta encoded frames, based on the dynamically determined radius and the center of the foveal region;
packetize the one or more encoded successive video frames of the live video stream, into one or more packets, using a packetization technique; and
transmit the one or more packets, to the receiver, over the network communication channel at a predefined frame rate based on a chosen encoding scheme determined based on an instantaneous value of the error estimates contained in the instantaneous feedback.
6. The system as claimed in claim 5, wherein the one or more hardware
processors (204) are configured by the instructions to:
receive the one or more packets, at the receiver;
reconstruct the one or more frames from the one or more packets, using the payload specific header; and
estimate an error rate of the current video frame, using the payload specific header and a number of packets received.
7. The system as claimed in claim 6, wherein the one or more hardware
processors (204) are configured by the instructions to employ the dynamic
foveated rendering for rendering the one or more successive video frames of
the live video stream, by:
instantaneously retrieving the 2-D eye-gaze coordinates of the end-user with respect to entire receiver side screen, using an online eye-tracking module on a periodic basis;
transforming the 2-D eye-gaze coordinates to corresponding co-ordinates with respect to a viewing canvas window;
invalidating the 2-D eye-gaze co-ordinates if they go beyond the viewing canvas window and computing various error estimates for the current video frame based on errors incurred for current and past video frames; and
transmitting (i) the computed error estimates of the current video frame as the instantaneous feedback of the video quality along with (ii) the 2-D eye-gaze co-ordinates, on the periodic timer expiry.
8. The system as claimed in claim 6, wherein the one or more hardware processors (204) are configured by the instructions to dynamically determine:
the center based on the 2-D eye gaze coordinates of the end-user; and
the radius based on the instantaneous feedback of the video quality received on the periodic timer expiry.