Abstract: An apparatus for presenting immersive media content is described. The apparatus obtains from a sender video data representing immersive content for a certain viewing direction and/or for a certain viewpoint, and displays the video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
Description
The present invention relates to the field of immersive media and 360° video. Embodiments of the inventive approach concern improvements for immersive media communication or immersive media content presentation including, for example, video on demand, VoD, streaming, live streaming, video conferencing or virtual reality, VR, applications such as online gaming applications. Embodiments of the inventive approach concern improvements for a 360° video communication including, for example, video conferencing or virtual reality, VR, applications such as online gaming applications.
Immersive media has been gaining a lot of attention in recent years. Key technologies for the presentation or representation of immersive media content may be categorized into
(i) 3DoF, three Degrees of Freedom, content, e.g. 360° videos,
(ii) 6DoF, six Degrees of Freedom, content, e.g., captured volumetric objects, like real objects, or volumetric videos of, e.g., real objects,
(iii) 3D objects generated using, e.g., computer graphics, like Computer-Generated Imagery, CGI, and consisting, e.g., of 3D meshes and 2D textures.
A combination of these technologies is also possible. For example, multiple volumetric objects may be presented to a user overlaid on a 360° video played in the background. The presented volumetric objects may be dynamic sequences or computer-generated 3D objects.
360° video has gained a lot of attention in recent years, and some products for 360° applications have appeared on the market. Standardization activities specify streaming and encoding of 360° video data. The work in this field primarily focuses on streaming of 360° video using the Hypertext Transfer Protocol, HTTP, or broadcast/broadband transmissions.
An enabling technology that has recently become the center of attention for various immersive applications is volumetric video. Volumetric videos capture the three-dimensional space in a realistic way and may provide a better immersion compared to 360° videos. Volumetric videos are also suitable for the representation of six degrees-of-freedom, 6DoF, content allowing a viewer to freely move inside the content and observe the volumetric objects from different points of view and distances.
Recently, various technologies have been emerging for capturing, processing, compressing and streaming of volumetric content. One prominent example in the compression domain is the Video-based Point Cloud Compression, V-PCC, standard. V-PCC encodes a point cloud into different video bitstreams, like texture, geometry and occupancy map, plus additional metadata. Applying existing video compression algorithms for point cloud compression brings very high compression efficiency and enables re-using the available hardware video decoders, especially on mobile devices.
Different from 360° videos, volumetric videos are usually represented in 3D formats, e.g. point clouds, meshes and the like, which may require different processing and transmission techniques for efficient delivery. When multiple volumetric objects, captured or computer-generated, are present in a scene, the positions and relations of the objects with each other may be described using a scene graph whose nodes represent the entities present in the scene. A scene description language, e.g. X3D, may be used to construct the scene graph that describes the objects. Delivering multiple 3D objects may increase the bandwidth requirements and require tight synchronization of the playback of the volumetric objects.
Video communication typically runs over RTP/RTCP (Real-Time Transport Protocol / Real-Time Transport Control Protocol). In RTP, access units, AUs, are split into RTP packets which contain a header and the content of the video. Before the actual transmission of the video, a negotiation phase typically occurs during which both end points, the server and the receiver, exchange capabilities and agree on the characteristics of the video and the modes to be used for the video communication. In order to describe characteristics of the transmitted bitstream as well as the transmission mode in use, the Session Description Protocol, SDP, may be used. The SDP may be used for a capabilities negotiation. For example, when considering a High Efficiency Video Coding, HEVC, bitstream, the server may send respective parameter sets, e.g., the sprop-parameter-sets, wherein the transmission may be out-of-band, i.e., may not be within the actual transmission of the video data. The client may accept the parameters as they are. An example of an SDP negotiation is given below, in which the parameter sets #0 may be stored and used by the encoder of the server and the decoder of the client, while the parameter sets #1 may be used by the encoder of the client and the decoder of the server.
Sender:
m=video 49170 RTP/AVP 98
a=rtpmap:98 H265/90000
a=fmtp:98 level-id=90; // Main profile, Level 3.0
sprop-vps=
sprop-sps=
sprop-pps=
Client:
m=video 49170 RTP/AVP 98
a=rtpmap:98 H265/90000
a=fmtp:98 level-id=90; // Main profile, Level 3.0
sprop-vps=
sprop-sps=
sprop-pps=
A further example of an SDP negotiation is given below; it is similar to the above example, however with a level downgrade. The parameter sets #0 are ignored and may come in-band, i.e., during the transmission of the actual video data.
Sender:
m=video 49170 RTP/AVP 98
a=rtpmap:98 H265/90000
a=fmtp:98 level-id=120; // Main profile, Level 4.0
sprop-vps=
sprop-sps=
sprop-pps=
Client:
m=video 49170 RTP/AVP 98
a=rtpmap:98 H265/90000
a=fmtp:98 level-id=90; // Main profile, Level 3.0
sprop-vps=
sprop-sps=
sprop-pps=
In addition to the media description as shown in the above examples, SDP may also be used for capabilities negotiation and the selection of different configurations. For example, RFC 5939 extends the SDP by defining an SDP Capability Negotiation, SDPCapNeg, solution that enables conveying not just the SDP for the actual configuration but also one or more alternative SDP session descriptions, also referred to as potential configurations. Dependent on whether the actual configuration or one of the potential configurations is chosen, it may be necessary that the server performs further processing so as to implement the selected configuration. Potential configurations are provided on top of the configuration included in the m-line of the SDP message. For example, in case a server wants to establish a secure RTP, SRTP, media stream but may also accept plain RTP, the server puts plain RTP in the actual configuration and SRTP as a potential configuration. The client may use the plain RTP in case the client does not support SRTP or does not understand SDPCapNeg.
SDPCapNeg defines additional SDP attributes to express capabilities and to negotiate configurations. More specifically, the following additional attributes may be used:
• “a=acap” defines how to list an attribute name and its value as a capability.
• “a=tcap” defines how to list transport protocols, for example the RTP audio/video profile, RTP/AVP, as capabilities.
• “a=pcfg” lists supported potential configurations, wherein a potential configuration may include attribute capabilities, transport capabilities or combinations thereof. The capabilities may be used to generate an alternative SDP session description that may be used by a conventional SDP procedure or by the negotiation procedure.
• “a=acfg” is an attribute that may be used by the client to identify a potential configuration provided by the server.
Below, an example of an SDP negotiation using SDPCapNeg is given.
v=0
o=- 25678 753849 IN IP4 192.0.2.1
s=
c=IN IP4 192.0.2.1
t=0 0
m=audio 53456 RTP/AVP 0 18
a=tcap:1 RTP/SAVP RTP/SAVPF
a=acap:1 crypto:1 AES_CM_128_HMAC_SHA1_32
 inline:NzB4d1BINUAvLEw6UzF3WSJ+PSdFcGdUJShpX1Zj|2^20|1:32
a=pcfg:1 t=1 a=1
a=pcfg:2 t=2 a=1
In the above example, two potential configurations are indicated by the attributes a=pcfg:1 and a=pcfg:2. The first potential configuration indicates t=1 and a=1, meaning that the first transport capability indicated by the attribute a=tcap, namely RTP/SAVP, Real-Time Transport Protocol/Secure Audio Video Profile, is used for the first potential configuration together with the attribute capability indicated in a=acap, namely crypto:1. In a similar way, the second potential configuration indicates t=2 and a=1, meaning that the second transport capability indicated in a=tcap is used, namely RTP/SAVPF, the secure audio/video profile with RTCP-based feedback, together with the attribute capability indicated in a=acap, namely crypto:1.
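For illustration, an answer in which the client selects the first potential configuration could, following the pattern defined in RFC 5939, look as follows, wherein the address, port and key material are mere placeholders:
v=0
o=- 24351 621814 IN IP4 192.0.2.2
s=
c=IN IP4 192.0.2.2
t=0 0
m=audio 54568 RTP/SAVP 0 18
a=crypto:1 AES_CM_128_HMAC_SHA1_32
 inline:NzB4d1BINUAvLEw6UzF3WSJ+PSdFcGdUJShpX1Zj|2^20|1:32
a=acfg:1 t=1 a=1
In such an answer, the selected transport, RTP/SAVP, appears in the m-line, the negotiated crypto attribute is included as an actual attribute, and a=acfg:1 identifies the chosen potential configuration.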
In addition to an SDP negotiation, which may be used for a configuration before the actual video transmission starts, the Real-Time Control Protocol, RTCP, which is typically used together with RTP, may be used as a feedback mechanism to control encoding modes
during the session. RTCP may typically be used for RTP stream synchronization, packet loss reporting, delay estimation and the like. It may also be used as a feedback channel to control video coding parameters. For example, in the HEVC payload format, the following parameters may be controlled:
• Picture Loss Indication, PLI: an indication of a loss of an undefined amount of coded video data belonging to one or more pictures.
• Slice Loss Indication, SLI: an indication of a loss of a number of CTBs in a CTB raster scan (CTB = Coding Tree Block).
• Reference Picture Selection Indication, RPSI: the selection of reference pictures to avoid error propagation.
• Full Intra Request, FIR: a message to force an encoder to send an IDR (IDR = Instantaneous Decoder Refresh).
RTCP control packets may be periodically exchanged among the end points of the video communication. In a point-to-point scenario, the RTP sender and the RTP receiver may send reciprocal sender reports, SR, and receiver reports, RR, to each other. The RTCP receiver report, RR, may indicate a reception quality and may include one or more of the following Quality of Service, QoS, metrics:
• cumulative number of packets lost,
• loss fraction,
• interarrival jitter,
• timing information.
The timing information may include
• the time stamp of the last SR received, LSR, and
• the delay since the last SR received, DLSR.
The sender may use the LSR and the DLSR fields to calculate a round trip time, RTT, between the sender and the receiver.
Fig. 1 illustrates an example of calculating the RTT using respective RTCP reports exchanged between a sender and a receiver. Initially, as indicated at 100, the sender sends a sender report SR to the receiver, including the timing information indicated at 102. The receiver, after receiving the SR, transmits its receiver report RR, as indicated at 104, and the receiver report includes the LSR and the DLSR fields. Upon receiving the RR, as indicated at 106, the sender determines the actual time, as indicated at 108, and calculates the RTT, as indicated at 110. More specifically, the sender determines the arrival time 110a, also referred to as the actual time, and subtracts DLSR 110b and LSR 110c from the actual time 110a so as to obtain the RTT. The calculated time is the network RTT and excludes any processing at the end points, for example a buffering delay at the receiver to smooth jitter or the like. The sender may use the known RTT so as to optimize the video encoding.
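Expressed as a minimal sketch in Python (the division by 65536 reflects that the LSR and DLSR fields carry times in units of 1/65536 seconds, i.e., the middle 32 bits of an NTP timestamp), the calculation performed at 110 reads:

def rtt_seconds(arrival_ntp16: int, lsr: int, dlsr: int) -> float:
    # Round trip time derived from an RTCP receiver report (cf. RFC 3550):
    # arrival_ntp16: arrival time of the RR at the sender (1/65536 s units),
    # lsr:           "last SR" field of the RR (same units),
    # dlsr:          "delay since last SR" field of the RR (same units).
    rtt_units = arrival_ntp16 - lsr - dlsr   # RTT = arrival time - LSR - DLSR
    return rtt_units / 65536.0               # convert to seconds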
Some applications, such as multicast inference of network characteristics, MINC, or voice-over-IP, VoIP, monitoring, require other and more detailed statistics. For example, RFC 3611 (RTCP extended reports) provides some extensions. For example, the receiver reference time report block extends the time stamps of the RTCP in such a way that non-senders may also send time stamps. In other words, a receiver may also estimate the RTT relative to other participants by sending the report and receiving DLRR reports as defined in RFC 3611 (DLRR = delay since the last RR received).
Typically, RTCP packets are not sent individually but are packed into compound packets for transmission and are sent at relatively large time intervals so that the overhead caused by the RTCP packets does not drastically increase; preferably, it is kept at around 5% of the session traffic. A simplified sketch of the resulting report interval calculation is given after the following list. In addition, a minimum interval, for example about 5 seconds, between RTCP reports may be recommended. However, some applications may require fast reporting, and to achieve a timely feedback, the extended RTP profile for RTCP-based feedback (RTP/AVPF) as defined in RFC 4585 introduces the concept of early RTCP messages as well as algorithms allowing for low-delay feedback in small multicast groups and preventing feedback implosion in large groups. There are three operation modes in RTP/AVPF, namely:
• immediate feedback,
• early RTCP mode,
• regular RTCP mode.
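As a simplified sketch of the regular reporting interval mentioned above (the randomization and reconsideration rules of RFC 3550 and RFC 4585 are omitted; only the bandwidth share of about 5% and the minimum interval of about 5 seconds are reflected):

def rtcp_interval_seconds(session_bw_bps: float, members: int,
                          avg_rtcp_packet_bytes: float,
                          rtcp_share: float = 0.05,
                          min_interval_s: float = 5.0) -> float:
    # Deterministic RTCP report interval, simplified from RFC 3550, Section 6.3:
    # the RTCP bandwidth is a small share of the session bandwidth, and the
    # interval grows with the number of members so that the share is not exceeded.
    rtcp_bw_bytes_per_s = rtcp_share * session_bw_bps / 8.0
    interval = members * avg_rtcp_packet_bytes / rtcp_bw_bytes_per_s
    return max(min_interval_s, interval)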
A receiver may send a feedback message earlier than the next regular RTCP reporting interval by using the immediate feedback mode. These techniques may be used to define application-specific messages that allow controlling or steering or influencing encoding techniques or decisions for delay-critical situations. An example for the use of RTCP feedback messages may be found in 3GPP TS 26.114 (IMS Media Handling and Interaction). 3GPP TS 26.114 specifies different “rtcp-fb” attribute values in the SDP so as to convey
(i) a video region-of-interest, ROI, arbitrarily selected by the receiver, and
(ii) a video ROI pre-defined by the sender and selected by the receiver, as stated in 3GPP TS 26.114, Section 7.3.7.
An RTCP feedback message indicating a certain position within a larger image, namely the ROI, may, for example, carry the position and the size of the ROI within the full picture.
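Purely for illustration, such a feedback payload could carry the position and the size of the requested ROI relative to the full picture; the field names, field widths and units in the following sketch are assumptions made for the sketch and do not reproduce the normative syntax defined in 3GPP TS 26.114:

import struct

def pack_roi_feedback(pos_x: int, pos_y: int, size_x: int, size_y: int) -> bytes:
    # Hypothetical ROI feedback payload: four 16-bit fields giving the top-left
    # corner and the size of the requested region, e.g. in per-mille of the full
    # picture width and height (layout assumed for illustration only).
    return struct.pack("!HHHH", pos_x, pos_y, size_x, size_y)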
There is a need to improve immersive media communication or immersive media content presentation. There is a need to further improve 360° video communication.
This need may be achieved by the subject-matter as defined in the present application and as defined in the appended claims.
Embodiments of the present invention are now described in further detail with reference to the accompanying drawings, in which:
Fig. 1 illustrates an example for calculating the RTT using respective RTCP reports exchanged between a sender and a receiver;
Fig. 2 is a schematic representation of a system for a 360° video communication between a sender and a receiver;
Fig. 3 illustrates an example of an environment, similar to Fig. 2, in which embodiments of the present invention may be applied and advantageously used;
Fig. 4 illustrates a viewport transmission delay during a 360° video communication in which the sender provides the bitstream including the video data matching the viewing direction at the receiver;
Fig. 5 illustrates a viewport transmission delay during a 360° video communication in which the sender provides the bitstream including the video data matching the viewing direction at the receiver when applying viewport prediction techniques;
Fig. 6 illustrates a viewport with a central area to which the receiver gazes and which has a higher resolution than the outer area surrounding the central area; and
Fig. 7 illustrates an example of a computer system on which units or modules as well as the steps of the methods described in accordance with the inventive approach may execute.
Embodiments of the present invention are now described in more detail with reference to the accompanying drawings, in which the same or similar elements have the same reference signs assigned.
In streaming applications, the 360° video data for the entire 360° video is provided by a server towards a client, e.g., over the air by a broadcast/broadband transmission or over a network, like the internet, using HTTP, and the client renders the received video data for display. Thus, the entire video content is provided to the receiver. In video communication applications, for example, video conferencing or virtual reality, VR, applications such as online gaming applications, in general only a part of a scene of the 360° video is presented to a user at the receiver, e.g., dependent on a viewing direction of the user. The client, on the basis of the viewing direction, processes the entire video data so as to display to a user that part of the scene of the 360° video corresponding to the user’s viewing direction. However, providing the entire video data for the 360° video to the receiver requires high transmission capabilities of the link between the sender and the receiver. Also, the receiver needs to have sufficient processing power to process the entire video data so as to present the desired part of a scene to a user. Since some of the 360° video communication applications may be real-time applications, the long duration or time associated with the transmission and/or processing of the entire data may be disadvantageous.
While the above-described protocols, like RTP, RTCP and SDP, provide mechanisms and signaling for the transmission of video data, the existing mechanisms and signaling are not specific to a 360° video communication, so that using the known mechanisms and signaling may be disadvantageous.
Embodiments of the present invention provide different aspects for improving immersive media communication or immersive media content presentation. Embodiments of the present invention provide different aspects for improving a 360° video communication.
Fig. 2 is a schematic representation of a system for an immersive media communication or a 360° video communication between a sender 200, also referred to as a server, and a receiver 202, also referred to as a client. The server 200 and the client 202 may communicate via a wired communication link or via a wireless communication link for transmitting a media stream 204 including video or picture and/or audio information. More specifically, the media stream 204 includes the 360° video data as provided by the server 200, for example in respective RTP packets. In addition, respective RTCP packets are included in the media stream as explained above. In accordance with embodiments of the present invention, the RTP, RTCP and SDP are extended so as to provide a mechanism and a signaling for an improved and more efficient immersive media communication or immersive media content presentation or for an improved and more efficient 360° video communication. The server 200 includes a signal processor 206, and the client 202 includes a signal processor 208. The client 202 as well as the server 200 may operate in accordance with the inventive teaching described herein below in more detail.
Receiver/Client for immersive media presentation
The present invention provides (see for example claim 1) an apparatus for presenting immersive media content, wherein the apparatus is to
obtain from a sender video data representing immersive content for a certain viewing direction and/or for a certain viewpoint, and
display the video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
In accordance with embodiments (see for example claim 2), to obtain the video data from the sender, the apparatus is to
- signal to the sender the certain viewing direction and/or the certain viewpoint, and
- receive from the sender the video data for the certain viewing direction and/or the certain viewpoint.
In accordance with embodiments (see for example claim 3), the apparatus comprises
a display device, e.g., a HMD, to display to a user the video data for the certain viewing direction and/or the certain viewpoint,
a sensor to detect the viewing direction and/or the viewpoint of the user, and
a processor to signal the detected viewing direction and/or the certain viewpoint to the sender and to process the received video data for display on the display device.
In accordance with embodiments (see for example claim 4), the apparatus is to receive from the sender for the certain viewing direction and/or the certain viewpoint of the video data representing the immersive content
(i) first video data rendered by the sender and representing a 2D viewport version of video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint, or
(ii) second video data not rendered by the sender and representing at least a part of the immersive content to be transmitted by the sender.
In accordance with embodiments (see for example claim 5), during an immersive media session, the apparatus is to receive from the sender the first video data or the second video data dependent on a latency between the receiver and the sender.
In accordance with embodiments (see for example claim 6), the latency comprises one or more of:
- an end-to-end latency, the end-to-end latency comprising one or more of a network latency, a rendering latency, and a coding latency,
- a motion-to-photon, MTP, latency, the MTP latency being a time from a detection of a change in the certain viewing direction and/or in the certain viewpoint at the receiver until displaying the rendered video data for the new viewing direction and/or the new viewpoint, wherein the MTP latency may be reduced by a prediction look ahead time.
In accordance with embodiments (see for example claim 7),
the apparatus is to receive from the sender the first video data, in case the latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to receive from the sender the second video data, in case the latency is above the certain threshold.
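As a minimal illustration of this threshold-based selection (the function name and the 20 ms default are assumptions of the sketch; the default merely reflects the 15 ms to 20 ms example above):

def choose_video_data(latency_ms: float, threshold_ms: float = 20.0) -> str:
    # At or below the threshold the sender-rendered 2D viewport version
    # (first video data) is used; above it the non-rendered immersive
    # content (second video data) is used.
    return "first (2D viewport)" if latency_ms <= threshold_ms else "second (immersive)"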
In accordance with embodiments (see for example claim 8),
in case the sender provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered immersive content in a second format, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching between the first and second formats, either immediately or at a certain time following the message, or
in case the sender provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non-rendered part of the video data representing the immersive content, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching between the first and second modes, either immediately or at a certain time following the message.
In accordance with embodiments (see for example claim 9), the certain threshold is one or more of:
- a network latency,
- an end-to-end latency,
- a maximum or an acceptable motion-to-photon, MTP latency,
- an MTP latency corresponding to a predefined Quality of Experience, QoE,
- MTP latency reduced by a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
In accordance with embodiments (see for example claim 10), at the beginning of an immersive media session, when the latency is still unknown, the apparatus is to accept only the second video data, until the latency is known or may be estimated reliably.
In accordance with embodiments (see for example claim 11),
at the beginning of an immersive media session, the apparatus is to negotiate with the sender, and
when negotiating with the sender, the apparatus is to receive from the sender, using for example the Session Description Protocol, SDP, one or more parameters of the video data representing the immersive content, e.g., Supplementary Enhancement Information, SEI, messages.
In accordance with embodiments (see for example claim 12),
when negotiating with the sender, the apparatus is to receive from the sender further an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the immersive media session, the apparatus is to receive respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a switching between the first video data and the second video data, the marked video data packet indicating
- an immediate switching between the first video data and the second video data, or
- a certain time until switching between the first video data and the second video data.
In accordance with embodiments (see for example claim 13), the apparatus comprises a predictor providing a viewport prediction and/or a viewpoint prediction, or the apparatus is to receive from the sender the viewport prediction and/or the viewpoint prediction, the viewport prediction indicating a change from the current viewing direction of the user to a new viewing direction of the user to happen after the lookahead time, and the viewpoint prediction indicating a change from the current viewpoint of the user to a new viewpoint of the user to happen after the lookahead time.
In accordance with embodiments (see for example claim 14), viewpoint changes are
- constrained, e.g., to multiple discrete viewpoints a user may access, or
- unconstrained, e.g., a user is allowed to fully navigate inside a virtual scene.
In accordance with embodiments (see for example claim 15), responsive to the viewport prediction and/or viewpoint prediction, the apparatus is to determine a specific viewport and/or viewpoint to be signaled, e.g., based on a prediction accuracy, the lookahead time and a Round Trip Time, RTT, and to signal the specific viewport and/or viewpoint to the sender using a feedback message, e.g., an RTCP feedback message.
In accordance with embodiments (see for example claim 16), at the beginning of an immersive media session, the apparatus is to negotiate with the sender a value of the certain threshold based on the prediction capabilities at the apparatus and/or the sender.
In accordance with embodiments (see for example claim 17), the prediction capabilities include a per-viewpoint prediction accuracy, wherein the per-viewpoint prediction accuracy may classify a viewpoint as harder to predict than another viewpoint, dependent on content characteristics of the viewpoints, like a number of salient areas which the user is most likely to view.
In accordance with embodiments (see for example claim 18),
the apparatus is to signal, e.g., via SDP, to the sender an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the apparatus performs the viewport prediction and/or viewpoint prediction, so as to allow the sender to decide whether the sender accepts the viewport prediction and/or viewpoint prediction from the apparatus and/or whether the sender performs the viewport prediction and/or the viewpoint prediction, and
the apparatus is to receive, e.g., via SDP, from the sender a signaling indicating whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or by the sender.
In accordance with embodiments (see for example claim 19),
the apparatus is to decide whether the sender or the apparatus performs the viewport prediction and/or the viewpoint prediction, and
the apparatus is to signal to the sender, e.g., via SDP, an indication whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or by the sender.
In accordance with embodiments (see for example claim 20), during the immersive media session,
in case the viewport prediction and/or viewpoint prediction is to be performed by the sender, the apparatus is to receive from the sender a request for certain parameters, e.g., viewing direction, viewpoint, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction and/or the viewpoint prediction, and
in case the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus, the apparatus is to receive from the sender certain prediction information to be used by the apparatus about certain viewing directions and/or certain viewpoints, e.g., based on the sender’s knowledge about content characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
In accordance with embodiments (see for example claim 21), in case a scene includes multiple viewpoints and the apparatus is to perform prediction, the apparatus is to analyze previous sensor data and determine whether it is more likely that a switch is to occur inside a current viewpoint or that the viewpoint will change.
In accordance with embodiments (see for example claim 22), the apparatus is to send, e.g., in an RTCP report, to the sender an error or drift indication that signals that the received video data for the certain viewing direction and/or the certain viewpoint does not match an actual viewing orientation and/or an actual viewpoint.
In accordance with embodiments (see for example claim 23), the apparatus is to signal a worst case drift or an average drift, wherein the average drift is signaled as the ratio of a predicted viewport or viewpoint and a real viewing orientation or viewpoint position over a certain time period, and the worst case drift is signaled as the maximum drift value attained over a certain time period.
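Purely as an illustration of these two quantities (the per-sample overlap ratio and the per-sample angular deviation used below are assumptions of this sketch, not definitions taken from the embodiments), the reported values could be derived as follows:

def drift_report(overlap_ratios: list, angular_deviations_deg: list) -> tuple:
    # Average drift: mean ratio of the predicted viewport/viewpoint to the real
    # viewing orientation over the reporting period.
    # Worst case drift: maximum per-sample drift value (here taken as an angular
    # deviation in degrees) attained over the same period.
    average_drift = sum(overlap_ratios) / len(overlap_ratios)
    worst_case_drift = max(angular_deviations_deg)
    return average_drift, worst_case_drift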
In accordance with embodiments (see for example claim 24), in case the drift is in a specific direction, e.g., the predicted viewport and/or the predicted viewpoint corresponds to a smaller movement in the predicted direction, the apparatus is to signal the direction of the drift.
In accordance with embodiments (see for example claim 25), in case the apparatus processes the first video data and the average drift is over a certain threshold for a certain time period or the worst case drift exceeds a certain threshold, the apparatus is to decide to switch from first video data to the second video data.
In accordance with embodiments (see for example claim 26), the apparatus is to use Foveated Rendering, and to signal respective parameters used in the Foveated Rendering algorithm to the sender, so as to allow the sender to provide a content matching the operation mode of the foveated rendering.
In accordance with embodiments (see for example claim 27), the parameters used in the Foveated Rendering algorithm comprise:
- a downgrading function used as a parameterized function of the quality based on a distance to the center of the viewing direction, or
- regions or distance thresholds that lead to downgrading of the quality for the content, or
- a temporal distribution of an eye motion area averaged over a time period, e.g. 95% of the time the viewing direction is gazing at an area covering 30% of the viewport, so as to allow the sender to adapt the transmission, e.g., encode outer parts, which are usually not gazed at by the user, with a lower pixel density.
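A minimal sketch of such a downgrading function, assuming a quality factor that is constant within an inner region around the gaze center and falls off linearly towards a minimum quality at an outer radius (all parameter names and default values are illustrative assumptions):

def quality_factor(distance_deg: float,
                   inner_radius_deg: float = 10.0,
                   outer_radius_deg: float = 40.0,
                   min_quality: float = 0.25) -> float:
    # Parameterized quality as a function of the distance to the center of the
    # viewing direction: full quality inside inner_radius_deg, linear downgrade
    # towards min_quality at outer_radius_deg and beyond.
    if distance_deg <= inner_radius_deg:
        return 1.0
    if distance_deg >= outer_radius_deg:
        return min_quality
    t = (distance_deg - inner_radius_deg) / (outer_radius_deg - inner_radius_deg)
    return 1.0 - t * (1.0 - min_quality)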
Sender/Server for immersive media presentation
The present invention provides (see for example claim 28) an apparatus for providing immersive media content to a receiver, wherein
the apparatus is to
receive from the receiver an indication of a certain viewing direction and/or a certain viewpoint for displaying the immersive content at the receiver, and
transmit to the receiver video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
In accordance with embodiments (see for example claim 29), the apparatus is to provide
(i) first video data representing a 2D viewport version of the certain viewing direction and/or the certain viewpoint of the video data representing the immersive content, or
(ii) second video data representing at least a part of the immersive content to be transmitted, wherein, in case the first video data is to be provided, render the video data, encode the rendered video data and transmit the encoded video data to the receiver, and
wherein, in case the second video data is to be provided, encode the video data, without rendering, encode one or more messages describing parameters of the immersive content, e.g., Supplementary Enhancement Information, SEI, messages, and transmit the encoded video data and the encoded one or more messages to the receiver.
In accordance with embodiments (see for example claim 30), the apparatus is to provide to the receiver the first video data or the second video data dependent on a latency between the receiver and the sender.
In accordance with embodiments (see for example claim 31),
in case the apparatus provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non-rendered video data representing the immersive content, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second modes, either immediately or at a certain time following the message, and
responsive to the request, switch the processing mode for the video and provide to the receiver video processed according to the new mode, and
in case the apparatus provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered part of the immersive content in a second format, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second formats, either immediately or at a certain time following the message, and
responsive to the request, send to the receiver video using the first format or the second format.
In accordance with embodiments (see for example claim 32), the latency comprises one or more of:
- an end-to-end latency, the end-to-end latency comprising one or more of a network latency, a rendering latency, and a coding latency,
- a motion-to-photon, MTP, latency, the MTP latency being a time from a detection of a change in the certain viewing direction and/or in the certain viewpoint at the receiver until displaying the rendered video data for the new viewing direction and/or the new viewpoint, wherein the MTP latency may be reduced by a prediction look ahead time.
In accordance with embodiments (see for example claim 33),
the apparatus is to provide to the receiver the first video data, in case the latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to provide to the receiver the second video data, in case the latency is above the certain threshold.
In accordance with embodiments (see for example claim 34), the certain threshold is one or more of:
- a network latency,
- an end-to-end latency,
- a maximum or an acceptable motion-to-photon, MTP latency,
- an MTP latency corresponding to a predefined Quality of Experience, QoE,
- MTP latency reduced by a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
In accordance with embodiments (see for example claim 35), at the beginning of an immersive media session, when the latency is still unknown, the apparatus is to provide only the second video data, until the latency is known or may be estimated reliably.
In accordance with embodiments (see for example claim 36),
at the beginning of an immersive media session, the apparatus is to negotiate with the receiver, and
when negotiating with the receiver, the apparatus is to send to the receiver, using for example the Session Description Protocol, SDP, one or more parameters of the immersive content, e.g., Supplementary Enhancement Information, SEI, messages.
In accordance with embodiments (see for example claim 37),
the one or more SDP messages of the sender further include an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the immersive media session, the apparatus is to send respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a switching between the first video data and the second video data, the marked video data packet indicating
- an immediate switching between the first video data and the second video data, or
- a certain time until switching between the first video data and the second video data.
In accordance with embodiments (see for example claim 38), the apparatus comprises a predictor providing a viewport prediction and/or a viewpoint prediction, or the apparatus is to receive from the receiver the viewport prediction and/or the viewpoint prediction, the viewport prediction and/or the viewpoint prediction indicating a change from the current viewing direction and/or viewpoint of the user of the receiver to a new viewing direction and/or viewpoint of the user to happen after the lookahead time.
In accordance with embodiments (see for example claim 39), viewpoint changes are
- constrained, e.g., to multiple discrete viewpoints a user may access, or
- unconstrained, e.g., a user is allowed to fully navigate inside a virtual scene.
In accordance with embodiments (see for example claim 40), responsive to the viewport prediction and/or the viewpoint prediction, the apparatus is to determine a specific viewport and/or viewpoint to be provided, e.g., based on a prediction accuracy, the look-ahead time and a Round Trip Time, RTT.
In accordance with embodiments (see for example claim 41), at the beginning of an immersive media session, the apparatus is to negotiate with the receiver a value of the certain threshold based on the prediction capabilities at the apparatus and/or the sender.
In accordance with embodiments (see for example claim 42), the prediction capabilities include a per-viewpoint prediction accuracy, wherein the per-viewpoint prediction accuracy may classify a viewpoint as harder to predict than another viewpoint, dependent on content characteristics of the viewpoints, like a number of salient areas which the user is most likely to view.
In accordance with embodiments (see for example claim 43), the apparatus is to
receive, e.g., via SDP, from the receiver an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the receiver performs the viewport prediction and/or the viewpoint prediction,
decide whether the apparatus accepts the viewport prediction and/or the viewpoint prediction from the receiver or whether the apparatus performs the viewport prediction and/or the viewpoint prediction, and
signal to the receiver, e.g., via SDP, whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or the receiver.
In accordance with embodiments (see for example claim 44),
in case the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus, the apparatus is to receive from the receiver certain parameters, e.g., viewing direction, viewpoint, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction and/or the viewpoint prediction, and
in case the viewport prediction and/or the viewpoint prediction is to be performed by the receiver, the apparatus is to send to the receiver certain prediction information to be used by the apparatus about certain viewing directions and/or viewpoints, e.g., based on the sender’s knowledge about content characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
CLAIMS
1. An apparatus for presenting immersive media content, wherein
the apparatus is to
obtain from a sender video data representing immersive content for a certain viewing direction and/or for a certain viewpoint, and
display the video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
2. The apparatus of claim 1, wherein, to obtain the video data from the sender, the apparatus is to
- signal to the sender the certain viewing direction and/or the certain viewpoint, and
- receive from the sender the video data for the certain viewing direction and/or the certain viewpoint.
3. The apparatus of claim 1 or 2, wherein the apparatus comprises
a display device, e.g., a HMD, to display to a user the video data for the certain viewing direction and/or the certain viewpoint,
a sensor to detect the viewing direction and/or the viewpoint of the user, and
a processor to signal the detected viewing direction and/or the certain viewpoint to the sender and to process the received video data for display on the display device.
4. The apparatus of any one of the preceding claims, wherein the apparatus is to receive from the sender for the certain viewing direction and/or the certain viewpoint of the video data representing the immersive content
(i) first video data rendered by the sender and representing a 2D viewport version of video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint, or
(ii) second video data not rendered by the sender and representing at least a part of the immersive content to be transmitted by the sender.
5. The apparatus of claim 4, wherein, during an immersive media session, the apparatus is to receive from the sender the first video data or the second video data dependent on a latency between the receiver and the sender.
6. The apparatus of claim 5, wherein the latency comprises one or more of:
- an end-to-end latency, the end-to-end latency comprising one or more of a network latency, a rendering latency, and a coding latency,
- a motion-to-photon, MTP, latency, the MTP latency being a time from a detection of a change in the certain viewing direction and/or in the certain viewpoint at the receiver until displaying the rendered video data for the new viewing direction and/or the new viewpoint, wherein the MTP latency may be reduced by a prediction look ahead time.
7. The apparatus of claim 5 or 6, wherein
the apparatus is to receive from the sender the first video data, in case the latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to receive from the sender the second video data, in case the latency is above the certain threshold.
8. The apparatus of claim 7, wherein
in case the sender provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered immersive content in a second format, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching between the first and second formats, either immediately or at a certain time following the message, or
in case the sender provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non- rendered part of the video data representing the immersive content, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching
between the first and second modes, either immediately or at a certain time following the message.
9. The apparatus of claim 7 or 8, wherein the certain threshold is one or more of:
- a network latency,
- an end-to-end latency,
- a maximum or an acceptable motion-to-photon, MTP latency,
- an MTP latency corresponding to a predefined Quality of Experience, QoE,
- MTP latency reduced by a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
10. The apparatus of any one of claims 5 to 9, wherein at the beginning of an immersive media session, when the latency is still unknown, the apparatus is to accept only the second video data, until the latency is known or may be estimated reliably.
11. The apparatus of any one of the preceding claims, wherein
at the beginning of an immersive media session, the apparatus is to negotiate with the sender, and
when negotiating with the sender, the apparatus is to receive from the sender, using for example the Session Description Protocol, SDP, one or more parameters of the video data representing the immersive content, e.g., Supplementary Enhancement Information, SEI, messages.
12. The apparatus of claim 11, wherein
when negotiating with the sender, the apparatus is to receive from the sender further an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the immersive media session, the apparatus is to receive respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a
switching between the first video data and the second video data, the marked video data packet indicating
- an immediate switching between the first video data and the second video data, or
- a certain time until switching between the first video data and the second video data.
13. The apparatus of any one of the preceding claims, wherein the apparatus comprises a predictor providing a viewport prediction and/or a viewpoint prediction, or the apparatus is to receive from the sender the viewport prediction and/or the viewpoint prediction, the viewport prediction indicating a change from the current viewing direction of the user to a new viewing direction of the user to happen after the lookahead time, and the viewpoint prediction indicating a change from the current viewpoint of the user to a new viewpoint of the user to happen after the lookahead time.
14. The apparatus of claim 13, wherein viewpoint changes are
- constrained, e.g., to multiple discrete viewpoints a user may access, or
- unconstrained, e.g., a user is allowed to fully navigate inside a virtual scene.
15. The apparatus of claim 12 or 13, wherein, responsive to the viewport prediction and/or viewpoint prediction, the apparatus is to determine a specific viewport and/or viewpoint to be signaled, e.g., based on a prediction accuracy, the lookahead time and a Round Trip Time, RTT, and to signal the specific viewport and/or viewpoint to the sender using a feedback message, e.g., an RTCP feedback message.
16. The apparatus of any one of claims 13 to 15, wherein, at the beginning of an immersive media session, the apparatus is to negotiate with the sender a value of the certain threshold based on the prediction capabilities at the apparatus and/or the sender.
17. The apparatus of claim 16, wherein the prediction capabilities include a per-viewpoint prediction accuracy, wherein the per-viewpoint prediction accuracy may classify a viewpoint as harder to predict than another viewpoint, dependent on content characteristics of the viewpoints, like a number of salient areas which the user is most likely to view.
18. The apparatus of claim 16 or 17, wherein
the apparatus is to signal, e.g., via SDP, to the sender an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the apparatus performs the viewport prediction and/or viewpoint prediction, so as to allow the sender to decide whether the sender accepts the viewport prediction and/or viewpoint prediction from the apparatus and/or whether the sender performs the viewport prediction and/or the viewpoint prediction, and
the apparatus is to receive, e.g., via SDP, from the sender a signaling indicating whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or by the sender.
19. The apparatus of any one of claims 16 to 18, wherein
the apparatus is to decide whether the sender or the apparatus performs the viewport prediction and/or the viewpoint prediction, and
the apparatus is to signal to the sender, e.g., via SDP, an indication whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or by the sender.
20. The apparatus of any one of claims 16 to 19, wherein, during the immersive media session,
in case the viewport prediction and/or viewpoint prediction is to be performed by the sender, the apparatus is to receive from the sender a request for certain parameters, e.g., viewing direction, viewpoint, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction and/or the viewpoint prediction, and
in case the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus, the apparatus is to receive from the sender certain prediction information to be used by the apparatus about certain viewing directions and/or certain viewpoints, e.g., based on the sender’s knowledge about content
characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
21. The apparatus of any one of claims 13 to 20, wherein, in case a scene includes multiple viewpoints and the apparatus is to perform prediction, the apparatus is to analyze previous sensor data and determine whether it is more likely that a switch is to occur inside a current viewpoint or that the viewpoint will change.
22. The apparatus of any one of the preceding claims, wherein the apparatus is to send, e.g., in an RTCP report, to the sender an error or drift indication that signals that the received video data for the certain viewing direction and/or the certain viewpoint does not match an actual viewing orientation and/or an actual viewpoint.
23. The apparatus of claim 22, wherein the apparatus is to signal a worst case drift or an average drift, wherein the average drift is signaled as the ratio of a predicted viewport or viewpoint and a real viewing orientation or viewpoint position over a certain time period, and the worst case drift is signaled as the maximum drift value attained over a certain time period.
24. The apparatus of claim 22 or 23, wherein, in case the drift is in a specific direction, e.g., the predicted viewport and/or the predicted viewpoint corresponds to a smaller movement in the predicted direction, the apparatus is to signal the direction of the drift.
25. The apparatus of any one of claims 22 to 24, wherein, in case the apparatus processes the first video data and the average drift is over a certain threshold for a certain time period or the worst case drift exceeds a certain threshold, the apparatus is to decide to switch from first video data to the second video data.
26. The apparatus of any one of the preceding claims, wherein the apparatus is to use Foveated Rendering, and to signal respective parameters used in the Foveated Rendering algorithm to the sender, so as to allow the sender to provide a content matching the operation mode of the foveated rendering.
27. The apparatus of claim 26, wherein the parameters used in the Foveated Rendering algorithm comprise:
- a downgrading function used as a parameterized function of the quality based on a distance to the center of the viewing direction, or
- regions or distance thresholds that lead to downgrading of the quality for the content, or
- a temporal distribution of an eye motion area averaged over a time period, e.g.
95% of the time the viewing direction is gazing at an area covering 80% of the viewport, so as to allow the sender to adapt the transmission, e.g., encode outer parts, which are usually not gazed at by the user, with a lower pixel density.
28. An apparatus for providing immersive media content to a receiver, wherein
the apparatus is to
receive from the receiver an indication of a certain viewing direction and/or a certain viewpoint for displaying the immersive content at the receiver, and
transmit to the receiver video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
29. The apparatus of claim 28, wherein the apparatus is to provide
(i) first video data representing a 2D viewport version of the certain viewing direction and/or the certain viewpoint of the video data representing the immersive content, or
(ii) second video data representing at least a part of the immersive content to be transmitted,
wherein, in case the first video data is to be provided, render the video data, encode the rendered video data and transmit the encoded video data to the receiver, and
wherein, in case the second video data is to be provided, encode the video data, without rendering, encode one or more messages describing parameters of the immersive content, e.g., Supplementary Enhancement Information, SEI, messages, and transmit the encoded video data and the encoded one or more messages to the receiver.
30. The apparatus of claim 29, wherein the apparatus is to provide to the receiver the first video data or the second video data dependent on a latency between the receiver and the sender.
31. The apparatus of claim 30, wherein
in case the apparatus provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non-rendered video data representing the immersive content, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second modes, either immediately or at a certain time following the message, and
responsive to the request, switch the processing mode for the video and provide to the receiver video processed according to the new mode, and
in case the apparatus provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered part of the immersive content in a second format, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second formats, either immediately or at a certain time following the message, and
responsive to the request, send to the receiver video using the first format or the second format.
32. The apparatus of claim 30, wherein the latency comprises one or more of:
- an end-to-end latency, the end-to-end latency comprising one or more of a network latency, a rendering latency, and a coding latency,
- a motion-to-photon, MTP, latency, the MTP latency being a time from a detection of a change in the certain viewing direction and/or in the certain viewpoint at the receiver until displaying the rendered video data for the new viewing direction
and/or the new viewpoint, wherein the MTP latency may be reduced by a prediction look ahead time.
33. The apparatus of any one of claims 30 to 32, wherein
the apparatus is to provide to the receiver the first video data, in case the latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to provide to the receiver the second video data, in case the latency is above the certain threshold.
34. The apparatus of claim 33, wherein the certain threshold is one or more of:
- a network latency,
- an end-to-end latency,
- a maximum or an acceptable motion-to-photon, MTP latency,
- an MTP latency corresponding to a predefined Quality of Experience, QoE,
- MTP latency reduced by a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
35. The apparatus of any one of claims 31 to 34, wherein at the beginning of an immersive media session, when the latency is still unknown, the apparatus is to provide only the second video data, until the latency is known or may be estimated reliably.
36. The apparatus of any one of claims 28 to 35, wherein
at the beginning of an immersive media session, the apparatus is to negotiate with the receiver, and
when negotiating with the receiver, the apparatus is to send to the receiver, using for example the Session Description Protocol, SDP, one or more parameters of the immersive content, e.g., Supplementary Enhancement Information, SEI, messages.
37. The apparatus of claim 36, wherein
the one or more SDP messages of the sender further include an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the immersive media session, the apparatus is to send respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a switching between the first video data and the second video data, the marked video data packet indicating
- an immediate switching between the first video data and the second video data, or
- a certain time until switching between the first video data and the second video data.
38. The apparatus of any one of claims 28 to 37, wherein the apparatus comprises a predictor providing a viewport prediction and/or a viewpoint prediction, or the apparatus is to receive from the receiver the viewport prediction and/or the viewpoint prediction, the viewport prediction and/or the viewpoint prediction indicating a change from the current viewing direction and/or viewpoint of the user of the receiver to a new viewing direction and/or viewpoint of the user to happen after the lookahead time.
39. The apparatus of claim 38, wherein viewpoint changes are
- constrained, e.g., to multiple discrete viewpoints a user may access, or
- unconstrained, e.g., a user is allowed to fully navigate inside a virtual scene.
40. The apparatus of claim 38 or 39, wherein, responsive to the viewport prediction and/or the viewpoint prediction, the apparatus is to determine a specific viewport and/or viewpoint to be provided, e.g., based on a prediction accuracy, the look-ahead time and a Round Trip Time, RTT.
41. The apparatus of any one of claims 38 to 40, wherein, at the beginning of an immersive media session, the apparatus is to negotiate with the receiver a value of the certain threshold based on the prediction capabilities at the apparatus and/or the receiver.
42. The apparatus of claim 41, wherein the prediction capabilities include a per-viewpoint prediction accuracy, wherein the per-viewpoint prediction accuracy may classify a viewpoint as harder to predict than another viewpoint depending on content characteristics of the viewpoints, like the number of salient areas which the user is most likely to view.
43. The apparatus of claim 41 or 42, wherein the apparatus is to
receive, e.g., via SDP, from the receiver an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the receiver performs the viewport prediction and/or the viewpoint prediction,
decide whether the apparatus accepts the viewport prediction and/or the viewpoint prediction from the receiver or whether the apparatus performs the viewport prediction and/or the viewpoint prediction, and
signal to the receiver, e.g., via SDP, whether the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus and/or the receiver.
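As a non-limiting illustration of the negotiation in claim 43, the sketch below composes hypothetical SDP attribute lines for exchanging a prediction accuracy, a lookahead time and the side that performs the prediction; the attribute names and their syntax are assumptions of this sketch.

```python
def prediction_offer(drift_deg_per_s: float, lookahead_ms: int) -> str:
    """Receiver advertises its prediction accuracy (as drift over time) and lookahead."""
    return f"a=viewport-prediction:lookahead={lookahead_ms};drift={drift_deg_per_s}"

def prediction_answer(predict_at_sender: bool) -> str:
    """Sender signals which side shall perform the prediction."""
    return "a=viewport-prediction-side:" + ("sender" if predict_at_sender else "receiver")

print(prediction_offer(0.5, 200))
print(prediction_answer(True))
```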
44. The apparatus of any one of claims 41 to 43, wherein
in case the viewport prediction and/or the viewpoint prediction is to be performed by the apparatus, the apparatus is to receive from the receiver certain parameters, e.g., viewing direction, viewpoint, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction and/or the viewpoint prediction, and
in case the viewport prediction and/or the viewpoint prediction is to be performed by the receiver, the apparatus is to send to the receiver certain prediction information to be used by the receiver about certain viewing directions and/or viewpoints, e.g., based on the sender’s knowledge about content characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
45. The apparatus of any one of claims 38 to 44, wherein, in case a scene includes multiple viewpoints and the sender is to perform prediction, the apparatus is to receive feedback from the receiver about the current viewing direction and position inside the scene, and to combine the feedback with statistics of other users or content
information, e.g. at which spatial area of a certain viewport the users are more likely to change their viewpoints, for determining whether it is more likely that a switch is to occur inside a current viewpoint or that the viewpoint will change.
46. The apparatus of any one of claims 28 to 45, wherein
the apparatus is to receive, e.g., in an RTCP report, from the receiver an error or drift indication that signals that the received video data for the certain viewing direction and/or the certain viewpoint does not match an actual viewing orientation and/or an actual viewpoint at the receiver, and
responsive to the error or drift, the apparatus is to adapt, e.g., a margin or prefetch used.
47. The apparatus of claim 46, wherein the apparatus is to receive a worst case drift or an average drift, wherein the average drift is signaled as the ratio of a predicted viewport or viewpoint and a real viewing orientation or viewpoint position over a certain time period, and the worst case drift is signaled as the maximum drift value attained over a certain time period.
48. The apparatus of claim 46 or 47, wherein, in case the drift is in a specific direction, e.g., the predicted viewport and/or the predicted viewpoint corresponds to a smaller movement in the predicted direction, the apparatus is to receive the direction of the drift and to adapt its prediction, e.g., by adding a prefetch in the direction of the mismatched prediction.
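As a non-limiting illustration of the drift feedback of claims 46 to 48, the sketch below derives an average drift as a ratio of predicted and actual viewports over a reporting period and a worst-case drift as the maximum drift within that period. The yaw-only overlap metric and the 90° field of view are simplifying assumptions.

```python
def overlap_ratio(pred_yaw_deg: float, actual_yaw_deg: float, fov_deg: float = 90.0) -> float:
    """Fraction of the actual viewport covered by the predicted viewport (1.0 = perfect)."""
    return max(0.0, 1.0 - abs(pred_yaw_deg - actual_yaw_deg) / fov_deg)

def drift_report(predicted, actual, fov_deg: float = 90.0):
    ratios = [overlap_ratio(p, a, fov_deg) for p, a in zip(predicted, actual)]
    average_drift = sum(ratios) / len(ratios)        # ratio averaged over the period
    worst_case_drift = max(1.0 - r for r in ratios)  # maximum drift within the period
    return average_drift, worst_case_drift

print(drift_report(predicted=[10, 12, 15, 20], actual=[11, 14, 19, 28]))
```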
49. The apparatus of any one of claims 28 to 48, wherein the receiver uses Foveated Rendering, and the apparatus is to receive respective parameters used in the Foveated Rendering algorithm from the receiver, and provide a content matching the operation mode of the foveated rendering.
50. The apparatus of any one of the preceding claims, wherein the immersive content includes one or more of:
- 3DoF, three Degrees of Freedom, content, e.g. one or more 360° videos,
- 6DoF, six Degrees of Freedom, content, e.g. captured volumetric objects, like real objects, or volumetric videos of, e.g., real objects,
- 3D objects generated, e.g., using computer graphics, like Computer-Generated Imagery, CGI.
51. The apparatus of any one of the preceding claims, wherein the immersive content to be transmitted by the sender or received by the receiver includes one or more of:
- in case of a 360° video or a 360° graphic, a projected video transmission, e.g., a part of the full 360° video transmitted using a particular projection,
- in case of a volumetric object or a volumetric video, a 3D data transmission for the full volumetric object or for a part of the volumetric object in a certain 3D format, e.g., as a point cloud or as a mesh,
- in case of 3D computer graphics, e.g., games, a complete scene, e.g., multiple volumetric objects, in a certain 3D format such as multiple point clouds or meshes.
52. The apparatus of any one of the preceding claims, wherein the immersive content is to be identified by
- a certain Supplementary Enhancement Information, SEI, parameter, e.g., the sprop-sei parameter,
- an indication of a particular video codec or profile, or
- by an additional attribute in the Session Description Protocol, SDP, e.g., “videoformat 3DoF” or “videoformat 6DoF” or “videoformat Volumetric”.
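As a non-limiting illustration of claim 52, the sketch below builds an SDP media section that identifies the immersive content by the "videoformat" attribute named in the claim; the payload type, codec, port and the exact attribute syntax are assumptions of this sketch.

```python
def immersive_media_description(videoformat: str, codec: str = "H265", pt: int = 96) -> str:
    return "\r\n".join([
        f"m=video 49170 RTP/AVP {pt}",
        f"a=rtpmap:{pt} {codec}/90000",
        f"a=videoformat:{videoformat}",  # e.g. 3DoF, 6DoF or Volumetric
    ])

print(immersive_media_description("6DoF"))
```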
53. The apparatus of any one of the preceding claims, wherein, in case the immersive content represents a volumetric scene including one or more volumetric objects, the immersive content includes a plurality of bitstreams for describing respective properties of the volumetric object, e.g., at least a texture bit stream and a geometry bitstream, or a compressed mesh bit stream and a texture bitstream.
54. The apparatus of claim 53, wherein the use of the different bitstreams is signaled using, e.g., the SDP, wherein the SDP may contain information about the different kinds of bitstreams and possible variants of the bitstreams.
55. The apparatus of claim 53 or 54, wherein the plurality of bitstreams describing respective properties of a volumetric object are associated with each other using, e.g., the grouping mechanisms of the SDP.
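As a non-limiting illustration of claims 53 to 55, the sketch below groups a texture bitstream and a geometry bitstream of one volumetric object using the SDP grouping mechanism; the group semantics token, media identifiers, ports and payload types are assumptions of this sketch.

```python
def volumetric_object_sdp(object_id: str) -> str:
    return "\r\n".join([
        f"a=group:VPCC {object_id}_tex {object_id}_geo",  # assumed semantics token
        "m=video 49170 RTP/AVP 96",
        f"a=mid:{object_id}_tex",   # texture bitstream
        "m=video 49172 RTP/AVP 97",
        f"a=mid:{object_id}_geo",   # geometry bitstream
    ])

print(volumetric_object_sdp("obj1"))
```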
56. A system, comprising:
a sender including an apparatus of any one of claims 28 to 55, and
a receiver including an apparatus of any one of claims 1 to 27 or 50 to 55.
57. A method for presenting immersive media content, the method comprising:
obtaining, by a receiver, from a sender video data representing immersive content for a certain viewing direction and/or for a certain viewpoint, and
displaying, at the receiver, the video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
58. A method for providing immersive media content, the method comprising:
receiving, at a sender, an indication of a certain viewing direction and/or a certain viewpoint for displaying the immersive content at the receiver, and
transmitting, by the sender, to the receiver video data representing the immersive content for the certain viewing direction and/or for the certain viewpoint.
59. The method of claim 57 or 58, wherein the receiver includes an apparatus of any one of claims 1 to 27 or 50 to 55 and/or wherein the sender includes an apparatus of any one of claims 28 to 55.
60. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 57 to 59.
61. An apparatus for a 360° video communication with a sender, wherein
the apparatus is to
obtain from the sender video data dependent on a certain viewing direction of a 360° video, and
display the video data representing the certain viewing direction of the 360° video.
62. The apparatus of claim 61, wherein, to obtain the video data from the sender, the apparatus is to
signal to the sender the certain viewing direction of the 360° video, and
receive from the sender the video data for the certain viewing direction of the 360° video.
63. The apparatus of claim 61 or 62, wherein the apparatus comprises
a display device, e.g., an HMD, to display to a user the video data for the certain viewing direction of the 360° video,
a sensor to detect the viewing direction of the user, and
a processor to signal the detected viewing direction to the sender and to process the received video data for display on the display device.
64. The apparatus of any one of claims 61 to 63, wherein the apparatus is to request from the sender for the certain viewing direction of the 360° video (i) first video data rendered by the sender and representing a 2D viewport version of the certain viewing direction of the 360° video or (ii) second video data not rendered by the sender and representing at least a part of the 360° video to be transmitted by the sender using a certain projection.
65. The apparatus of claim 64, wherein, during the session of the 360° video communication, the apparatus is to request from the sender the first video data or the second video data dependent on an end-to-end latency between the receiver and the sender.
66. The apparatus of claim 65, wherein the end-to-end latency is a time from a detection of a change in the certain viewing direction at the receiver until displaying the rendered video data for the new viewing direction.
67. The apparatus of claim 65 or 66, wherein
the apparatus is to request from the sender the first video data, in case the end-to-end latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to request from the sender the second video data, in case the end-to-end latency is above the certain threshold.
68. The apparatus of claim 67, wherein
in case the sender provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered part of the 360° video in a second format, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching between the first and second formats, either immediately or at a certain time following the message, or
in case the sender provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non-rendered part of the 360° video, the apparatus is to send to the sender a message, e.g., an RTCP message, requesting switching between the first and second modes, either immediately or at a certain time following the message.
69. The apparatus of claim 67 or 68, wherein the certain threshold is a maximum or acceptable motion-to-photon, MTP, latency yielding, e.g., a predefined Quality of Experience, QoE, or the MTP latency plus a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
70. The apparatus of any one of claims 65 to 69, wherein at the beginning of a session of the 360° video communication, when the end-to-end latency is still unknown, the apparatus is to accept only the second video data, until the end-to-end latency is known or may be estimated reliably.
71. The apparatus of any one of claims 61 to 70, wherein
at the beginning of a session of the 360° video communication, the apparatus is to negotiate with the sender, and
when negotiating with the sender, the apparatus is to receive from the sender, using for example the Session Description Protocol, SDP, one or more parameters of the 360° video, e.g., Supplementary Enhancement Information, SEI, messages indicating one or more of a projection type, a rotation and region-wise packing, RWP, constraints.
72. The apparatus of claim 71, wherein, when negotiating with the sender, using for example the SDP, the apparatus is to
include one or more additional parameters of the 360° video according to the capabilities of the apparatus, and/or
modify or remove, according to the capabilities of the apparatus, one or more of the parameters of the 360° video, and
transmit to the sender the parameters of the 360° video so as to allow the sender to encode the projected video according to the transmitted parameters.
73. The apparatus of claim 72, wherein
the one or more of the parameters of the 360° video comprise Region-Wise Packing, RWP, parameters, and the apparatus is to include one or more new elements into the SDP message so as to constrain RWP formats to the capabilities of the apparatus,
wherein the RWP formats may indicate, for example, one or more of the following constraints:
• rwp-max-num-packed-regions indicating a maximum number of packed regions,
• rwp-min-proj-region-width/height indicating a minimum width/height of a projected region,
• rwp-min-packed-region-width/height indicating a minimum width/height of a packed region,
• rwp-allowed-transform-types indicating allowed transform types,
• rwp-guard-band-flag-constraint indicating a guard band around a packed region,
• rwp-max-scaling-factor indicating a maximum scaling factor for a packed region.
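As a non-limiting illustration of claim 73, the sketch below carries the listed RWP constraints as key/value pairs of an SDP fmtp line; the parameter names are taken from the claim, whereas placing them in an fmtp line and all example values are assumptions of this sketch.

```python
rwp_constraints = {
    "rwp-max-num-packed-regions": 8,
    "rwp-min-proj-region-width": 256,
    "rwp-min-proj-region-height": 256,
    "rwp-min-packed-region-width": 128,
    "rwp-min-packed-region-height": 128,
    "rwp-allowed-transform-types": "0,1",   # assumed transform type indices
    "rwp-guard-band-flag-constraint": 0,
    "rwp-max-scaling-factor": 2,
}

fmtp_line = "a=fmtp:96 " + ";".join(f"{k}={v}" for k, v in rwp_constraints.items())
print(fmtp_line)
```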
74. The apparatus of any one of claims 71 to 73, wherein
when negotiating with the sender, the apparatus is to receive from the sender further an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the session of the 360° video communication, the apparatus is to receive respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a switching between the first video data and the second video data, the marked video data packet indicating
an immediate switching between the first video data and the second video data, or
a certain time until switching between the first video data and the second video data.
75. The apparatus of any one of claims 61 to 74, wherein the apparatus comprises a viewport predictor providing a viewport prediction, or the apparatus is to receive from the sender the viewport prediction, the viewport prediction indicating a change from the current viewing direction of the user to a new viewing direction of the user to happen after the lookahead time.
76. The apparatus of claim 75, wherein, responsive to the viewport prediction, the apparatus is to determine a specific viewport to be requested, e.g., based on a prediction accuracy, the lookahead time and a Round Trip Time, RTT, and to signal the specific viewport to the sender using a feedback message, e.g., an RTCP feedback message.
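As a non-limiting illustration of claim 76, the sketch below widens the requested viewport as a function of the prediction accuracy (expressed as drift per second), the lookahead time and the round trip time; the widening formula and the numeric values are assumptions of this sketch.

```python
def requested_viewport_deg(fov_deg: float, drift_deg_per_s: float,
                           lookahead_ms: float, rtt_ms: float) -> float:
    """Add a margin covering the expected drift until the requested content arrives."""
    horizon_s = (lookahead_ms + rtt_ms) / 1000.0
    margin_deg = drift_deg_per_s * horizon_s
    return fov_deg + 2.0 * margin_deg  # margin on both sides of the viewport

print(requested_viewport_deg(fov_deg=90.0, drift_deg_per_s=30.0,
                             lookahead_ms=200.0, rtt_ms=50.0))  # 105.0
```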
77. The apparatus of claim 75 or 76, wherein, at the beginning of a session of the 360° video communication, the apparatus is to negotiate with the sender a value of the certain threshold based on the prediction capabilities at the apparatus and/or the sender.
78. The apparatus of claim 77, wherein
the apparatus is to signal, e.g., via SDP, to the sender an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the apparatus performs the viewport prediction, so as to allow the sender to decide whether the sender accepts the viewport prediction from the apparatus or whether the sender performs the viewport prediction, and
the apparatus is to receive, e.g., via SDP, from the sender a signaling indicating whether the viewport prediction is to be performed by the apparatus or by the sender.
79. The apparatus of claim 77, wherein
the apparatus is to decide whether the sender or the apparatus performs the viewport prediction, and
the apparatus is to signal to the sender, e.g., via SDP, an indication of whether the viewport prediction is to be performed by the apparatus or by the sender.
80. The apparatus of any one of claims 77 to 79, wherein, during the session of the 360° video communication,
in case the viewport prediction is to be performed by the sender, the apparatus is to receive from the sender a request for certain parameters, e.g., viewing direction, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction, and
in case the viewport prediction is to be performed by the apparatus, the apparatus is to receive from the sender certain prediction information to be used by the apparatus about certain viewing directions or certain regions, e.g., based on the sender’s
knowledge about content characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
81. The apparatus of any one of claims 61 to 80, wherein
the video data matches the viewport size exactly, thereby matching the field of view, FoV, of the display device, or
the video data includes a margin area around the viewport, the margin area being a certain percentage of the viewport.
82. The apparatus of claim 81, wherein, during a session of the 360° video communication, if the viewport size includes the margin, the apparatus is to receive an indication of a lens/distortion parameter used for rendering to assist the apparatus in cropping/warping the viewport.
83. The apparatus of claim 81 or 82, wherein, at the beginning of a session of the 360° video communication, the apparatus is to negotiate with the sender the dimension and/or the margin area of the video data.
84. The apparatus of any one of claims 61 to 83, wherein the apparatus is to send, e.g., in an RTCP report, to the sender an error or drift indication that signals that the received video data for the certain viewing direction does not match an actual viewing orientation at the apparatus.
85. The apparatus of claim 84, wherein the apparatus is to signal a worst case drift or an average drift, wherein the average drift is signaled as the ratio of a predicted viewport and a real viewing orientation over a certain time period, and the worst case drift is signaled as the maximum drift value attained over a certain time period.
86. The apparatus of claim 84 or 85, wherein, in case the drift is in a specific direction, e.g., the predicted viewport corresponds to a smaller movement in the predicted direction, the apparatus is to signal the direction of the drift.
87. The apparatus of any one of claims 84 to 86, wherein, in case the apparatus processes the first video data and the average drift is over a certain threshold for a certain time period or the worst case drift exceeds a certain threshold, the apparatus is to decide to switch from first video data to the second video data.
88. The apparatus of any one of claims 61 to 87, wherein the apparatus is to use Foveated Rendering, and to signal respective parameters used in the Foveated Rendering algorithm to the sender, so as to allow the sender to provide a content matching the operation mode of the foveated rendering.
89. The apparatus of claim 88, wherein the parameters used in the Foveated Rendering algorithm comprise:
a downgrading function used as a parameterized function of the quality based on a distance to the center of the viewing direction, or
regions or distance thresholds that lead to downgrading of the quality for the content, or
a temporal distribution of an eye motion area averaged over a time period, e.g. 95% of the time the viewing direction is gazing at an area covering 80% of the viewport, so as to allow the sender to adapt the transmission, e.g., encode outer parts, which are usually not gazed at by the user, with a lower pixel density.
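As a non-limiting illustration of claim 89, the sketch below parameterizes a quality downgrading function over the distance to the centre of the viewing direction and collects the temporal eye-motion statistic mentioned in the claim; the linear ramp and all numeric values are assumptions of this sketch.

```python
def quality_factor(distance_deg: float, inner_deg: float = 10.0,
                   outer_deg: float = 40.0, min_quality: float = 0.25) -> float:
    """Full quality inside the foveal region, ramping down to min_quality at outer_deg."""
    if distance_deg <= inner_deg:
        return 1.0
    if distance_deg >= outer_deg:
        return min_quality
    t = (distance_deg - inner_deg) / (outer_deg - inner_deg)
    return 1.0 - t * (1.0 - min_quality)

foveation_parameters = {
    "downgrade_function": "linear",       # assumed parameterization
    "inner_deg": 10.0,
    "outer_deg": 40.0,
    "gaze_time_share": 0.95,              # "95% of the time ..."
    "gaze_area_share_of_viewport": 0.80,  # "... gazing at 80% of the viewport"
}

print(quality_factor(5.0), quality_factor(25.0), quality_factor(60.0))
```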
90. An apparatus for a 360° video communication with a receiver, wherein
the apparatus is to
receive from the receiver an indication of a certain viewing direction of a 360° video at the receiver, and
transmit video data for the certain viewing direction of a 360° video to the receiver.
91. The apparatus of claim 90, wherein the apparatus is to
provide (i) first video data representing a 2D viewport version of the certain viewing direction of the 360° video or (ii) second video data representing at least a part of the 360° video to be transmitted using a certain projection,
in case the first video data is to be provided, render the video data, encode the rendered video data and transmit the encoded video data to the receiver, and
in case the second video data is to be provided, encode the video data using a certain projection, without rendering, encode one or more messages describing parameters of the 360° video, e.g., Supplementary Enhancement Information, SEI, messages indicating a projection type, a rotation and region-wise packing, RWP, constraints, and transmit the encoded video data and the encoded one or more messages to the receiver.
92. The apparatus of claim 91, wherein the apparatus is to provide to the receiver the first video data or the second video data dependent on an end-to-end latency between the receiver and the sender.
93. The apparatus of claim 92, wherein
in case the apparatus provides the first video data and the second video data using the same format and provides for dynamic switching from a first processing mode for providing the 2D viewport version to a second processing mode for providing the non-rendered part of the 360° video, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second modes, either immediately or at a certain time following the message, and
responsive to the request, switch the processing mode for the video and provide to the receiver video processed according to the new mode, and
in case the apparatus provides the first video data representing the 2D viewport version in a first format and the second video data representing the non-rendered part of the 360° video in a second format, the apparatus is to
receive from the receiver a request message, e.g., an RTCP message, for switching between the first and second formats, either immediately or at a certain time following the message, and
responsive to the request, send to the receiver video using the first format or the second format.
94. The apparatus of claim 92 or 93, wherein the end-to-end latency is a time from a detection of a change in the certain viewing direction at the receiver until displaying the rendered video data for the new viewing direction.
95. The apparatus of any one of claims 92 to 94, wherein
the apparatus is to provide to the receiver the first video data, in case the end-to-end latency is below or at a certain threshold, e.g., 15ms to 20ms, and
the apparatus is to provide to the receiver the second video data, in case the end-to-end latency is above the certain threshold.
96. The apparatus of claim 95, wherein the certain threshold is a maximum or acceptable motion-to-photon, MTP, latency yielding, e.g., a predefined Quality of Experience, QoE, or the MTP latency plus a prediction lookahead time indicative of a temporal capability of a predictor to look into the future.
97. The apparatus of any one of claims 93 to 96, wherein at the beginning of a session of the 360° video communication, when the end-to-end latency is still unknown, the apparatus is to provide only the second video data, until the end-to-end latency is known or may be estimated reliably.
98. The apparatus of any one of claims 90 to 97, wherein
at the beginning of a session of the 360° video communication, the apparatus is to negotiate with the receiver, and
when negotiating with the receiver, the apparatus is to send to the receiver, using for example the Session Description Protocol, SDP, one or more parameters of the 360° video, e.g., Supplementary Enhancement Information, SEI, messages indicating one or more of a projection type, a rotation and region-wise packing, RWP, constraints.
99. The apparatus of claim 98, wherein, when negotiating with the receiver, using for example the SDP, the apparatus is to
receive from the receiver one or more additional parameters of the 360° video according to the capabilities of the receiver, and/or one or more parameters of the 360° video modified or reduced in number, according to the capabilities of the receiver, and
schedule encoding the projected video according to the received parameters.
100. The apparatus of claim 99, wherein
the one or more of the parameters of the 360° video comprise Region-Wise Packing, RWP, parameters, and the apparatus is to include one or more new elements into the SDP message so as to constrain RWP formats to the capabilities of the apparatus,
wherein the RWP formats may indicate, for example, one or more of the following constraints:
• rwp-max-num-packed-regions indicating a maximum number of packed regions,
• rwp-min-proj-region-width/height indicating a minimum width/height of a projected region,
• rwp-min-packed-region-width/height indicating a minimum width/height of a packed region,
• rwp-allowed-transform-types indicating allowed transform types,
• rwp-guard-band-flag-constraint indicating a guard band around a packed region.
101. The apparatus of claim 99 or 100, wherein
the one or more SDP messages of the sender further include an indication that the video data or format is dynamically switchable between (i) the first video data, and (ii) the second video data, and
during the session of the 360° video communication, the apparatus is to send respective video data packets, like Real Time Transport Protocol, RTP, packets, wherein a video data packet may be marked, e.g., using an RTP header extension, so as to indicate a switching between the first video data and the second video data, the marked video data packet indicating
an immediate switching between the first video data and the second video data, or
a certain time until switching between the first video data and the second video data.
102. The apparatus of any one of claims 90 to 101, wherein the apparatus comprises a viewport predictor providing a viewport prediction, or the apparatus is to receive from the receiver the viewport prediction, the viewport prediction indicating a change from the current viewing direction of the user of the receiver to a new viewing direction of the user to happen after the lookahead time.
103. The apparatus of claim 102, wherein, responsive to the viewport prediction, the apparatus is to determine a specific viewport to be provided, e.g., based on a prediction accuracy, the look-ahead time and a Round Trip Time, RTT.
104. The apparatus of claim 102 or 103, wherein, at the beginning of a session of the 360° video communication, the apparatus is to negotiate with the receiver a value of the certain threshold based on the prediction capabilities at the apparatus and/or the receiver.
105. The apparatus of claim 104, wherein the apparatus is to
receive, e.g., via SDP, from the receiver an accuracy, e.g., in the form of a drift over time or an overlay of prediction and reality, and the lookahead time with which the receiver performs the viewport prediction,
decide whether the apparatus accepts the viewport prediction from the receiver or whether the apparatus performs the viewport prediction, and
signal to the receiver, e.g., via SDP, whether the viewport prediction is to be performed by the apparatus or the receiver.
106. The apparatus of claim 104 or 105, wherein
in case the viewport prediction is to be performed by the apparatus, the apparatus is to receive from the receiver certain parameters, e.g., viewing direction, reporting interval, speed or acceleration, required at the sender for performing the viewport prediction, and
in case the viewport prediction is to be performed by the receiver, the apparatus is to send to the receiver certain prediction information to be used by the receiver about certain viewing directions or certain regions, e.g., based on the sender’s knowledge about content characteristics, e.g., picture domain saliency analysis, statistical analysis of user behavior, a-priori knowledge of script scenes.
107. The apparatus of any one of claims 90 to 106, wherein
the first video data matches the viewport size exactly, thereby matching the field of view, FoV, of the display device, or
the first video data includes a margin area around the viewport, the margin area being a certain percentage of the viewport.
108. The apparatus of claim 107, wherein, during a session of the 360° video communication, if the viewport size includes the margin, the apparatus is to send to the receiver an indication of a lens/distortion parameter used for rendering to assist the receiver in cropping/warping the viewport.
109. The apparatus of claim 107 or 108, wherein the apparatus is to negotiate with the receiver the dimension and/or the margin area of the first video data.
110. The apparatus of any one of claims 90 to 109, wherein
the apparatus is to receive, e.g., in an RTCP report, from the receiver an error or drift indication that signals that the received video data for the certain viewing direction does not match an actual viewing orientation at the receiver, and
responsive to the error or drift, the apparatus is to adapt, e.g., a margin or prefetch used, or to change the viewing orientation specific projection, e.g., to have a bigger or smaller high-quality content coverage.
111. The apparatus of claim 110, wherein the apparatus is to receive a worst case drift or an average drift, wherein the average drift is signaled as the ratio of a predicted viewport and a real viewing orientation over a certain time period, and the worst case drift is signaled as the maximum drift value attained over a certain time period.
112. The apparatus of claim 110 or 111, wherein, in case the drift is in a specific direction, e.g., the predicted viewport corresponds to a smaller movement in the predicted direction, the apparatus is to receive the direction of the drift and to adapt its prediction, e.g., by adding a prefetch in the direction of the mismatched prediction.
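As a non-limiting illustration of claims 110 to 112, the sketch below extends the prefetch margin on the side of the viewport in which the reported drift occurred; the per-side margin representation and the one-to-one mapping from drift to extra margin are assumptions of this sketch.

```python
def adapt_prefetch(prefetch_deg: dict, drift_deg: float, direction: str) -> dict:
    """prefetch_deg: per-side margins, e.g. {'left': 5, 'right': 5, 'up': 5, 'down': 5}."""
    adapted = dict(prefetch_deg)
    adapted[direction] = adapted.get(direction, 0.0) + drift_deg
    return adapted

margins = {"left": 5.0, "right": 5.0, "up": 5.0, "down": 5.0}
print(adapt_prefetch(margins, drift_deg=4.0, direction="right"))
# {'left': 5.0, 'right': 9.0, 'up': 5.0, 'down': 5.0}
```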
113. The apparatus of any one of claims 90 to 102, wherein the receiver uses Foveated Rendering, and the apparatus is to receive respective parameters used in the Foveated Rendering algorithm from the receiver, and provide a content matching the operation mode of the foveated rendering.
114. A 360° video communication system, comprising:
a sender including an apparatus of any one of claims 90 to 113, and
a receiver including an apparatus of any one of claims 61 to 89.
115. A method for a 360° video communication, the method comprising:
obtaining, by a receiver, video data from a sender dependent on a certain viewing direction of a 360° video at the receiver, and
displaying, at the receiver, the video data representing the certain viewing direction of the 360° video.
116. A method for a 360° video communication, the method comprising:
receiving, at a sender, an indication from a receiver of a certain viewing direction of a 360° video at the receiver, and
transmitting, by the sender, video data for the certain viewing direction of a 360° video to the receiver.
117. The method of claim 115 or 116, wherein the receiver includes an apparatus of any one of claims 61 to 89 and/or wherein the sender includes an apparatus of any one of claims 90 to 113.
118. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 115 to 117.