Abstract: There are disclosed techniques, systems, methods and instructions for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment. In one example, the system (102) comprises at least one media video decoder configured to decode video signals from video streams for the representation of VR, AR, MR or 360-degree video environment scenes to a user. The system comprises at least one audio decoder (104) configured to decode audio signals (108) from at least one audio stream (106). The system (102) is configured to request (112) at least one audio stream (106) and/or one audio element of an audio stream and/or one adaptation set from a server (120) on the basis of at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data (110).
Description
Introduction
In a Virtual Reality (VR) environment, or similarly in Augmented Reality (AR), Mixed Reality (MR) or 360-degree video environments, the user may typically visualise full 360-degree content using, for example, a Head Mounted Display (HMD) and listen to it over headphones (or similarly over loudspeakers, including correct rendering dependent on the user's position).
In a simple use case, the content is authored in such a way that only one audio/video scene (i.e., a 360-degree video, for example) is reproduced at a certain moment in time. The audio/video scene has a fixed location (e.g., a sphere with the user positioned in the center), and the user may not move in the scene, but may only rotate his head in various directions (yaw, pitch, roll). In this case, different video and audio are played back (different viewports are displayed) to the user based on the orientation of his head.
While for video the video content is delivered for the entire 360-degree scene, together with metadata describing the rendering process (e.g., stitching information, projection mapping, etc.), and is selected based on the current user's viewport, for audio the content is the same for the entire scene. Based on the metadata, the audio content is adapted to the current user's viewport (e.g., an audio object is rendered differently based on the viewport/user orientation information). It should be noted that 360-degree content refers to any type of content that comprises more than one viewing angle at the same moment in time, from which the user may choose (for example by his head orientation or by using a remote-control device).
In a more complex scenario, when the user may move in the VR scene, or "jump" from one scene to the next, the audio content might also change (e.g., audio sources which are not audible in one scene may become audible in the next scene - "a door is opened"). With existing systems, complete audio scenes may be encoded into one stream and, if needed, into additional streams (dependent on the main stream). Such systems are known as Next Generation Audio systems (e.g., MPEG-H 3D Audio). Examples of such use cases may include:
• Example 1 : The user selects to enter a new room, and the entire audio/video scene changes
• Example 2: The user moves in the VR scene, opens the door and walks through, requiring a transition of audio from one scene to the next scene
For the purpose of describing this scenario, the notion of Discrete Viewpoints in space is introduced, as discrete locations in space (or in the VR environment) for which different audio/video content is available.
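To make the notion concrete, a client might hold a small map from discrete viewpoints to the content available at each of them. The sketch below is illustrative only; the `Viewpoint` type, the room names and the stream identifiers are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Viewpoint:
    """A discrete location in the VR environment with its own audio/video content."""
    name: str
    position: tuple       # (x, y, z) in scene coordinates
    audio_streams: tuple  # identifiers of the streams available here

# Hypothetical content map: each discrete viewpoint has its own audio scene.
VIEWPOINTS = {
    "room_a": Viewpoint("room_a", (0.0, 0.0, 0.0), ("room_a_main",)),
    "room_b": Viewpoint("room_b", (5.0, 0.0, 0.0), ("room_b_main", "room_b_ambience")),
}

def content_for(viewpoint_name: str) -> tuple:
    """Return the audio streams to request when the user is at the given viewpoint."""
    return VIEWPOINTS[viewpoint_name].audio_streams
```

When the user "jumps" from `room_a` to `room_b`, the client simply looks up the new viewpoint and requests its streams.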
The "straight-forward" solution is to have a real-time encoder which changes the encoding (number of audio elements, spatial information, etc.) based on feedback from the playback device about user position/orientation. This solution would imply, for example in a streaming environment, a very complex communication between a client and server:
• The client (which usually is assumed to be using only simple logic) would require advanced mechanisms for conveying not only requests for different streams, but also complex information about encoding details that would enable processing of the right content based on the user's position.
• The Media Server is usually pre-populated with different streams (formatted in a specific way that allows for "segment-wise" delivery) and the main function of the server is to provide information about the available streams and cause their delivery when requested. In order to enable scenarios that allow the encoding based on the feedback from the playback device, the Media Server would require advanced communication links with multiple live media encoders, and the capacity to create all the signalling information on the fly (e.g., Media Presentation Description) that could change in real time.
Although such a system could be imagined, its complexity and computational requirements are beyond the functionality and features of equipment and systems available today, or even of those that will be developed in the next decades.
Alternatively, the content representing the complete VR environment ("the complete world") could be delivered all the time. This would solve the problem, but would require an enormous bitrate that is beyond the capacity of the available
communications links.
This is too complex for a real-time environment, and in order to enable such use cases using available systems, alternative solutions are proposed that enable this functionality with low complexity.
2. Terminology and Definitions
The following terminology is used in the technical field:
• Audio Elements: audio signals that may be represented, for example, as audio objects, audio channels, scene-based audio (Higher Order Ambisonics - HOA), or any combination thereof.
• Region-of-Interest (ROI): a region of the video content (or of the environment displayed or simulated) that is of interest to the user at one moment in time. This is commonly, for example, a region on a sphere, or a polygonal selection from a 2D map. The ROI identifies a specific region for a particular purpose, defining the borders of an object under consideration.
• User position information: location information (e.g., x, y, z coordinates), orientation information (yaw, pitch, roll), direction and speed of movement, etc.
• Viewport: Part of the spherical video that is currently displayed and viewed by the user.
• Viewpoint: the center point of the Viewport.
• 360-degree video (also known as immersive video or spherical video): represents, in the context of this document, video content that contains more than one view (i.e., viewport) in one direction at the same moment in time. Such content may be created, for example, using an omnidirectional camera or a collection of cameras. During playback the viewer has control of the viewing direction.
• Media Presentation Description (MPD): a syntax, e.g. XML, containing information about media segments, their relationships, and the information necessary to choose between them.
• Adaptation Sets contain a media stream or a set of media streams. In the simplest case, one Adaptation Set contains all audio and video for the content, but to reduce bandwidth, each stream can be split into a different Adaptation Set. A common case is to have one video Adaptation Set and multiple audio Adaptation Sets (one for each supported language). Adaptation Sets can also contain subtitles or arbitrary metadata.
• Representations allow an Adaptation Set to contain the same content
encoded in different ways. In most cases, Representations will be provided in multiple bitrates. This allows clients to request the highest quality content that they can play without waiting to buffer. Representations can also be encoded with different codecs, allowing support for clients with different supported codecs.
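As an illustration of how a client might exploit multiple-bitrate Representations, the sketch below picks the highest bitrate that fits the measured bandwidth, falling back to the lowest one otherwise. The function name and the `(id, bitrate)` tuple layout are assumptions made for this example, not part of DASH.

```python
def select_representation(representations, available_bandwidth_bps):
    """Pick the highest-bitrate representation that fits the measured bandwidth.

    representations: list of (representation_id, bitrate_bps) tuples.
    Falls back to the lowest bitrate if none fits, so playback never stalls
    waiting for a representation the link cannot sustain.
    """
    playable = [r for r in representations if r[1] <= available_bandwidth_bps]
    if playable:
        return max(playable, key=lambda r: r[1])
    return min(representations, key=lambda r: r[1])
```

For example, with representations at 64, 128 and 256 kbit/s and a measured 200 kbit/s link, the 128 kbit/s representation is selected.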
In the context of this application the notion of Adaptation Sets is used more generically, sometimes actually referring to the Representations. Also, the media streams (audio/video streams) are generally encapsulated first into Media Segments, which are the actual media files played by the client (e.g., a DASH client). Various formats can be used for the Media Segments, such as the ISO Base Media File Format (ISOBMFF), which is similar to the MPEG-4 container format, or the MPEG-2 Transport Stream (TS). The encapsulation into Media Segments and into different Representations/Adaptation Sets is independent of the methods described herein; the methods apply to all the various options.
Additionally, the description of the methods in this document is centred around a DASH Server-Client communication, but the methods are generic enough to work with other delivery environments, such as MMT, MPEG-2 TS, DASH-ROUTE, File Format for file playback etc.
In general terms, an adaptation set is at a higher layer with respect to a stream and may comprise metadata (e.g., associated to positions). A stream may comprise a plurality of audio elements. An audio scene may be associated to a plurality of streams delivered as part of a plurality of adaptation sets.
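The layering just described (an audio scene associated to streams delivered in adaptation sets, each stream comprising audio elements) can be sketched as nested data types. All class and field names below are illustrative, not taken from any standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AudioElement:
    """An audio object/channel/HOA component, optionally tied to a position."""
    element_id: str
    position: Optional[Tuple[float, float, float]] = None  # where it is audible

@dataclass
class AudioStream:
    """A stream may comprise a plurality of audio elements."""
    stream_id: str
    elements: List[AudioElement] = field(default_factory=list)

@dataclass
class AdaptationSet:
    """Higher layer than a stream; may carry metadata such as positions."""
    set_id: str
    streams: List[AudioStream] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

@dataclass
class AudioScene:
    """A scene associated to streams delivered in several adaptation sets."""
    scene_id: str
    adaptation_sets: List[AdaptationSet] = field(default_factory=list)
```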
3. Current solutions
Current solutions are:
[1]. ISO/IEC 23008-3:2015, Information technology - High efficiency coding and media delivery in heterogeneous environments - Part 3: 3D audio
[2]. N16950, Study of ISO/IEC DIS 23000-20 Omnidirectional Media Format
The current solutions are limited to providing an independent VR experience at one fixed location, which allows the user to change his orientation but not to move in the VR environment.
Summary
According to an embodiment a system for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment may be configured to receive video and audio streams to be reproduced in a media consumption device, wherein the system may comprise: at least one media video decoder configured to decode video signals from video streams for the representation of VR, AR, MR or 360-degree video environment scenes to a user, and at least one audio decoder configured to decode audio signals from at least one audio stream, wherein the system may be configured to request at least one audio stream and/or one audio element of an audio stream and/or one adaptation set from a server on the basis of at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data.
According to an aspect the system may be configured to provide the server with the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data so as to obtain the at least one audio stream and/or one audio element of an audio stream and/or one adaptation set from the server.
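A minimal sketch of how a client could convey this data when requesting a stream is given below; the endpoint and the query-parameter names (`viewport`, `pos`, `orient`) are hypothetical, not defined by any standard.

```python
from urllib.parse import urlencode

def build_stream_request(base_url, viewport, position, orientation):
    """Encode the user's current viewport, virtual position and head
    orientation into a stream-request URL (parameter names are hypothetical).

    position: (x, y, z) virtual positional data
    orientation: (yaw, pitch, roll) head-orientation data
    """
    params = {
        "viewport": viewport,                                 # visible-region id
        "pos": ",".join(f"{c:.2f}" for c in position),        # x,y,z
        "orient": ",".join(f"{a:.2f}" for a in orientation),  # yaw,pitch,roll
    }
    return f"{base_url}?{urlencode(params)}"
```

The server can then answer with the audio stream and/or adaptation set appropriate for that viewport and position.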
An embodiment may be configured so that at least one scene is associated to at least one audio element, each audio element being associated to a position and/or area in the visual environment where the audio element is audible, so that different audio streams are provided for different user's positions and/or viewports and/or head orientations and/or movement data and/or interaction metadata and/or virtual positional data in the scene.
According to another aspect the system may be configured to decide whether at least one audio element of an audio stream and/or one adaptation set is to be reproduced for the current user's viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual position in the scene, and wherein the system may be configured to request and/or to receive the at least one audio element at the current user's virtual position.
According to an aspect the system may be configured to predictively decide whether at least one audio element of an audio stream and/or one adaptation set will become relevant and/or audible based on at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data, and wherein the system may be configured to request and/or to receive the at least one audio element and/or audio stream and/or adaptation set at a particular user's virtual position before the predicted user's movement and/or interaction in the scene, wherein the system may be configured to reproduce the at least one audio element and/or audio stream, when received, at the particular user's virtual position after the user's movement and/or interaction in the scene.
An embodiment of the system may be configured to request and/or to receive the at least one audio element at a lower bitrate and/or quality level, at the user's virtual position before a user's movement and/or interaction in the scene, wherein the system may be configured to request and/or to receive the at least one audio element at a higher bitrate and/or quality level, at the user's virtual position after the user's movement and/or interaction in the scene.
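The prefetch policy above can be sketched as a small decision function: elements only *predicted* to become audible are fetched at a low bitrate, while elements audible at the current position get the high bitrate. The element sets and bitrate values are illustrative.

```python
def bitrate_for_element(element_id, predicted_elements, current_elements,
                        low_bps=32_000, high_bps=128_000):
    """Prefetch policy sketch.

    predicted_elements: elements expected to become audible after a predicted
                        user movement/interaction (prefetched at low bitrate).
    current_elements:   elements audible at the current virtual position
                        (requested at high bitrate).
    Returns the bitrate to request, or None if the element is not needed.
    """
    if element_id in current_elements:
        return high_bps
    if element_id in predicted_elements:
        return low_bps
    return None
```

Once the user actually moves, the element graduates from `predicted_elements` to `current_elements` and is re-requested at the higher bitrate.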
According to an aspect the system may be configured so that at least one audio element is associated to at least one scene, each audio element being associated to a position and/or area in the visual environment associated to the scene, wherein the system may be configured to request and/or receive streams at higher bitrate and/or quality for audio elements closer to the user than for audio elements more distant from the user.
According to an aspect in the system at least one audio element may be associated to at least one scene, the at least one audio element being associated to a position and/or area in the visual environment associated to the scene, wherein the system may be configured to request different streams at different bitrates and/or quality levels for audio elements based on their relevance and/or audibility level at each user's virtual position in the scene, wherein the system may be configured to request an audio stream at higher bitrate and/or quality level for audio elements which are more relevant and/or more audible at the current user's virtual position, and/or an audio stream at lower bitrate and/or quality level for audio elements which are less relevant and/or less audible at the current user's virtual position.
In an embodiment in the system at least one audio element may be associated to a scene, each audio element being associated to a position and/or area in the visual environment associated to the scene, wherein the system may be configured to periodically send to the server the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data, so that: for a first position, a stream at higher bitrate and/or quality is provided, from the server, and for a second position, a stream at lower bitrate and/or quality is provided, from the server, wherein the first position is closer to the at least one audio element than the second position.
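A sketch of the distance-dependent quality selection described above; the distance threshold and the bitrate values are placeholders, and a real client might interpolate over several quality levels rather than use a single cut-off.

```python
import math

def quality_for_distance(user_pos, element_pos,
                         near_bps=128_000, far_bps=32_000, threshold=5.0):
    """Request a higher bitrate for audio elements near the user's virtual
    position and a lower one beyond a distance threshold (values illustrative).

    user_pos, element_pos: (x, y, z) coordinates in the virtual environment.
    """
    distance = math.dist(user_pos, element_pos)
    return near_bps if distance <= threshold else far_bps
```

This matches the behaviour above: a first position close to the element yields the high-bitrate stream, a second, more distant position the low-bitrate one.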
In an embodiment of the system, a plurality of scenes may be defined for multiple visual environments, such as adjacent and/or neighbouring environments, so that first streams are provided associated to a first, current scene and, in case of the user's transition to a second, further scene, both the streams associated to the first scene and the second streams associated to the second scene are provided.
In an embodiment of the system, a plurality of scenes may be defined for a first and a second visual environment, the first and second environments being adjacent and/or neighbouring environments, wherein first streams associated to the first scene are provided, from the server, for the reproduction of the first scene in case of the user's position or virtual position being in a first environment associated to the first scene, second streams associated to the second scene are provided, from the server, for the reproduction of the second scene in case of the user's position or virtual position being in a second environment associated to the second scene, and both first streams associated to the first scene and second streams associated to the second scene are provided in case of the user's position or virtual position being in a transitional position between the first scene and the second scene.
In an embodiment of the system, a plurality of scenes may be defined for a first and a second visual environment, which are adjacent and/or neighbouring environments, wherein the system is configured to request and/or receive first streams associated to a first scene associated to the first environment, for the reproduction of the first scene in case of the user's virtual position being in the first environment, wherein the system may be configured to request and/or receive second streams associated to the second scene associated to the second environment, for the reproduction of the second scene in case of the user's virtual position being in the second environment, and wherein the system may be configured to request and/or receive both first streams associated to the first scene and second streams associated to the second scene in case of the user's virtual position being in a transitional position between the first environment and the second environment.
According to an aspect the system may be configured so that the first streams associated to the first scene are obtained at a higher bitrate and/or quality when the user is in the first environment associated to the first scene, while the second streams associated to the second scene associated to the second environment are obtained at a lower bitrate and/or quality when the user is in the beginning of a transitional position from the first scene to the second scene, and the first streams associated to the first scene are obtained at a lower bitrate and/or quality and the second streams associated to the second scene are obtained at a higher bitrate and/or quality when the user is in the end of a transitional position from the first scene to the second scene, wherein the lower bitrate and/or quality is lower than the higher bitrate and/or quality.
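The bitrate swap over the course of the transition can be sketched with a normalised transition progress; the 0.5 switch-over point and the bitrate values are illustrative choices (a real client might blend more gradually).

```python
def transition_bitrates(progress, low_bps=32_000, high_bps=128_000):
    """Bitrate split while the user moves from scene 1 to scene 2.

    progress: 0.0 at the beginning of the transition (still in scene 1),
              1.0 at the end (arrived in scene 2).
    Returns (scene1_bps, scene2_bps): scene 1 starts high and ends low,
    scene 2 the reverse, matching the behaviour described above.
    """
    progress = min(max(progress, 0.0), 1.0)
    scene1 = high_bps if progress < 0.5 else low_bps
    scene2 = low_bps if progress < 0.5 else high_bps
    return scene1, scene2
```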
According to an aspect the system may be configured so that a plurality of scenes may be defined for multiple environments, such as adjacent and/or neighbouring environments, so that the system may obtain the streams associated to a first, current scene associated to a first, current environment, and, in case the distance of the user's position or virtual position from a boundary of the scene is below a predetermined threshold, the system may further obtain audio streams associated to a second, adjacent and/or neighbouring environment associated to a second scene.
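The boundary-threshold behaviour can be sketched as below; the distance function, the threshold value and the stream lists are all hypothetical.

```python
def streams_to_fetch(user_pos, boundary_distance_fn, current_streams,
                     neighbour_streams, threshold=2.0):
    """Always fetch the current scene's streams; additionally fetch the
    neighbouring scene's streams once the user's distance to the scene
    boundary drops below `threshold` (function and value illustrative).

    boundary_distance_fn: maps a position to its distance from the boundary.
    """
    fetch = list(current_streams)
    if boundary_distance_fn(user_pos) < threshold:
        fetch += list(neighbour_streams)
    return fetch
```

With, say, a boundary plane at x = 10, a user at x = 9.5 is within the threshold and the neighbouring scene's streams are prefetched; a user at x = 0 is not.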
According to an aspect the system may be configured so that a plurality of scenes may be defined for multiple visual environments, so that the system requests and/or obtains the streams associated to the current scene at a higher bitrate and/or quality and the streams associated to the second scene at a lower bitrate and/or quality, wherein the lower bitrate and/or quality is lower than the higher bitrate and/or quality.
According to an aspect the system may be configured so that a plurality of N audio elements may be defined, and, in case the user's distance to the position or area of these audio elements is larger than a predetermined threshold, the N audio elements are processed to obtain a smaller number M of audio elements (M < N), each audio element being associated to a position and/or area in the visual environment,
wherein the at least one plurality of N audio elements is provided in at least one representation at a high bitrate and/or quality level, and
wherein the at least one plurality of N audio elements is provided in at least one representation at a low bitrate and/or quality level, where the at least one representation is obtained by processing the N audio elements to obtain the smaller number M of audio elements (M < N), each audio element being associated to a position and/or area in the visual environment.
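One possible sketch of the N-to-M processing: when every element is beyond a distance threshold, the elements are merged into M position clusters. The grouping rule here (sorting by x and averaging positions) is a crude stand-in for whatever downmix or clustering an encoder would actually apply; all values are illustrative.

```python
import math

def downmix_elements(elements, user_pos, threshold=10.0, m=2):
    """Merge N audio elements into M when the user is far from all of them.

    elements: list of (element_id, (x, y, z)) tuples.
    If any element is within `threshold` of the user, the full N elements are
    kept; otherwise they are grouped and each group replaced by one element
    at the group's average position.
    """
    if all(math.dist(user_pos, pos) > threshold for _, pos in elements):
        ordered = sorted(elements, key=lambda e: e[1][0])  # crude spatial grouping
        size = math.ceil(len(ordered) / m)
        groups = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        merged = []
        for i, group in enumerate(groups):
            centroid = tuple(sum(pos[d] for _, pos in group) / len(group)
                             for d in range(3))
            merged.append((f"cluster_{i}", centroid))
        return merged
    return elements
```

The low-bitrate representation described above would carry the M merged elements, while the high-bitrate representation carries all N originals.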
| # | Name | Date |
|---|---|---|
| 1 | 202037015367-STATEMENT OF UNDERTAKING (FORM 3) [08-04-2020(online)].pdf | 2020-04-08 |
| 2 | 202037015367-FORM 1 [08-04-2020(online)].pdf | 2020-04-08 |
| 3 | 202037015367-FIGURE OF ABSTRACT [08-04-2020(online)].pdf | 2020-04-08 |
| 4 | 202037015367-DRAWINGS [08-04-2020(online)].pdf | 2020-04-08 |
| 5 | 202037015367-DECLARATION OF INVENTORSHIP (FORM 5) [08-04-2020(online)].pdf | 2020-04-08 |
| 6 | 202037015367-COMPLETE SPECIFICATION [08-04-2020(online)].pdf | 2020-04-08 |
| 7 | 202037015367.pdf | 2020-04-10 |
| 8 | 202037015367-MARKED COPIES OF AMENDEMENTS [17-06-2020(online)].pdf | 2020-06-17 |
| 9 | 202037015367-FORM 13 [17-06-2020(online)].pdf | 2020-06-17 |
| 10 | 202037015367-Annexure [17-06-2020(online)].pdf | 2020-06-17 |
| 11 | 202037015367-AMMENDED DOCUMENTS [17-06-2020(online)].pdf | 2020-06-17 |
| 12 | 202037015367-FORM 18 [09-07-2020(online)].pdf | 2020-07-09 |
| 13 | 202037015367-FORM-26 [28-07-2020(online)].pdf | 2020-07-28 |
| 14 | 202037015367-PA ORIGINAL-(07-08-2020).PDF | 2020-08-07 |
| 15 | 202037015367-FORM-26 [07-08-2020(online)].pdf | 2020-08-07 |
| 16 | 202037015367-Proof of Right [18-08-2020(online)].pdf | 2020-08-18 |
| 17 | 202037015367-Information under section 8(2) [04-09-2020(online)].pdf | 2020-09-04 |
| 18 | 202037015367-Information under section 8(2) [17-03-2021(online)].pdf | 2021-03-17 |
| 19 | 202037015367-Information under section 8(2) [16-07-2021(online)].pdf | 2021-07-16 |
| 20 | 202037015367-Information under section 8(2) [23-09-2021(online)].pdf | 2021-09-23 |
| 21 | 202037015367-FER.pdf | 2021-10-18 |
| 22 | 202037015367-Information under section 8(2) [24-11-2021(online)].pdf | 2021-11-24 |
| 23 | 202037015367-Information under section 8(2) [19-01-2022(online)].pdf | 2022-01-19 |
| 24 | 202037015367-FORM 4(ii) [16-02-2022(online)].pdf | 2022-02-16 |
| 25 | 202037015367-Information under section 8(2) [03-03-2022(online)].pdf | 2022-03-03 |
| 26 | 202037015367-FORM 3 [06-05-2022(online)].pdf | 2022-05-06 |
| 27 | 202037015367-CLAIMS [14-05-2022(online)].pdf | 2022-05-14 |
| 28 | 202037015367-FER_SER_REPLY [14-05-2022(online)].pdf | 2022-05-14 |
| 29 | 202037015367-OTHERS [14-05-2022(online)].pdf | 2022-05-14 |
| 30 | 202037015367-Information under section 8(2) [01-08-2022(online)].pdf | 2022-08-01 |
| 31 | 202037015367-Information under section 8(2) [29-09-2022(online)].pdf | 2022-09-29 |
| 32 | 202037015367-FORM 3 [12-11-2022(online)].pdf | 2022-11-12 |
| 33 | 202037015367-Information under section 8(2) [12-11-2022(online)].pdf | 2022-11-12 |
| 34 | 202037015367-FORM 3 [03-05-2023(online)].pdf | 2023-05-03 |
| 35 | 202037015367-Information under section 8(2) [03-05-2023(online)].pdf | 2023-05-03 |
| 36 | 202037015367-Information under section 8(2) [10-08-2023(online)].pdf | 2023-08-10 |
| 37 | 202037015367-Information under section 8(2) [02-11-2023(online)].pdf | 2023-11-02 |
| 38 | 202037015367-FORM 3 [24-11-2023(online)].pdf | 2023-11-24 |
| 39 | 202037015367-US(14)-HearingNotice-(HearingDate-07-05-2024).pdf | 2024-04-08 |
| 40 | 202037015367-FORM-26 [23-04-2024(online)].pdf | 2024-04-23 |
| 41 | 202037015367-Correspondence to notify the Controller [23-04-2024(online)].pdf | 2024-04-23 |
| 42 | 202037015367-Written submissions and relevant documents [21-05-2024(online)].pdf | 2024-05-21 |
| 43 | 202037015367-PatentCertificate22-07-2024.pdf | 2024-07-22 |
| 44 | 202037015367-IntimationOfGrant22-07-2024.pdf | 2024-07-22 |
| 45 | SearchStreategyE_16-08-2021.pdf | |