Method And Encoder And Decoder For Sample Accurate Representation Of An Audio Signal
Abstract:
A method for providing information on the validity of encoded audio data is disclosed, the encoded audio data being a series of coded audio data units. Each coded audio data unit can contain information on the valid audio data. The method comprises: providing either information on a coded audio data level which describes the amount of data at the beginning of an audio data unit being invalid, or providing information on a coded audio data level which describes the amount of data at the end of an audio data unit being invalid, or providing information on a coded audio data level which describes both the amount of data at the beginning and the end of an audio data unit being invalid. A method for receiving encoded data including information on the validity of data and providing decoded output data is also disclosed. Furthermore, a corresponding encoder and a corresponding decoder are disclosed.
METHOD AND ENCODER AND DECODER FOR GAP-LESS PLAYBACK OF AN AUDIO SIGNAL
Description
Technical Field
Embodiments of the invention relate to the field of source coding of an audio signal. More
specifically, embodiments of the invention relate to a method for encoding information on the
original valid audio data and an associated decoder. More specifically, embodiments of the
invention provide the recovery of the audio data with their original duration.
Background of the Invention
Audio encoders are typically used to compress an audio signal for transmission or storage.
Depending on the coder used, the signal can be encoded losslessly (allowing perfect
reconstruction) or lossily (for imperfect but sufficient reconstruction). The associated decoder
inverts the encoding operation and recreates the perfect or imperfect audio signal. When the
literature mentions artifacts, it typically refers to the loss of information that is characteristic
of lossy coding. Such artifacts include a limited audio bandwidth, echo and ringing artifacts and
other distortions, which may be audible or masked due to the properties of human hearing.
Summary of the Invention
The problem tackled by this invention relates to another set of artifacts, which are typically
not covered in audio coding literature: additional silence periods at the beginning and the end
of an encoding. Solutions for these artifacts exist, which are often referred to as gap-less
playback methods. The sources for these artifacts are at first the coarse granularity of coded
audio data where e.g. one unit of coded audio data always contains information for 1024
original un-coded audio samples. Secondly, the digital signal processing is often only possible
with algorithmic delays due to the digital filters and filter banks involved.
Many applications do not require the recovery of the originally valid samples. Radio
broadcasts, for example, are normally not problematic, since the coded audio stream is
continuous and a concatenation of separate encodings does not happen. TV broadcasts are
also often statically configured, and a single encoder is used before transmission. The extra
silence periods become a problem, however, when several pre-encoded streams are spliced
together (as used for ad insertion), when audio-video synchronization becomes an issue, for
the storage of compressed data where the decoding shall not exhibit the extra audio samples
at the beginning and the end (especially for lossless encoding requiring a bit-exact
reconstruction of the original uncompressed audio data), and for editing in the compressed
domain.
While many users have already adapted to these extra silence periods, other users complain about
the extra silence, which is especially problematic when several encodings are concatenated
and formerly uncompressed gap-less audio data becomes interrupted after being encoded and
decoded. It is an object of the invention to provide an improved approach allowing the
removal of unwanted silence at the beginning and end of encodings.
Video coding using differential coding mechanisms (I-frames, P-frames and B-frames)
does not introduce any extra frames at the beginning or end. In contrast, an audio encoder
typically prepends additional samples. Depending on their number, they may lead to a
perceptible loss of audio-video synchronization. This is often referred to as the lip-sync
problem, the mismatch between the perceived motion of a speaker's mouth and the heard
sound. Many applications tackle this problem by offering a lip-sync adjustment, which has
to be performed by the user, since the delay is highly variable, depending on the codec in use and its settings.
It is an object of the invention to provide an improved approach allowing a synchronized
playback of audio and video.
Digital broadcasts became more heterogeneous in the past, with regional differences and
personalized programs and adverts. A main broadcast stream is hence replaced and spliced
with a local or user-specific content, which may be a live stream or pre-encoded data. The
splicing of these streams mainly depends on the transmission system; however, the audio can
often not be spliced perfectly, as wanted, due to the unknown silence periods. A current
method is often to leave the silence periods in the signal, although these gaps in the audio
signal can be perceived. It is an object of the invention to provide an improved approach
allowing splicing of two compressed audio streams.
Editing is normally done in the uncompressed domain, where the editing operations are well-known.
If the source material is however an already lossy coded audio signal, then even
simple cut operations require a complete new encoding, resulting in tandem coding artifacts.
Hence, tandem decoding and encoding operations should be avoided. It is an object of the
invention to provide an improved approach allowing cutting of a compressed audio stream.
A different aspect is the erasure of invalid audio samples in systems that require a protected
data path. The protected media path is used to enforce digital rights management and to
ensure data integrity by using encrypted communication between the components of a system.
In these systems this requirement can be fulfilled only if non-constant durations of an audio
data unit become possible, since audio editing operations can be applied only at trusted
elements within the protected media path. These trusted elements are typically only the
decoders and the rendering elements.
Embodiments of the invention provide a method for providing information on the validity of
encoded audio data, the encoded audio data being a series of coded audio data units, wherein
each coded audio data unit can contain information on the valid audio data, the method
comprising:
providing either information on a coded audio data level which describes the amount
of data at the beginning of an audio data unit being invalid,
or providing information on a coded audio data level which describes the amount of
data at the end of an audio data unit being invalid,
or providing information on a coded audio data level which describes both the amount
of data at the beginning and the end of an audio data unit being invalid.
Further embodiments of the invention provide an encoder for providing the information on
the validity of data:
wherein the encoder is configured to apply the method for providing information on
the validity of data.
Further embodiments of the invention provide a method for receiving encoded data including
the information on the validity of data and providing decoded output data, the method
comprising:
receiving encoded data with either information on a coded audio data level which
describes the amount of data at the beginning of an audio data unit being invalid,
or information on a coded audio data level which describes the amount of data at the
end of an audio data unit being invalid,
or information on a coded audio data level which describes both the amount of data at
the beginning and the end of an audio data unit being invalid;
and providing decoded output data which only contains the samples not marked as
invalid,
or containing all audio samples of the coded audio data unit and providing information
to the application which part of the data is valid.
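The decoder-side behavior described above can be illustrated with a minimal sketch (all names are illustrative, not taken from any standard): each decoded audio data unit carries counts of invalid samples at its beginning and end, and only the samples not marked as invalid are output.

```python
# Hypothetical sketch: per-unit validity information applied by the decoder.
# "invalid_at_start"/"invalid_at_end" stand for the signaled amounts of
# invalid data at the beginning and end of one audio data unit.

def trim_decoded_unit(samples, invalid_at_start=0, invalid_at_end=0):
    """Return only the samples of one decoded unit not marked as invalid."""
    end = len(samples) - invalid_at_end
    return samples[invalid_at_start:end]

# A 1024-sample unit whose first 100 and last 24 samples are invalid:
unit = list(range(1024))
valid = trim_decoded_unit(unit, invalid_at_start=100, invalid_at_end=24)
print(len(valid))  # 900 valid samples remain
```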
Further embodiments of the invention provide a decoder for receiving encoded data and
providing decoded output data, the decoder comprising:
an input for receiving a series of encoded audio data units with a plurality of encoded
audio samples therein, where some audio data units contain information on the validity of
data, the information being formatted as described in the method for receiving encoded audio
data including information on the validity of data,
a decoding portion coupled to the input and configured to apply the information on the
validity of data,
an output for providing decoded audio samples, where either only the valid audio
samples are provided,
or where information on the validity of the decoded audio samples is provided.
Embodiments of the invention provide a computer readable medium for storing instructions
for executing at least one of the methods in accordance with embodiments of the invention.
The invention provides a novel approach for providing the information on the validity of data,
differing from existing approaches that are outside the audio subsystem and/or approaches
that only provide a delay value and the duration of the original data.
Embodiments of the invention are advantageous as they are applicable within the audio
encoder and decoder, which are already dealing with compressed and uncompressed audio
data. This enables systems that do not need further audio signal processing outside the audio
encoder and decoder to compress and decompress only valid data, as mentioned above.
Embodiments of the invention enable signaling of valid data not only for file-based
applications but also for stream-based and live applications, where the duration of the valid
audio data is not known at the beginning of the encoding.
In accordance with embodiments of the invention the encoded stream contains validity
information on an audio data unit level, which can be an MPEG-4 AAC Audio Access Unit.
To conserve compatibility to existing decoders the information is put into a portion of the
Access Unit which is optional and can be ignored by decoders not supporting the validity
information. Such a portion is the extension payload of an MPEG-4 AAC Audio Access Unit.
The invention is applicable to most existing audio coding schemes, including MPEG-1 Layer
3 Audio (MP3), and future audio coding schemes which work on a block basis and/or suffer
from algorithmic delay.
In accordance with embodiments of the invention, a novel approach for the removal of invalid
data is provided. The novel approach is based on already existing information available to the
encoder, the decoder and the system layers embedding encoder or decoder.
Brief Description of the Drawings
Embodiments according to the invention will subsequently be described with reference to
the enclosed figures, in which:
Fig. 1 illustrates an HE AAC decoder behaviour: dual-rate mode;
Fig. 2 illustrates an information exchange between a Systems Layer entity and an
audio decoder;
Fig. 3 shows a schematic flow diagram of a method for providing information on the
validity of encoded audio data according to a first possible embodiment;
Fig. 4 shows a schematic flow diagram of a method for providing information on the
validity of encoded audio data according to a second possible embodiment of
the teachings disclosed herein;
Fig. 5 shows a schematic flow diagram of a method for providing information on the
validity of encoded audio data according to a third possible embodiment of the
teachings disclosed herein;
Fig. 6 shows a schematic flow diagram of a method for receiving encoded data
including the information on the validity of data according to an embodiment
of the teachings disclosed herein;
Fig. 7 shows a schematic flow diagram of the method for receiving encoded data
according to another embodiment of the teachings disclosed herein;
Fig. 8 shows an input/output diagram of an encoder according to an embodiment of
the teachings disclosed herein;
Fig. 9 shows a schematic input/output diagram of an encoder according to another
embodiment of the teachings disclosed herein;
Fig. 10 shows a schematic block diagram of a decoder according to an embodiment of
the teachings disclosed herein; and
Fig. 11 shows a schematic block diagram of a decoder according to another
embodiment of the teachings disclosed herein.
Detailed Description of Illustrative Embodiments
Fig. 1 shows the behavior of a decoder with respect to the access units (AU) and associated
composition units (CU). The decoder is connected to an entity denominated "Systems" that
receives an output generated by the decoder. As an example, the decoder shall be assumed to
function under the HE-AAC (High Efficiency - Advanced Audio Coding) standard. An HE-AAC
decoder is essentially an AAC decoder followed by an SBR (Spectral Band Replication)
"post-processing" stage. The additional delay imposed by the SBR tool is due to the QMF
bank and the data buffers within the SBR tool. It can be derived by the following formula:
Delay_SBR-Tool = L_AnalysisFilter - N_AnalysisChannels + 1 + Delay_buffer
where
N_AnalysisChannels = 32, L_AnalysisFilter = 320 and Delay_buffer = 6 x 32.
This means that the delay imposed by the SBR tool (at the input sampling rate, i.e., the output
sampling rate of the AAC core) is
Delay_SBR-Tool = 320 - 32 + 1 + 6 x 32 = 481
samples.
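As a cross-check, the delay formula above can be evaluated directly (constants as given in the text):

```python
# SBR tool delay at the AAC core sampling rate, per the formula above.
L_ANALYSIS_FILTER = 320    # QMF analysis prototype filter length
N_ANALYSIS_CHANNELS = 32   # number of QMF analysis channels
DELAY_BUFFER = 6 * 32      # internal SBR data buffers

delay = L_ANALYSIS_FILTER - N_ANALYSIS_CHANNELS + 1 + DELAY_BUFFER
print(delay)      # 481 samples at the core rate
print(2 * delay)  # 962 samples at the SBR output rate in dual-rate mode
```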
Typically, the SBR tool runs in the "upsampling" (or "dual rate") mode, in which case the 481
sample delay at the AAC sampling rate translates to a 962 sample delay at the SBR output
rate. It could also operate at the same sampling rate as the AAC output (denoted as
"downsampled SBR mode"), in which case the additional delay is only 481 samples at the
SBR output rate. There is a "backwards compatible" mode in which the SBR tool is neglected
and the AAC output is the decoder output. In this case there is no additional delay.
Fig. 1 shows the decoder behavior for the most common case in which the SBR tool runs in
upsampling mode and the additional delay is 962 output samples. This delay corresponds to
approximately 47% of the length of the upsampled AAC frame (after SBR processing). Note
that Tl is the time stamp associated with CU 1 after the delay of 962 samples, that is, the time
stamp for the first valid sample of HE-AAC output. Further note that if HE-AAC is running in
"downsampled SBR mode" or "single-rate" mode, the delay would be 481 samples, but the
time stamp would be identical, since in single-rate mode the CUs contain half the number of
samples, so that the delay is still 47% of the CU duration.
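The 47% figure can be verified from the numbers above:

```python
# Delay-to-frame ratio quoted above (values taken from this document).
dual_rate_ratio = 962 / (2 * 1024)   # delay vs. 2048-sample upsampled frame
single_rate_ratio = 481 / 1024       # delay vs. 1024-sample CU in single-rate mode
print(round(100 * dual_rate_ratio))    # approximately 47 percent
print(round(100 * single_rate_ratio))  # approximately 47 percent
```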
For all of the available signaling mechanisms (i.e., implicit signaling, backward compatible
explicit signaling, or hierarchical explicit signaling) if the decoder is HE-AAC then it must
convey to Systems any additional delay incurred by SBR processing, otherwise the lack of an
indication from the decoder indicates that the decoder is AAC. Hence, Systems can adjust the
time stamp so as to compensate for the additional SBR delay.
The following section describes how an encoder and decoder for a transform-based audio
codec relate to MPEG Systems and proposes an additional mechanism to ensure identity of
the signal after an encoder-decoder round-trip except "coding artifacts" - especially in the
presence of codec extensions. Employing the described techniques ensures a predictable
operation from a Systems point of view and also removes the need for additional proprietary
"gapless" signaling, normally necessary to describe the encoder's behavior.
In this section, reference is made to the following standards:
[1] ISO/IEC TR 14496-24:2007: Information Technology - Coding of audio-visual objects -
Part 24: Audio and systems interaction
[2] ISO/IEC 14496-3:2009 Information Technology - Coding of audio-visual objects - Part 3:
Audio
[3] ISO/IEC 14496-12:2008 Information Technology - Coding of audio-visual objects - Part
12: ISO base media file format
In this section, [1] is briefly described. Basically, AAC (Advanced Audio Coding) and its
successors HE AAC, HE AAC v2 are codecs that do not have a 1:1 correspondence between
compressed and uncompressed data. The encoder adds additional audio samples to the
beginning and to the end of the uncompressed data and also produces Access Units with
compressed data for these, in addition to the Access Units covering the uncompressed original
data. A standards compliant decoder would then generate an uncompressed data stream
containing the additional samples, being added by the encoder.
[1] describes how existing tools of the ISO base media file format [3] can be reused to mark
the valid range of the decompressed data so that (besides codec artifacts) the original
uncompressed stream can be recovered. The marking is accomplished by using an edit list
with an entry, containing the valid range after the decoding operation.
Since this solution was not ready in time, proprietary solutions for marking the valid period
are now in widespread use (to name just two: Apple iTunes and Ahead Nero). It could be
argued that the proposed method in [1] is not very practical and suffers from the problem that
edit lists were originally meant for a different - potentially complex - purpose for which only
a few implementations are available.
In addition, [1] shows how pre-roll of data can be handled by using ISO FF (ISO File Format)
sample groups [3]. Pre-roll does not mark which data is valid but how many Access Units (or
samples in the ISO FF nomenclature) are to be decoded prior to decoder output at an arbitrary
point in time. For AAC this is always one sample (i.e., one Access Unit) in advance due to the
overlapping windows in the MDCT domain; hence the value for pre-roll is -1 for all Access
Units.
Another aspect relates to the additional look-ahead of many encoders. The additional
look-ahead depends e.g. on internal signal processing within the encoder that tries to create
real-time output. One option for taking the additional look-ahead into account may be to use the
edit list also for the encoder look-ahead delay.
As mentioned before it is questionable whether the original purpose of the edit list tool was to
mark the originally valid ranges within a media. [1] is silent on the implications of further
editing the file with edit lists, hence it can be assumed that using the edit list for the purpose
of [1] adds some fragility.
As a side note, proprietary solutions and solutions for MP3 audio all define the
additional end-to-end delay and the length of the original uncompressed audio data, very
similar to the Nero and iTunes solutions mentioned before and to what the edit list is used for
in [1].
In general, [1] is silent on the correct behavior of real-time streaming applications, which do
not use the MP4 file format but require timestamps for correct audio-video synchronization
and often operate in a very simple mode. There, timestamps are often set incorrectly, and
hence an adjustment is required at the decoding device to bring everything back into sync.
The interface between MPEG-4 Audio and MPEG-4 Systems is described in more detail in
the following paragraphs.
Every access unit delivered to the audio decoder from the Systems interface shall result in a
corresponding composition unit delivered from the audio decoder to the systems interface,
i.e., the compositor. This shall include start-up and shut-down conditions, i.e., when the
access unit is the first or the last in a finite sequence of access units.
For an audio composition unit, ISO/IEC 14496-1 subclause 7.1.3.5 Composition Time Stamp
(CTS) specifies that the composition time applies to the n-th audio sample within the
composition unit. The value of n is 1 unless specified differently in the remainder of this
subclause.
For compressed data, like HE-AAC coded audio, which can be decoded by different decoder
configurations, special attention is needed. In this case, decoding can be done in a
backward-compatible fashion (AAC only) as well as in an enhanced fashion (AAC+SBR). In order to
ensure that composition time stamps are handled correctly (so that audio remains
synchronized with other media), the following applies:
• If compressed data permits both backward-compatible and enhanced decoding, and if
the decoder is operating in a backwards-compatible fashion, then the decoder does not
have to take any special action. In this case, the value of n is 1.
• If compressed data permits both backward-compatible and enhanced decoding, and if
the decoder is operating in enhanced fashion such that it is using a post-processor that
inserts some additional delay (e.g., the SBR post-processor in HE-AAC), then it must
ensure that this additional time delay incurred relative to the backwards-compatible
mode, as described by a corresponding value of n, is taken into account when
presenting the composition unit. The value of n is specified in the following table.
The description of the Interface between Audio and Systems has proven to work reliably,
covering most of today's use-cases. If one looks carefully however, two issues are not
mentioned:
• In many systems the timestamp origin is the value zero. Pre-roll AUs are not assumed
to exist, although e.g. AAC has an inherent minimum encoder-delay of one Access
Unit that requires one Access Unit in front of the Access Unit at timestamp zero. For
the MP4 file format a solution for this problem is described in [1].
• Non-integer durations of the frame size are not covered. The AudioSpecificConfig()
structure allows the signaling of a small set of framesizes which describe the filter
bank lengths, e.g. 960 and 1024 for AAC. Real-world data, however, does typically
not fit onto a grid of fixed framesizes and hence an encoder has to pad the last frame.
These two left-out issues became a problem recently, with the advent of advanced multimedia
applications that require the splicing of two AAC streams or the recovery of the range of valid
samples after an encoder-decoder round-trip - especially in the absence of the MP4 file
format and the methods described in [1].
To overcome the problems mentioned before, pre-roll, post-roll and all other sources have to
be described properly. In addition a mechanism for non-integer multiples of the framesize is
needed to have sample-accurate audio representations.
Pre-roll is required initially for a decoder so that it is able to decode the data fully. As an
example, AAC requires a pre-roll of 1024 samples (one Access Unit) before the decoding of
an Access Unit so that the output samples of the overlap-add operation represent the desired
original signal, as illustrated in [1]. Other audio codecs may have different pre-roll
requirements.
Post-roll is equivalent to pre-roll with the difference that more data after the decoding of an
Access Unit is to be fed to the decoder. The cause for post-roll is codec extensions which
raise a codec's efficiency in exchange for algorithmic delay, such as listed in the table above.
Since a dual-mode operation is often desired, the pre-roll remains constant so that a decoder
without the extensions implemented can fully utilize the coded data. Hence, pre-roll and
timestamps relate to the legacy decoder capabilities. Post-roll is then required in addition for a
decoder supporting these extensions, since the internally existing delay line has to be flushed
to retrieve the entire representation of the original signal. Unfortunately, post-roll is decoder
dependent. It is however possible to handle pre-roll and post-roll independent of the decoder
if the pre-roll and post-roll values are known to the systems layer and the decoder's output of
pre-roll and post-roll can be dropped there.
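A minimal sketch of this systems-layer handling (function and parameter names are illustrative): the decoder output corresponding to pre-roll access units is dropped at the front, and the flushed post-roll samples are dropped at the back.

```python
# Illustrative sketch: dropping pre-roll and post-roll at the systems layer.
FRAME = 1024  # samples per access unit (AAC)

def decode_with_roll(access_units, decode, pre_roll_aus=1, post_roll_samples=0):
    """Decode all AUs, then drop the pre-roll output at the front and the
    flushed post-roll samples at the back."""
    out = []
    for au in access_units:
        out.extend(decode(au))
    out = out[pre_roll_aus * FRAME:]    # discard output of pre-roll AUs
    if post_roll_samples:
        out = out[:-post_roll_samples]  # discard flushed post-roll tail
    return out

# Toy "decoder" producing one constant-valued frame per AU:
decode = lambda au: [au] * FRAME
out = decode_with_roll([0, 1, 2], decode, pre_roll_aus=1)
print(len(out) // FRAME)  # 2 frames remain after dropping the pre-roll AU
```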
With respect to a variable audio frame size, since audio codecs always encode blocks of data
with a fixed number of samples, a sample-accurate representation becomes possible only
through further signaling on the Systems level. Since it is easiest for a decoder to handle
sample-accurate trimming, it seems desirable to have the decoder cut the signal. Hence, an
optional extension mechanism is proposed which allows the trimming of output samples by
the decoder.
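The need for trimming the padded last frame can be illustrated with a small calculation (illustrative numbers):

```python
# Illustrative numbers: one second at 44.1 kHz encoded with 1024-sample frames.
FRAME = 1024
valid_length = 44100                   # original uncompressed duration in samples
n_frames = -(-valid_length // FRAME)   # ceiling division: 44 access units needed
padded = n_frames * FRAME              # 45056 samples after encoder padding
trim_at_end = padded - valid_length    # samples the trimming tool must remove
print(n_frames, trim_at_end)  # 44 956
```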
Regarding a vendor-specific encoder delay, MPEG only specifies the decoder operation,
whereas encoders are only described informatively. This is one of the advantages of MPEG
technologies, where encoders can improve over time to fully utilize the capabilities of a
codec. The flexibility in designing an encoder has, however, led to delay interoperability
problems. Since encoders typically need a preview of the audio signal to make smarter
encoding decisions, this delay is highly vendor-specific. Reasons for this encoder delay are
e.g. block-switching decisions, which require a delay of the possible window overlaps, and
other optimizations, which are mostly relevant for real-time encoders.
File-based encoding of offline-available content does not require this delay, which is only
relevant when real-time data is encoded; nevertheless, most encoders also prepend silence
to the beginning of offline encodings.
One part of the solution for this problem is the correct setting of timestamps on the systems
layer so that these delays are irrelevant and have e.g. negative timestamp values. This can also
be accomplished with the edit list, as proposed in [1].
The other part of the solution is an alignment of the encoder delay to frame boundaries, so
that an integer number of Access Units with e.g. negative timestamps can be skipped initially
(besides the pre-roll Access Units).
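A sketch of this alignment (names are illustrative): the encoder delay is padded up to the next frame boundary, so that an integer number of leading access units can be skipped.

```python
# Illustrative sketch: padding the encoder delay to a frame boundary.
FRAME = 1024

def align_encoder_delay(encoder_delay):
    """Return the delay padded to the next frame boundary and the number of
    whole leading access units (e.g. with negative timestamps) to skip."""
    skip_aus = -(-encoder_delay // FRAME)  # ceiling division
    return skip_aus * FRAME, skip_aus

print(align_encoder_delay(2600))  # (3072, 3): 2600 samples padded to 3 frames
```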
The teachings disclosed herein also relate to the industrial standard ISO/IEC 14496-3:2009,
subpart 4, section 4.1.1.2. According to the teachings disclosed herein, the following is
proposed: When present, a post-decoder trimming tool selects a portion of the reconstructed
audio signal, so that two streams can be spliced together in the coded domain and
sample-accurate reconstruction becomes possible within the Audio layer.
The input to the post-decoder trimming tool is:
• The time domain reconstructed audio signal
• The post-trim control information
The output of the post-decoder trimming tool is:
• The time domain reconstructed audio signal
If the post-decoder trimming tool is not active, the time domain reconstructed audio signal is
passed directly to the output of the decoder. This tool is applied after any previous audio
coding tool.
The following table illustrates a proposed syntax of a data structure extension_payload() that
may be used to implement the teachings disclosed herein.
Syntax                                               No. of bits   Mnemonic
extension_payload(cnt)
{
    extension_type;                                  4             uimsbf
    align = 4;
    switch( extension_type ) {
    case EXT_TRIM:
        return trim_info();
    case EXT_DYNAMIC_RANGE:
        return dynamic_range_info();
    case EXT_SAC_DATA:
        return sac_extension_data(cnt);
    case EXT_SBR_DATA:
        return sbr_extension_data(id_aac, 0);                      Note 1
    case EXT_SBR_DATA_CRC:
        return sbr_extension_data(id_aac, 1);                      Note 1
    case EXT_FILL_DATA:
        fill_nibble; /* must be '0000' */            4             uimsbf
        for (i=0; i<cnt-1; i++)
            fill_byte[i]; /* must be '10100101' */   8             uimsbf
        return cnt;