3. Algorithm Description: Music Detection Module
The music detection function is a key innovation within Annex C+, designed to improve the codec’s performance and audio quality on non-speech audio, such as music, during DTX operation with the Annex E coder. Its primary role is to identify music segments and override the VAD’s decision, preventing musical passages from being incorrectly classified as silence and subsequently suppressed.
3.1 Operational Principles
The music detection function is a new procedure performed immediately after the VAD makes its initial decision. Its core behavior is to analyze a set of parameters and, if music is detected, force the VAD’s final decision to “speech.”
This function is active only during Annex E operation in conjunction with Annex B DTX. However, its internal parameters are updated continuously, regardless of the active coding mode, to maintain an accurate state of the audio signal’s characteristics over time.
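The control flow described above can be sketched as follows. This is a minimal illustration, not the normative Annex C+ procedure: the function names, the energy-gate VAD stub, and the threshold values in `music_detected` are assumptions made for the example.

```python
def initial_vad(frame_energy, threshold=1000.0):
    """Stand-in for the initial VAD decision (illustrative energy gate)."""
    return "speech" if frame_energy > threshold else "non-speech"

def music_detected(state):
    """Stand-in music test on the running-mean counters (thresholds assumed)."""
    return state["mcount_pflag"] > 25 and state["mcount_music"] > 280

def final_vad_decision(frame_energy, state, annex_e_active, dtx_active):
    """Music detection runs after the initial VAD decision and can only
    promote 'non-speech' to 'speech', never the reverse."""
    vad_deci = initial_vad(frame_energy)
    # The detector's internal parameters would be updated here on every
    # frame, regardless of which coding mode is active.
    if annex_e_active and dtx_active and vad_deci == "non-speech":
        if music_detected(state):
            vad_deci = "speech"  # override the VAD for music passages
    return vad_deci
```

Note that the override is gated on both Annex E and DTX being active, while parameter updates (elided here) would run unconditionally.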
3.2 Parameter Computation
The algorithm’s first main part involves the computation of several parameters that characterize the input signal. These parameters form the basis of the music detection logic.
- Partial Normalized Residual Energy: An energy measurement (Lenergy) of the LPC residual signal, indicating the energy of the current frame after linear prediction.
- Spectral Difference: A measure (SD) of the spectral change between the current frame’s reflection coefficients and a running mean of past coefficients, indicating spectral stationarity.
- Open-Loop Pitch Lag Correction: A process applied to the open-loop pitch lag to correct for common estimation errors such as pitch doubling or tripling, ensuring a more accurate pitch track.
- Pitch Lag Standard Deviation: A statistical measure (std) of the stability of the pitch lag over the last five frames. A low standard deviation suggests a stable, periodic signal, characteristic of voiced speech or music.
- Running Mean of Pitch Gain: A smoothed average (mPgain) of the pitch gain values, reflecting the strength of the pitch periodicity over time.
- Pitch Lag Smoothness and Voicing Strength Indicator (Pflag): A logical flag derived from the pitch lag standard deviation and the running mean of pitch gain. It provides a binary indication of whether the signal exhibits strong and stable voicing characteristics.
- Stationarity Counters: A set of counters that track the persistence of various signal properties over consecutive frames:
  - count_consc_rflag: Tracks frames where specific reflection coefficient and pitch gain conditions are met.
  - count_music: Tracks frames where the previous frame used backward-adaptive LPC and the current frame is classified as “speech.”
  - count_consc: Tracks consecutive frames in which count_music remains zero. The running mean, mcount_music, is reset to 0 if count_consc exceeds 500 or if count_consc_rflag exceeds 150.
  - count_pflag: Tracks the number of frames within a 64-frame window in which Pflag is active.
  - count_consc_pflag: Tracks consecutive frames in which Pflag is continuously inactive. The running mean, mcount_pflag, is reset to 0 if count_consc_pflag exceeds 100 or if count_consc_rflag exceeds 150.
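A minimal sketch of some of the per-frame computations listed above: the pitch-lag multiple correction, the pitch-lag standard deviation over the last five frames, the running mean of the pitch gain (mPgain), and the derived Pflag. The smoothing factor, tolerance, and thresholds are illustrative assumptions; the normative fixed-point computation differs.

```python
from collections import deque
import statistics

def correct_pitch_lag(lag, prev_lag, tol=5):
    """Undo pitch doubling/tripling relative to the previous lag
    (illustrative tolerance)."""
    for k in (2, 3):
        if abs(lag / k - prev_lag) < tol:
            return round(lag / k)
    return lag

def update_pitch_statistics(state, pitch_lag, pitch_gain,
                            alpha=0.9, std_thr=1.3, gain_thr=0.5):
    """Per-frame update of the pitch-lag standard deviation (last five
    frames), the running mean of the pitch gain (mPgain), and the derived
    voicing flag Pflag. All constants here are assumed for illustration."""
    state["lags"].append(pitch_lag)          # state["lags"] is deque(maxlen=5)
    if len(state["lags"]) == 5:
        state["std"] = statistics.pstdev(state["lags"])
    # Exponentially smoothed running mean of the pitch gain.
    state["mPgain"] = alpha * state["mPgain"] + (1.0 - alpha) * pitch_gain
    # Pflag: stable pitch track AND strong periodicity.
    state["Pflag"] = int(state["std"] < std_thr and state["mPgain"] > gain_thr)
    return state["Pflag"]
```

With a constant pitch lag and strong pitch gains over several frames, the standard deviation drops to zero and mPgain rises, so Pflag becomes active.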
3.3 Classification Logic
Based on the computed parameters, the classification logic determines whether the initial VAD decision (Vad_deci) should be overridden. If the VAD classifies a frame as “non-speech,” the music detection module evaluates a set of conditions on the calculated parameters. The decision is changed from “non-speech” to “VOICE” if specific thresholds are met for the spectral difference (SD), the difference between the current and mean residual energies (Lenergy – mEnergy), the absolute energy (LLenergy), and the running-mean counters (mcount_pflag, mcount_music).
This logic is designed to identify signals that, while not matching typical speech patterns, exhibit the harmonic stability and consistent energy characteristic of music. A fundamental constraint of the module is that it can only change a VAD decision from “non-speech” to “speech,” never the reverse; genuine speech is therefore never misclassified as silence by this stage.
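The final decision step can be sketched as below. The threshold values and the exact grouping of the conditions are assumptions made for illustration; the normative logic combines the same quantities (SD, Lenergy – mEnergy, LLenergy, mcount_pflag, mcount_music) with codec-specific constants.

```python
def music_override(vad_deci, SD, Lenergy, mEnergy, LLenergy,
                   mcount_pflag, mcount_music,
                   sd_thr=0.15, denergy_thr=0.0, energy_floor=100.0,
                   pflag_thr=25.0, music_thr=280.0):
    """Flip an initial 'non-speech' decision to 'speech' when the
    stationarity and energy conditions suggest music. By construction,
    a 'speech' decision is never demoted."""
    if vad_deci != "non-speech":
        return vad_deci                                   # speech is never overridden
    is_music = (SD < sd_thr                               # spectrally stationary
                and (Lenergy - mEnergy) > denergy_thr     # energy above its running mean
                and LLenergy > energy_floor               # not near-silence
                and (mcount_pflag > pflag_thr or mcount_music > music_thr))
    return "speech" if is_music else vad_deci
```

The early return makes the one-way nature of the override explicit: only a “non-speech” input can ever be changed.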
The successful implementation of these algorithms relies on careful management of codec state variables, especially during periods of discontinuous transmission.