
Output Inconsistency of Feature set #46

Open
ademasi opened this issue May 11, 2022 · 7 comments

Comments

@ademasi

ademasi commented May 11, 2022

Hi, I am using the python library to interact with opensmile.

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
    verbose=True
)

When I use smile.process_file on a file, or smile.process_signal on the same file loaded with librosa, the resulting features differ. I don't understand how this is possible: it is the same file, so the extraction should be identical whether I pass the file or the signal. I am using librosa because the signal is also used elsewhere in my code.
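A minimal sketch of what I mean (the file name is just a placeholder):

import librosa
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# features extracted directly from the file
f_file = smile.process_file('speech.wav')

# features extracted from the same file loaded with librosa
signal, sampling_rate = librosa.load('speech.wav')
f_signal = smile.process_signal(signal, sampling_rate)

# the two rows contain different values
print(f_file.iloc[0].equals(f_signal.iloc[0]))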

What do you advise?

@hagenw
Member

hagenw commented May 11, 2022

Yes, you are right.

Here is a minimal example of how to reproduce it (even without librosa):

import audiofile
import numpy as np
import opensmile


np.random.seed(0)

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
    verbose=True
)

sampling_rate = 16000
signal = np.random.normal(size=(1, sampling_rate))
audiofile.write('test.wav', signal, sampling_rate)

f1 = smile.process_file('test.wav')
f2 = smile.process_signal(signal, sampling_rate)

and then

>>> f1['audspec_lengthL1norm_sma_maxPos']
file      start   end            
test.wav  0 days  0 days 00:00:01    0.430108
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32
>>> f2['audspec_lengthL1norm_sma_maxPos']
start   end            
0 days  0 days 00:00:01    0.0
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32
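To see how many of the ComParE functionals actually differ, the two frames can be compared directly (a small sketch building on the example above):

import numpy as np

# compare all functionals from process_file vs. process_signal
diff = ~np.isclose(f1.values, f2.values)
print(f'{diff.sum()} of {diff.size} features differ')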

@Michele1996

Michele1996 commented Jul 15, 2022

Hi, does someone have an answer for this? I would like to use process_signal as it is faster than process_file.

@jonasvdd

jonasvdd commented Jan 11, 2023

Hi,

I also encountered the above issue!

My workaround is:

  • assuming that openSMILE's WAV parsing is correct.
  • When I want to pass a (WAV) array to opensmile's process_signal, I use torchaudio's load function instead of librosa; torchaudio's load gives exactly the same smile results as processing the WAV file directly.
# load the wav data and convert to 32-bit float
arr, fs = torchaudio.load(WAV_PATH, normalize=True)
arr = arr.numpy().ravel()
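Put together, the workaround looks roughly like this (a sketch; WAV_PATH is a placeholder and I assume a mono file):

import opensmile
import torchaudio

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# load the wav data as 32-bit float in the range -1..1
arr, fs = torchaudio.load(WAV_PATH, normalize=True)
arr = arr.numpy().ravel()  # assumes a mono file

# gives the same results as smile.process_file(WAV_PATH)
features = smile.process_signal(arr, fs)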

I also noticed significant differences in feature values when resampling the signal!

E.g., in the visualization below I used the raw 44.1 kHz signal and a 16 kHz sinc-resampled variant to extract GeMAPSv01b LLDs.

legend:

  • Smile-orig-n: using 44.1kHz data
  • Smile-16kHz-n: using 16 kHz data

It seems that the GeMAPS F0semitone is more robustly extracted in the 16 kHz variant (fewer peaks at 60)?

(figure: GeMAPSv01b LLDs extracted from the 44.1 kHz and 16 kHz variants)

Is this behavior normal?

@hagenw
Member

hagenw commented Jan 11, 2023

assuming that openSMILE's WAV parsing is correct.

If you are using the Python version, the WAV parsing of opensmile is not used: the file is read with audiofile first and then processed internally with https://github.com/audeering/opensmile-python/blob/c64837d6fdfa62f1810ba00ed0f44d2c2bd7ddd1/opensmile/core/smile.py#L263-L326
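Conceptually (a simplified sketch, not the actual implementation), process_file therefore behaves like:

import audiofile

# read the file into a float signal in the range -1..1 ...
signal, sampling_rate = audiofile.read('test.wav')

# ... and hand it to the same processing chain as process_signal
# (smile as defined in the example above)
features = smile.process_signal(signal, sampling_rate)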

The code that reproduces the error here at #46 (comment) returns different results because I did not normalize the magnitude of the audio: the random signal exceeds the -1..1 range, so writing it to a WAV file changes it. When I repeat the example while ensuring the amplitude is in the range -1..1, I get:

import audiofile
import numpy as np
import opensmile


np.random.seed(0)

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
    verbose=True
)

sampling_rate = 16000
signal = np.random.normal(size=(1, sampling_rate))
signal = signal / (np.max(np.abs(signal)) + 10 ** -9)
audiofile.write('test.wav', signal, sampling_rate)

f1 = smile.process_file('test.wav')
f2 = smile.process_signal(signal, sampling_rate)

and then

>>> f1['audspec_lengthL1norm_sma_maxPos']
file      start   end            
test.wav  0 days  0 days 00:00:01    0.430108
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32
>>> f2['audspec_lengthL1norm_sma_maxPos']
start   end            
0 days  0 days 00:00:01    0.430108
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32

@hagenw
Member

hagenw commented Jan 11, 2023

The problem with librosa is that it automatically converts the sampling rate when you don't specify it during loading. E.g., when I load the 16000 Hz test file I generated above, I get:

>>> import librosa
>>> signal, sampling_rate = librosa.load('test.wav')
>>> sampling_rate
22050

If I then execute opensmile, I get a different result:

>>> f3 = smile.process_signal(signal, sampling_rate)
>>> f3['audspec_lengthL1norm_sma_maxPos']
start   end            
0 days  0 days 00:00:01    0.434783
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32

To avoid this, you have to tell librosa the desired sampling rate during loading, or pass sr=None to get the sampling rate from the file:

>>> signal, sampling_rate = librosa.load('test.wav', sr=None)
>>> sampling_rate
16000

If you then use opensmile, you get the desired result:

>>> f4 = smile.process_signal(signal, sampling_rate)
>>> f4['audspec_lengthL1norm_sma_maxPos']
start   end            
0 days  0 days 00:00:01    0.430108
Name: audspec_lengthL1norm_sma_maxPos, dtype: float32

@jonasvdd

jonasvdd commented Jan 17, 2023

Hi, indeed, when you use sr=None with librosa, you get the same results as just parsing the .wav file; thanks for helping with that ;).

But my second question was more about the (rather large) differences in openSMILE LLD values when resampling.

If you look at the image in my previous comment, you can see:

  • rather significant changes in jitter and shimmer
    • There is a common trend that jitter values for the 44.1 kHz data are really high when someone begins/ends a voiced segment (see the red VAD line in the upper subplot as reference for voiced regions)
  • some differences in the F0 semitone
    • The 44.1 kHz data has some peaks at 60, which would imply a peak F0 of 880 Hz, which is not feasible (see the image above).

What are the possible explanations for these differences, and which sampling rate is recommended when using openSMILE? (In the majority of research papers I find resampling to 16 kHz as a preprocessing step, but I would presume that, for features such as jitter and shimmer, a higher and thus more temporally accurate rate should give better results.)
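For concreteness, the resampling step I mean is roughly this (a sketch, not my exact pipeline; the file name is a placeholder):

import librosa
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# load at the native 44.1 kHz rate ...
signal_44k, sr_44k = librosa.load('speech.wav', sr=None)
llds_44k = smile.process_signal(signal_44k, sr_44k)

# ... and compare against a 16 kHz resampled variant
signal_16k = librosa.resample(signal_44k, orig_sr=sr_44k, target_sr=16000)
llds_16k = smile.process_signal(signal_16k, 16000)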

Looking forward to your response and kind regards,
Jonas

@jonasvdd

For reference, I'm using the GeMAPSv01b LLD config
