Merge pull request #4 from DigitalPhonetics/speechbrain_asr_eval
Include speechbrain ASR and B2 into main
SarinaMeyer authored Dec 24, 2023
2 parents 3a0c3b7 + 8a20e01 commit 720f190
Showing 813 changed files with 3,005 additions and 309 deletions.
42 changes: 32 additions & 10 deletions README.md
@@ -1,13 +1,28 @@
# [VoicePAT: Voice Privacy Anonymization Toolkit](http://arxiv.org/abs/2309.08049)

**Note: This repository and its documentation are still under construction but can already be used for both anonymization and evaluation. We welcome all contributions to introduce more generation methods or evaluation metrics to the VoicePAT framework. If you are interested in contributing, please leave comments on a GitHub issue.**

VoicePAT is a toolkit for speaker anonymization research. It is based on the framework(s) by the [VoicePrivacy Challenges](https://github.com/Voice-Privacy-Challenge/Voice-Privacy-Challenge-2022) but contains the following improvements:

* It consists of **two separate procedures for anonymization and evaluation**. This means that the generation of anonymized speech is independent of the evaluation of anonymization systems. The two processes do not need to be executed in the same run or with the same settings. Of course, you need to anonymize the evaluation data with one system before you can evaluate it, but this may have happened at an earlier time and with an external codebase.
* Anonymization and evaluation procedures are **structured as pipelines** consisting of separate **modules**. Each module may offer a selection of different models or algorithms to fulfill its role. The settings for each procedure / pipeline are defined exclusively in configuration files (a minimal illustrative sketch follows this list). See the *Usage* section below for more information.
* **Evaluation models** have been replaced with models based on [SpeechBrain](https://github.com/speechbrain/speechbrain/) and [ESPnet](https://github.com/espnet/espnet/), which are **more powerful** than the previous Kaldi-based models. Furthermore, we added new techniques to make evaluation significantly **more efficient**.
* The framework is written in **Python**, making it easy to include and adapt other Python-based models, e.g., using PyTorch. When using the framework, you do not need in-depth knowledge of anything outside the Python realm. (Disclaimer: although written in Python, the ASR evaluation currently uses an ESPnet-based model which in turn is based on Kaldi. However, you do not need to modify that part of the code to use or change the ASR model, and ESPnet is currently working on a Kaldi-free version.)
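To make the pipeline/module structure described above concrete, the following is a minimal, purely illustrative sketch of how a config-driven pipeline of modules can be wired together. The module registry, config keys, and `run()` interface are assumptions made for illustration, not the actual VoicePAT API.

```python
# Illustrative sketch only -- not the actual VoicePAT API.
from pathlib import Path

import yaml  # assumes PyYAML is available


def run_pipeline(config_path, module_registry):
    """Instantiate each configured module in order and pass the data through the chain."""
    config = yaml.safe_load(Path(config_path).read_text())
    data = config['data_dir']
    for step in config['pipeline']:               # e.g. ['extraction', 'anonymization', 'synthesis']
        module_cls = module_registry[step]        # implementation selected for this step
        module = module_cls(**config.get(step, {}))
        data = module.run(data)                   # each module consumes the previous module's output
    return data
```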


## Installation

@@ -43,10 +58,13 @@ Running an anonymization pipeline is done like this:
```
python run_anonymization.py --config anon_ims_sttts_pc.yaml --gpu_ids 0,1 --force_compute
```

This will perform all computations that support parallel computing on the GPUs with IDs 0 and 1, and on GPU 0 otherwise. If no GPU IDs are specified, it will run only on GPU 0 or on the CPU, depending on whether CUDA is available. `--force_compute` causes all previous computations to be run again. In most cases, you can drop that flag from the command to speed up the anonymization.
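For example, once the results of an earlier run exist, the same pipeline can be re-run on a single GPU without recomputation by dropping the flag:

```
python run_anonymization.py --config anon_ims_sttts_pc.yaml --gpu_ids 0
```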

Pretrained models for this anonymization can be found at [https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v2.0](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v2.0) and earlier releases.

### Evaluation

@@ -72,6 +90,10 @@ Pretrained evaluation models can be found in release v1.
Several parts of this toolkit are based on or use code from external sources, i.e.,

* [VoicePrivacy Challenge 2022](https://github.com/Voice-Privacy-Challenge/Voice-Privacy-Challenge-2022), [ESPnet](https://github.com/espnet/espnet/), [SpeechBrain](https://github.com/speechbrain/speechbrain/) for evaluation
* the [GAN-based anonymization system by IMS (University of Stuttgart)](https://github.com/DigitalPhonetics/speaker-anonymization) for anonymization


See the READMEs for [anonymization](anonymization/README.md) and [evaluation](evaluation/README.md) for more information.
142 changes: 142 additions & 0 deletions anonymization/modules/dsp/anonymise_dir_mcadams_rand_seed.py
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: Jose Patino, Massimiliano Todisco, Pramod Bachhav, Nicholas Evans
Audio Security and Privacy Group, EURECOM
modified version (N.T.)
"""
import os
import librosa
import numpy as np
import scipy.signal
import wave
import argparse
from pathlib import Path
import matplotlib.pyplot as plt
import random
from kaldiio import ReadHelper
import shutil

def load_utt2spk(path):
    assert os.path.isfile(path), f'File does not exist {path}'
    table = np.genfromtxt(path, dtype='U')
    utt2spk = {utt: spk for utt, spk in table}
    return utt2spk

def process_data(dataset_path, anon_level, settings):

    utt2spk = None
    if anon_level == 'spk':
        utt2spk = load_utt2spk(dataset_path / 'utt2spk')

    output_path = Path(str(dataset_path) + settings['anon_suffix'])
    if os.path.exists(output_path):
        shutil.rmtree(output_path)
    shutil.copytree(dataset_path, output_path)
    if not os.path.exists(output_path / 'wav'):
        os.makedirs(output_path / 'wav')
    wav_scp = dataset_path / 'wav.scp'
    path_wav_scp_out = output_path / 'wav.scp'
    with open(path_wav_scp_out, 'wt', encoding='utf-8') as writer:
        with ReadHelper(f'scp:{wav_scp}') as reader:
            print(reader)
            for utid, (freq, samples) in reader:
                print(utid)
                output_file = os.path.join(output_path / 'wav', f'{utid}.wav')
                print(output_file)
                if os.path.exists(output_file):
                    print('file already exists')
                    continue
                samples = samples / (np.iinfo(np.int16).max + 1)
                if anon_level == 'spk':
                    assert utid in utt2spk, f'Failed to find speaker ID for utterance {utid}'
                    spid = utt2spk[utid]
                    random.seed(np.abs(hash(spid)))
                rand_mc_coeff = random.uniform(settings['mc_coeff_min'], settings['mc_coeff_max'])

                samples = anonym(freq=freq, samples=samples,
                                 winLengthinms=settings['winLengthinms'],
                                 shiftLengthinms=settings['shiftLengthinms'],
                                 lp_order=settings['n_coeffs'], mcadams=rand_mc_coeff)

                with wave.open(output_file, 'wb') as stream:
                    stream.setframerate(freq)
                    stream.setnchannels(1)
                    stream.setsampwidth(2)
                    stream.writeframes(samples)
                print(f'{utid} {output_file}', file=writer)
    print('Done')

def anonym(freq, samples, winLengthinms=20, shiftLengthinms=10, lp_order=20, mcadams=0.8):

    print(mcadams)
    eps = np.finfo(np.float32).eps
    samples = samples + eps

    # simulation parameters
    winlen = np.floor(winLengthinms * 0.001 * freq).astype(int)
    shift = np.floor(shiftLengthinms * 0.001 * freq).astype(int)
    length_sig = len(samples)

    # fft processing parameters
    NFFT = 2 ** (np.ceil((np.log2(winlen)))).astype(int)
    # analysis and synthesis window which satisfies the constraint
    wPR = np.hanning(winlen)
    K = np.sum(wPR) / shift
    win = np.sqrt(wPR / K)
    Nframes = 1 + np.floor((length_sig - winlen) / shift).astype(int)  # nr of complete frames

    # carry out the overlap-add FFT processing
    sig_rec = np.zeros([length_sig])  # allocate output + 'ringing' vector

    for m in np.arange(1, Nframes):
        # indices of the mth frame
        index = np.arange(m * shift, np.minimum(m * shift + winlen, length_sig))
        # windowed mth frame (other than rectangular window)
        frame = samples[index] * win
        # get lpc coefficients
        a_lpc = librosa.core.lpc(frame + eps, order=lp_order)
        # get poles
        poles = scipy.signal.tf2zpk(np.array([1]), a_lpc)[1]
        # indices of the complex poles
        ind_imag = np.where(np.isreal(poles) == False)[0]
        # index of the first pole of each complex-conjugate pair
        ind_imag_con = ind_imag[np.arange(0, np.size(ind_imag), 2)]

        # here we define the new pole angles, shifted according to the mcadams coefficient
        # for angles > 1: coefficients > 1 expand the spectrum, while coefficients < 1 contract it
        # for angles < 1: coefficients > 1 contract the spectrum, while coefficients < 1 expand it
        # the choice of this value is strongly linked to the number of lpc coefficients:
        # a larger lpc order constrains the effect of the coefficient to very small variations,
        # a smaller lpc order allows for more flexibility
        new_angles = np.angle(poles[ind_imag_con]) ** mcadams
        # new_angles = np.angle(poles[ind_imag_con])**path[m]

        # make sure new angles stay between 0 and pi
        new_angles[np.where(new_angles >= np.pi)] = np.pi
        new_angles[np.where(new_angles <= 0)] = 0

        # copy of the original poles to be adjusted with the new angles
        new_poles = poles
        for k in np.arange(np.size(ind_imag_con)):
            # compute new poles with the same magnitude and the new angles
            new_poles[ind_imag_con[k]] = np.abs(poles[ind_imag_con[k]]) * np.exp(1j * new_angles[k])
            # applied also to the conjugate pole
            new_poles[ind_imag_con[k] + 1] = np.abs(poles[ind_imag_con[k] + 1]) * np.exp(-1j * new_angles[k])

        # recover new, modified lpc coefficients
        a_lpc_new = np.real(np.poly(new_poles))
        # get residual excitation for reconstruction
        res = scipy.signal.lfilter(a_lpc, np.array([1]), frame)
        # reconstruct the frame with the new lpc coefficients
        frame_rec = scipy.signal.lfilter(np.array([1]), a_lpc_new, res)
        frame_rec = frame_rec * win

        outindex = np.arange(m * shift, m * shift + len(frame_rec))
        # overlap add
        sig_rec[outindex] = sig_rec[outindex] + frame_rec

    sig_rec = (sig_rec / np.max(np.abs(sig_rec)) * (np.iinfo(np.int16).max - 1)).astype(np.int16)
    return sig_rec
    # scipy.io.wavfile.write(output_file, freq, np.float32(sig_rec))
    # awk -F'[/.]' '{print $5 " sox " $0 " -t wav -R -b 16 - |"}' > data/$dset$anon_data_suffix/wav.scp
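For orientation, here is a minimal usage sketch for the `anonym()` function defined above. The file paths, speaker ID, and McAdams coefficient range are example values chosen to mirror the settings keys read in `process_data()`; they are not prescribed by the toolkit.

```python
# Minimal usage sketch for anonym() above; paths, speaker ID and the coefficient
# range are example values mirroring the settings keys used in process_data().
import random

import librosa
import numpy as np
import scipy.io.wavfile

settings = {
    'winLengthinms': 20,      # analysis window length in ms
    'shiftLengthinms': 10,    # frame shift in ms
    'n_coeffs': 20,           # LPC order
    'mc_coeff_min': 0.5,      # example lower bound for the random McAdams coefficient
    'mc_coeff_max': 0.9,      # example upper bound
}

samples, freq = librosa.load('input.wav', sr=None)        # float samples in [-1, 1]
random.seed(np.abs(hash('example-speaker-id')))           # speaker-level seeding, as in process_data()
mc_coeff = random.uniform(settings['mc_coeff_min'], settings['mc_coeff_max'])

anonymized = anonym(freq=freq, samples=samples,
                    winLengthinms=settings['winLengthinms'],
                    shiftLengthinms=settings['shiftLengthinms'],
                    lp_order=settings['n_coeffs'], mcadams=mc_coeff)

scipy.io.wavfile.write('output_anon.wav', freq, anonymized)   # anonym() returns int16 samples
```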
30 changes: 15 additions & 15 deletions anonymization/modules/speaker_embeddings/anonymization/pool_anon.py
@@ -19,7 +19,7 @@
logger = logging.getLogger(__name__)

REVERSED_GENDERS = {
    "m": "f",
    "f": "m"
}

@@ -29,8 +29,8 @@ class PoolAnonymizer(BaseAnonymizer):
An implementation of the 'Pool' anonymization method, that is based on the
primary baseline of the Voice Privacy Challenge 2020.
For every source x-vector, an anonymized x-vector is computed by finding
the N farthest x-vectors in an external pool (LibriTTS train-other-500)
according to the PLDA distance, and by averaging N∗ randomly selected
vectors among them. In the baseline, we use:
N = 200,
@@ -51,7 +51,7 @@ def __init__(
scaling: str = None,
stats_per_dim_path: Union[str, PathLike] = None,
distance_model_path: Union[str, PathLike] = "distances/plda/libritts_train_other_500_xvector",
embed_model_path: Union[str, PathLike] = None,
emb_model_path: Union[str, PathLike] = None,
save_intermediate: bool = False,
suffix: str = "_anon",
**kwargs,
Expand All @@ -63,11 +63,11 @@ def __init__(
device (Union[str, torch.device, int, None]): Device to use for
the procedure, e.g. 'cpu', 'cuda', 'cuda:0', etc.
model_name (str): Name of the model, used for distances that
require a model (e.g., PLDA).
pool_data_dir (Union[str, PathLike]): Path to the audio data
which will be used for x-vector pool extraction.
pool_vec_path (Union[str, PathLike]): Path to the stored
@@ -81,10 +81,10 @@
distance (str): Distance measure, either 'plda' or 'cosine'.
cross_gender (bool): Whether to switch genders of the speakers
during anonymization.
proximity (str): Proximity measure, determining which vectors in
the pool are the 'fittest', can be either 'farthest',
'nearest' or 'center'.
scaling (str): Scaling method to use, can be either 'minmax' or
@@ -97,7 +97,7 @@
distance_model_path (Union[str, PathLike]): Path to the stored
distance model (required for PLDA).
embed_model_path (Union[str, PathLike]): Path to the directory
emb_model_path (Union[str, PathLike]): Path to the directory
containing the speaker embedding model.
save_intermediate (bool): Whether to save intermediate results.
@@ -113,7 +113,7 @@

self.model_name = model_name if model_name else f"pool_{vec_type}"

self.N = N
self.N_star = N_star
self.proximity = proximity
self.cross_gender = cross_gender
@@ -123,7 +123,7 @@
self.pool_embeddings = self._load_pool_embeddings(
pool_data_dir=Path(pool_data_dir).expanduser(),
pool_vec_path=Path(pool_vec_path).expanduser(),
embed_model_path=Path(embed_model_path).expanduser(),
emb_model_path=Path(emb_model_path).expanduser(),
)
self.pool_genders = {
gender: [
@@ -149,15 +149,15 @@
self.scaling = scaling
self.stats_per_dim_path = stats_per_dim_path or Path()

def _load_pool_embeddings(self, pool_data_dir, pool_vec_path, embed_model_path):
def _load_pool_embeddings(self, pool_data_dir, pool_vec_path, emb_model_path):
logger.debug(pool_data_dir)
if pool_vec_path.exists():
pool_embeddings = SpeakerEmbeddings(
vec_type=self.vec_type, emb_level="spk", device=self.device
)
pool_embeddings.load_vectors(pool_vec_path)
else:
extraction_settings = {"vec_type": self.vec_type, "emb_level": "spk", "embed_model_path": embed_model_path}
extraction_settings = {"vec_type": self.vec_type, "emb_level": "spk", "emb_model_path": emb_model_path}
emb_extractor = SpeakerExtraction(
results_dir=pool_vec_path,
devices=[self.device],
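To make the diff above easier to follow, here is a compact NumPy sketch of the selection-and-averaging step described in the `PoolAnonymizer` docstring (take the N farthest pool x-vectors according to some distance, then average N∗ randomly chosen vectors among them). This is an illustration rather than the toolkit's implementation, and the `N_star=100` default is an assumption; the excerpt above only shows `N = 200`.

```python
# Sketch of the pool-based anonymization step described in the docstring above.
# 'distances' holds the distance (e.g. PLDA or cosine) from the source x-vector
# to every pool x-vector; 'pool_vectors' holds the pool x-vectors themselves.
import numpy as np

def pool_anonymize(distances, pool_vectors, N=200, N_star=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    farthest = np.argsort(distances)[-N:]                       # indices of the N farthest vectors
    chosen = rng.choice(farthest, size=N_star, replace=False)   # pick N* of them at random
    return pool_vectors[chosen].mean(axis=0)                    # average = anonymized x-vector
```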
@@ -261,16 +261,18 @@ def sample(self, num_samples):
# Remove color channel
return generated_data.data.cpu().numpy()[:, 0, :, :]

def save_model_checkpoint(self, model_path, model_parameters, timestampStr):
def save_model_checkpoint(self, model_path, model_parameters, timestampStr, dataset_mean, dataset_std):
# dateTimeObj = datetime.now()
# timestampStr = dateTimeObj.strftime("%d-%m-%Y-%H-%M-%S")
name = '%s_%s' % (timestampStr, 'wgan')
model_filename = os.path.join(model_path, name)
torch.save({
'generator_state_dict' : self.G.state_dict(),
'critic_state_dict' : self.D.state_dict(),
'gen_optimizer_state_dict' : self.G_opt.state_dict(),
'generator_state_dict': self.G.state_dict(),
'critic_state_dict': self.D.state_dict(),
'gen_optimizer_state_dict': self.G_opt.state_dict(),
'critic_optimizer_state_dict': self.D_opt.state_dict(),
'model_parameters' : model_parameters,
'iterations' : self.num_steps
'model_parameters': model_parameters,
'iterations': self.num_steps,
'mean': dataset_mean,
'std': dataset_std
}, model_filename)
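The hunk above extends the saved WGAN checkpoint with the dataset mean and standard deviation. As a hedged sketch, a checkpoint written this way could be read back as follows; the key names are taken from the `torch.save()` call shown in the diff, while the function itself and the file path are illustrative assumptions.

```python
# Sketch: reading back a checkpoint written by save_model_checkpoint() above.
# Only the keys visible in the diff are used; the path is a placeholder.
import torch

def load_wgan_checkpoint(path):
    ckpt = torch.load(path, map_location='cpu')
    states = {
        'generator': ckpt['generator_state_dict'],
        'critic': ckpt['critic_state_dict'],
    }
    # The newly stored statistics allow generated samples to be de-normalized
    # later, e.g. samples = generated * ckpt['std'] + ckpt['mean'].
    return states, ckpt['model_parameters'], ckpt['mean'], ckpt['std']
```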
@@ -26,7 +26,7 @@ def __init__(self, devices: list, settings: dict, results_dir: Path = None, mode
self.save_intermediate = save_intermediate
self.force_compute = force_compute if force_compute else settings.get('force_compute_extraction', False)

self.embed_model_path = settings['embed_model_path']
self.emb_model_path = settings['emb_model_path']
self.vec_type = settings['vec_type']
self.emb_level = settings['emb_level']

@@ -42,7 +42,7 @@ def __init__(self, devices: list, settings: dict, results_dir: Path = None, mode

self.model_hparams = {
'vec_type': self.vec_type,
'model_path': self.embed_model_path,
'model_path': self.emb_model_path,
}

self.extractors = [create_extractors(hparams=self.model_hparams, device=device) for device, process in zip(cycle(devices), range(len(devices)))]
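The hunk above renames the `embed_model_path` settings key to `emb_model_path`. For orientation, here is a hypothetical settings dictionary for this extraction module, using only the keys that appear in the diff; the values are placeholders, not defaults shipped with the toolkit.

```python
# Hypothetical settings for the speaker extraction module above; keys are taken
# from the diff, values are placeholders.
extraction_settings = {
    'emb_model_path': 'exp/models/speaker_embedding',  # directory of the embedding model
    'vec_type': 'xvector',                             # type of speaker vector to extract
    'emb_level': 'spk',                                # 'spk' = one embedding per speaker
    'force_compute_extraction': False,                 # recompute even if vectors already exist
}
```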
2 changes: 1 addition & 1 deletion anonymization/modules/tts/IMSToucan/requirements.txt
@@ -15,7 +15,7 @@ pyworld
scipy
segments
sentencepiece
sklearn
scikit-learn
sounddevice
SoundFile
speechbrain==0.5.10