The model seems to process the entire audio at once, which leads to high VRAM usage for long audio. I was trying to compute MERT on a 9:58 audio with an A100 80GB GPU, and it tried to allocate 90 GB of VRAM.

Is it possible to split the audio first, process each segment, and obtain the same results? I tried to split the audio into 60s windows using the code below. Even though I managed to get the segmented embedding into the same shape, it gives a large mean squared error compared with the original calculation where the entire audio is passed in at once.
```python
window_length = int(self.sr * 60)        # 60 seconds
overlap_length = int(self.sr * 4.987)    # 4.987 seconds (5s window - 1 * 75Hz framerate)
overlap_frames = int(4.987 * 75) - 1     # 75 Hz frame rate
embeddings = []

print("Audio shape:", audio.shape)
print("Window length:", window_length)
print("Overlap length:", overlap_length)
print("Overlap frames:", overlap_frames)

# Iterate over audio with overlap
for start in range(0, audio.shape[0], window_length - overlap_length):
    end = start + window_length
    segment = audio[start:end]
    print("Segment:", segment.shape)
    # if len(segment) < window_length:
    #     break

    # Process each segment
    inputs = self.processor(segment, sampling_rate=self.sr, return_tensors="pt").to(self.device)
    with torch.no_grad():
        out = self.model(**inputs, output_hidden_states=True)
    out = torch.stack(out.hidden_states).squeeze()  # [13 layers, timeframes, 768]
    out = out[11]                                   # [timeframes, 768]
    print("Frames before:", out.shape[0])

    # Remove overlap from the end of the segment
    if end < audio.shape[0]:
        out = out[:-overlap_frames, :]
    print("Frames after:", out.shape[0])
    embeddings.append(out)

# Stack embeddings for all segments
out = torch.cat(embeddings, dim=0)
return out
```
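As a side check, the window/overlap arithmetic can be verified in isolation from the model. This is a minimal sketch; 24 kHz is MERT-v1's expected sample rate, and `chunk_bounds` is a hypothetical helper name, not part of any library:

```python
def chunk_bounds(num_samples, sr=24000, window_s=60, overlap_s=4.987):
    """Yield (start, end) sample indices for overlapping windows,
    mirroring the loop above: step = window - overlap."""
    window = int(sr * window_s)     # 1_440_000 samples at 24 kHz
    overlap = int(sr * overlap_s)   # 119_688 samples
    step = window - overlap
    for start in range(0, num_samples, step):
        yield start, min(start + window, num_samples)

# A 4-minute clip at 24 kHz splits into 5 overlapping windows:
bounds = list(chunk_bounds(24000 * 240))
```

Every sample is covered and consecutive windows share the overlap region, so the arithmetic itself is sound; the discrepancy has to come from the model, not the indexing.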
Here is the absolute error between the original and the segmented calculations for a 4-minute audio, plotted on a graph. Oddly, the overlapping regions are not the only areas affected; the error bleeds into the entire rest of each segment.
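That bleed is what you would expect if every output frame depends on the whole input window, which is the case for self-attention in MERT's transformer layers: shrink the window and every frame's context changes, not just the frames near the boundary. A toy illustration of that effect (not MERT's actual computation, just a stand-in where each frame mixes in a global statistic of its window):

```python
def with_global_context(x):
    """Toy stand-in for self-attention: each output mixes in the mean
    of the *entire* input it was processed with."""
    m = sum(x) / len(x)
    return [v + m for v in x]

# Processing the full signal vs. two halves gives different values at
# EVERY position, because the global mean differs per chunk.
full = with_global_context(list(range(8)))
chunked = with_global_context(list(range(4))) + with_global_context(list(range(4, 8)))
```

So with a globally-attending model, chunked inference can only approximate the full-audio result; the overlap trim fixes the frame count but not the context mismatch.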