Recent results have shown that training on synthetic data allows for reasonable zero-shot performance on unseen podcast data. I tried a model trained purely on synthetic data on a Joe Rogan podcast, and it worked with some hiccups. This suggests we're on the right track here.
I ran an experiment where I made the problem harder on the synthetic data by adding noise and varying the volume. This was interestingly bad: the model stopped working altogether. So now I'm going down the opposite route and making the problem easier for the model by denoising all the audio, which can be treated as a normalisation step.
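For the denoising pass I'm thinking of something simple like spectral gating, e.g. the noisereduce package. A minimal sketch, with placeholder paths and parameters:

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Sketch of a denoising "normalisation" pass over the training audio.
# Paths and parameters are placeholders; noisereduce does spectral gating,
# a neural denoiser would be the heavier alternative.
y, sr = librosa.load("clip.wav", sr=16000)

# Non-stationary noise reduction so the gate adapts across the clip
y_clean = nr.reduce_noise(y=y, sr=sr, stationary=False)

sf.write("clip_denoised.wav", y_clean, sr)
```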
The model is also struggling a bit more now that I'm training with people interrupting each other. I had a dig into the data and I think it's messy: there are a lot of sections of silence labelled as speech, and so on. This could be solved by preprocessing the data with a VAD, but I've got to the point where I'm sick of how slow and manual my preprocessing is.
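If I do add a VAD pass, something like Silero VAD should be enough to drop the mislabelled silence. A sketch, with hypothetical paths and a hypothetical helper for filtering labels:

```python
import torch

# Sketch: filter out labelled segments that a VAD says contain no speech.
# Silero VAD via torch.hub; file path and thresholds are placeholders.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("clip.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
# e.g. [{'start': 0.5, 'end': 4.2}, {'start': 5.1, 'end': 9.8}, ...]

def overlaps_speech(label_start, label_end):
    """Keep a labelled segment only if it overlaps a detected speech region."""
    return any(s["start"] < label_end and s["end"] > label_start for s in speech)
```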
I'm also suspicious that mel spectrograms of overlapping speech are like trying to work out which numbers sum to 56. I think the model is struggling because this is inherently hard to do, so limiting the prediction window to 15 seconds of audio reduces the chances of more than 3 speakers appearing in a chunk. When I increased it to 5-6 speakers it wasn't working well at all, which this would explain: it's easier to work out which numbers could make 56 if you know there are only 2 of them.
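A quick way to see why I think of it as a sum: for roughly uncorrelated sources, power mel spectrograms are approximately additive, so the model only ever sees the total. The signals below are synthetic stand-ins, not real speech:

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(sr * 3) / sr
rng = np.random.default_rng(0)

# Two stand-in "speakers": a noise burst and a tone (just to make the point)
a = 0.1 * rng.standard_normal(len(t))
b = 0.1 * np.sin(2 * np.pi * 220 * t)

mel = lambda y: librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # power mel

m_mix = mel(a + b)          # what the model actually sees
m_sum = mel(a) + mel(b)     # the "56 = ? + ?" it has to invert

rel_err = np.abs(m_mix - m_sum).mean() / m_sum.mean()
print(f"mean relative difference between mel(a+b) and mel(a)+mel(b): {rel_err:.3f}")
```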
I'm quite interested to compare the mel approach with DAC again. Mel was better loss-wise, but that could have been down to a higher batch size. It may be that DAC is able to encode overlapping speech better.
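When I rerun that comparison, the DAC side would look roughly like this. This is from memory of the descript-audio-codec README, so the exact API is worth double-checking against the repo:

```python
import dac
from audiotools import AudioSignal

# Rough sketch of pulling DAC codes for a clip (API recalled from the
# descript-audio-codec README; verify before relying on it).
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)

signal = AudioSignal("clip.wav")
signal.to(model.device)

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)  # `codes` would be the discrete tokens to train on
```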
One other thing: most of the time we're really only interested in speech that contains 2 speakers.
My biggest bottleneck right now is that I'm processing around a TB of data on a single machine, so it takes a while to evaluate each round of processing and training.
Another realisation: I've shown that this quantized time encoding technique works. The next stage is really to clean up the data and train a larger model. I realised that what I've built is, to some extent, a decoder-only Whisper model that performs diarisation specifically. What if we built a model that could be seen as a Swiss army knife of audio annotation? After reading the Seamless paper https://ai.meta.com/research/seamless-communication/ and listening to the results, I concluded that the Meta team were trying to do way too much there. Doing something similar but focusing on audio annotation tasks would probably work better. That way the encoding of the audio can be lossy (mels) and the model doesn't need to be generative.
A multi-task model purely for transcribing speech events, all of which would be useful for TTS.
Training a model to do all these tasks at the same time will no doubt improve its performance on the individual tasks (as the Whisper paper says!).
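To make the quantized time encoding idea concrete, this is roughly the shape of the target sequence. Token names, the time resolution, and the task prefix below are illustrative placeholders, not the exact scheme:

```python
# Illustrative sketch of serialising diarisation labels into a Whisper-style
# decoder target sequence with quantized time tokens. The token names, the
# 20 ms resolution, and the task prefix are all placeholders.
TIME_RESOLUTION = 0.02  # seconds per time token

def to_tokens(segments, task="<|diarise|>"):
    """segments: list of (start_s, end_s, speaker_id) within one audio window."""
    tokens = [task]
    for start, end, spk in sorted(segments):
        tokens.append(f"<|t_{round(start / TIME_RESOLUTION)}|>")  # quantized start
        tokens.append(f"<|spk_{spk}|>")                           # speaker label
        tokens.append(f"<|t_{round(end / TIME_RESOLUTION)}|>")    # quantized end
    return tokens

print(to_tokens([(0.0, 4.2, 0), (3.8, 7.5, 1)]))
# ['<|diarise|>', '<|t_0|>', '<|spk_0|>', '<|t_210|>', '<|t_190|>', '<|spk_1|>', '<|t_375|>']
```

Other annotation tasks would just swap the task token and the payload tokens, which is what makes the Swiss army knife framing appealing.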