I started in music but have since edited thousands of hours of podcasts. I cannot transcribe a track by looking at the waveform, except the word "um" haha. But without playing the audio I can tell you where words start and end, whether a peak is a B or a T or an A or an I sound... And Melodyne can add layers to that and tell me the pitch, formants (vowels), quantize the syllables, etc. If I can do all this, a computer ought to be able to do the same and more.
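(For what it's worth, the "computer ought to be able to do the same" part is already a few lines of code for the basics. A rough sketch below, assuming librosa is installed; the file name and the 30 dB threshold are placeholders you'd swap and tune per recording, and it only finds speech boundaries and pitch, nothing like Melodyne's full analysis.)

```python
# Sketch: find word/phrase boundaries and pitch straight from the waveform,
# no transcription involved. Assumes `pip install librosa soundfile`.
import librosa
import numpy as np

y, sr = librosa.load("episode.wav", sr=None, mono=True)  # placeholder file name

# Split on regions quieter than 30 dB below peak; each interval is roughly
# "something was said here", i.e. the word/phrase boundaries you see by eye.
intervals = librosa.effects.split(y, top_db=30)
for start, end in intervals:
    print(f"speech from {start / sr:.2f}s to {end / sr:.2f}s")

# Pitch over time (one of the layers Melodyne adds) via probabilistic YIN.
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
print(f"median pitch of voiced frames: {np.nanmedian(f0):.1f} Hz")
```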
Hundreds of hours here, and I can't even always reliably spot my own ums. I edit as many out as I possibly can for myself, my co-host and guests, as well as eliminating continuation-signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air. This is boring and tedious work, but I think it makes the end result considerably better.
I feel like there should be a model that can do much of this for me, but I haven't really looked into it, partly (ironically) out of laziness, and partly because I edit across multiple tracks at this stage and I'm wary of feeding a model an already-mixed stereo track. I'm curious why you still do it manually, if you still do, and whether you've looked into alternatives.
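For what it's worth, here's roughly the per-track pass I'd expect such a tool to make, sketched with openai-whisper's word timestamps. The file name and filler list are made up, Whisper often normalizes "um" away so recall won't be perfect, and it only prints candidate cuts for you to review in the editor rather than touching the audio.

```python
# Sketch: flag filler words and dead air on a single (unmixed) track.
# Assumes `pip install openai-whisper`; run once per track, before the mix.
import whisper

FILLERS = {"um", "uh", "like"}   # single-word fillers; "you know" would need a two-word lookahead
MAX_GAP_SECONDS = 1.5            # "dead air" threshold, tune to taste

model = whisper.load_model("base")
result = model.transcribe("host_track.wav", word_timestamps=True)  # placeholder file name

prev_end = 0.0
for segment in result["segments"]:
    for w in segment.get("words", []):
        word = w["word"].strip().lower().strip(".,!?")
        if w["start"] - prev_end > MAX_GAP_SECONDS:
            print(f"dead air: {prev_end:.2f}s-{w['start']:.2f}s")
        if word in FILLERS:
            print(f"filler '{word}': {w['start']:.2f}s-{w['end']:.2f}s")
        prev_end = w["end"]
```

The output is just a cut list, which sidesteps the mixed-stereo worry: each speaker's track gets its own pass, and you decide what actually gets removed.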
> I edit as many out as I possibly can for myself, my co-host and guests, as well as eliminating continuation-signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air.
Hopefully using Ardour's "Ripple - Interview" mode :))
I use Descript to edit videos/podcasts and it works great for this kind of thing! It transcribes your audio and then you can edit it as if you were editing text.
Yeah, that stuff is just freaking amazing. I don't know what the transcription quality is like, but if I were doing this as a job and the transcription was good, I'd definitely be using that all the time.