Another scenario is that if you open the domain in a browser, they do a 301 redirect, but for traffic coming from Google or other search engines, they serve their actual content.
I would check out React Three Fiber if you want to see how people are building things like this. It essentially brings a component model to Three.js development and creates good standards for shareable code, since things are just React hooks.
Rapier was brand new when I was making things in R3F 2 years ago. Glad to see how mature it’s gotten!
For such models, is it possible to fine-tune with multiple images of the main actor?
Sorry if this question sounds dumb, but I am comparing it with regular image models, where the more images you have, the better the outputs you can generate from the model.
It is possible to fine-tune the model with videos of a specific actor, but not images. You need videos to train the model.
We actually did this in early overfitting experiments (to confirm our code worked!), and it worked surprisingly well. This is exciting to us, because it means we can have actor-specific models that learn the idiosyncratic gestures of a particular person.
No, not related. We just took some of Loopy's demo images + audios since they came out 2 days ago and people were aware of them. We want to do an explicit side-by-side at some point, but in the meantime people can make their own comparisons, i.e. compare how the two models perform on the same inputs.
Loopy is a Unet-based diffusion model, ours is a diffusion transformer. This is our own custom foundation model we've trained.
This took me a minute - your output demos are your own, but you included some of their inputs, to make for an easy comparison? Definitely thought you copied their outputs at first and was baffled.
Exactly. Most talking avatar papers re-use each other's images + audio in their demo clips. It's just a thing everyone does... we never thought people would take it to mean we didn't train our own model!
Anyone who wants to can re-make all the videos themselves with our model by extracting the 1st frame and the audio.
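If it helps, here is a minimal sketch of that extraction with ffmpeg (this is not our tooling, and the file names are just placeholders):

```python
# Sketch: pull the first frame and the audio track out of a demo clip with ffmpeg,
# so the same inputs can be fed back into the model for a side-by-side comparison.
import subprocess

def extract_inputs(video_path: str,
                   frame_out: str = "first_frame.png",
                   audio_out: str = "audio.wav") -> None:
    # Grab only the first video frame.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-frames:v", "1", frame_out],
        check=True,
    )
    # Drop the video stream and decode the audio to WAV.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", audio_out],
        check=True,
    )

extract_inputs("demo_clip.mp4")  # placeholder file name
```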
Yes, exactly! We just wanted to make it easy to compare. We also used some inputs from other famous research papers for comparison (EMO and VASA). But all videos we show on our website/blog are our own. We don't host videos from any other model on our website.
Also, Loopy is not available yet (they just published the research paper). But you can try our model today, and see if it lives up to the examples : )
Examples are very impressive, here's hoping we get an implementation of it on huggingface soon so we can try it out, and even potentially self-host it later.
I know these guys in real life, they've been working on this for months and, unlike the ByteDance paper, have actually shipped something you can try yourself.
Our transformer model was trained to generate videos that are up to 8s in length. However, we can make longer videos by using it in an autoregressive manner, taking the last N frames of output i to seed output (i+1). It is important to use more than just 1 frame; otherwise, the direction of movement can suddenly change, which looks very uncanny. Admittedly, the autoregressive approach tends to accumulate errors with each generation.
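As a rough sketch of that windowed rollout (generate_chunk, the audio slicing, and the window size are placeholders, not our actual code):

```python
# Sketch of autoregressive long-video generation: each chunk is seeded with the
# last N frames of the previous chunk so motion stays continuous across chunks.
N_SEED_FRAMES = 8      # placeholder; using more than 1 frame avoids sudden motion reversals
CHUNK_SECONDS = 8      # one forward pass covers up to ~8 s

def generate_long_video(first_frame, audio, total_seconds, generate_chunk):
    frames = [first_frame]
    seconds_done = 0
    while seconds_done < total_seconds:
        seed = frames[-N_SEED_FRAMES:]  # last N frames of the previous output
        audio_window = audio.slice(seconds_done, seconds_done + CHUNK_SECONDS)  # placeholder API
        chunk = generate_chunk(seed_frames=seed, audio=audio_window)  # one forward pass
        frames.extend(chunk)            # errors accumulate chunk by chunk
        seconds_done += CHUNK_SECONDS
    return frames
```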
It is also possible to fine-tune the model so that single generations (one forward pass of the model) are longer than 8s, and we plan to do this. In practice, it just means our batch sizes have to be smaller when training.
Right now, we've limited the public tool to only allow videos up to 30s in length, if that is what you were asking.
Video compression algorithms use key frames. So can’t you do the same thing? Essentially, generate five seconds. Then pull out the last frame. Use some other AI model to enhance it (upscale, consistency with the original character, etc.). Then use that as the input for the next five seconds?
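Roughly, the loop I have in mind looks like this (generate_segment and enhance_frame are hypothetical stand-ins, not any specific tool):

```python
# Sketch of the proposed keyframe approach: generate a few seconds, clean up the
# last frame with a separate enhancement model, then seed the next segment with it.
def keyframe_loop(start_frame, audio, total_seconds,
                  generate_segment, enhance_frame, segment_seconds=5):
    frames = []
    seed = start_frame
    for t in range(0, total_seconds, segment_seconds):
        segment = generate_segment(seed, audio[t:t + segment_seconds])  # placeholder slicing
        frames.extend(segment)
        # Re-anchor the next segment on a cleaned-up keyframe so identity drift
        # and upscaling artifacts don't compound across segments.
        seed = enhance_frame(segment[-1], reference=start_frame)
    return frames
```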
This is a good idea. We have discussed incorporating an additional "identity" signal to the conditioning, but simply enforcing consistency with the original character as a post-processing step would be a lot easier to try. Are there any tools you know of that do that?