Add support for Hunyuan-Foley #10052
base: master
Conversation
Force-pushed from 12824ea to fee1e57
the trimming fn needs an update because of the over-trimming
Kosinkadink
left a comment
Sorry for the delay on the review!
Comfy and I had trouble getting the model to run, so it would be great if you could provide links to the checkpoints/models plus the input video so we can replicate your results.
The main feedback here is:
- There are some odd nodes that were added - the Resample Video node returns an improper Video output, and the Video To Image node is a duplicate of what the Get Video Components node already does - it provides images, audio, and fps.
- The Encode Video node should probably be split in two - it is unclear whether only one of the optional inputs should be filled at a time before being plugged into HunyuanFoleyConditioning. It might be better for all the video-encoding work to go into one node, but I did notice in the workflow image that two different resamples happen before being plugged into Encode Video. Is this some oddity from the source code?
- According to Comfy, you only need tokenizer.json to get what you need; the repo the files are pulled from has both the original .json/.txt files and their merged versions. You can reference comfy/sd1_tokenizer to see what you'd actually need here.
- See if there can be more code reuse between the ClipTextModel and ClapTextModel stuff, if possible.
comfy_extras/nodes_video.py
Outdated
frames.append(arr)
next_time += step

return io.NodeOutput(torch.stack(frames))
The output is declared as io.Video, but here no Video wrapper is used around the torch frames, so it breaks compatibility with nodes that expect a Video object.
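For illustration, a minimal sketch of one way to wrap the stacked frames in a real Video object, assuming the VideoFromComponents / VideoComponents helpers that other ComfyUI video nodes (e.g. CreateVideo) use; exact import paths and field names may differ in this PR:

from fractions import Fraction
import torch

from comfy_api.input_impl import VideoFromComponents
from comfy_api.util import VideoComponents

def frames_to_video(frames: list[torch.Tensor], fps: float):
    # Stack per-frame tensors into a single [T, H, W, C] image batch, then wrap
    # them so downstream nodes receive a proper Video rather than raw frames.
    images = torch.stack(frames)
    return VideoFromComponents(
        VideoComponents(images=images, audio=None, frame_rate=Fraction(fps))
    )

The node's return line would then become something like return io.NodeOutput(frames_to_video(frames, fps)).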
comfy_extras/nodes_audio.py
Outdated
std[std < 1.0] = 1.0
audio /= std
return ({"waveform": audio, "sample_rate": 44100}, )
sample_rate = vae.first_stage_model.decode_sample_rate or 44100
Comfy says there should be a different way of doing this; for maintainability, he'd prefer this be turned into a function on the VAE first_stage_model object rather than how it is handled here.
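For illustration, a minimal sketch of that suggestion, assuming a decode_sample_rate attribute like the one referenced in the diff; the class and method names below are placeholders, not the actual ComfyUI API:

class FoleyAudioVAE:  # stands in for the actual first_stage_model class
    def __init__(self, decode_sample_rate=None):
        self.decode_sample_rate = decode_sample_rate

    def output_sample_rate(self) -> int:
        # Single place that owns the fallback, so node code stays simple.
        return self.decode_sample_rate or 44100

# Node-side usage then becomes:
# sample_rate = vae.first_stage_model.output_sample_rate()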
comfy_extras/nodes_hunyuan_foley.py
Outdated
embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim=0)

x = siglip_encoding_1
positive_tensor = CpuLockedTensor(torch.cat([torch.zeros_like(embeds), text_encoding_negative]))
The use of a custom tensor class should not be necessary. What was the reason this was done? If there is some issue, it should be handled in a different way.
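As one possible alternative, a minimal sketch of keeping the conditioning on CPU without a custom tensor subclass, by deferring the device move to the point of use; the function and key names here are hypothetical, not the PR's actual API:

import torch

def build_conditioning(embeds: torch.Tensor, text_encoding_negative: torch.Tensor) -> dict:
    # Store plain CPU tensors in the conditioning dict; no special class pins them.
    negative = torch.cat([torch.zeros_like(embeds), text_encoding_negative])
    return {"negative": negative.cpu()}

def use_conditioning(cond: dict, device: torch.device) -> torch.Tensor:
    # Move to the target device only when the sampler actually needs the tensor.
    return cond["negative"].to(device, non_blocking=True)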
@Kosinkadink The EncodeVideo node should have either the VAE or a clip vision as an input, but not both. The two different resamples come from the source code, which uses different frame rates for the Synchformer and SigLIP encoders. I tried with tokenizer.json only, but AutoTokenizer fell back to the slow tokenizer format, which requires protobuf and errors out in the process. The CLAP encoder has parts from ALIGN, Swin, and BERT; I reused the BERT classes from the Comfy implementation. I attached the input and the workflow I used below: _-.1.mp4
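For reference, one way to sidestep the slow-tokenizer/protobuf fallback is to load tokenizer.json directly with the tokenizers library instead of AutoTokenizer; a minimal sketch, with no claim that it slots into the CLAP pipeline here as-is:

from tokenizers import Tokenizer

# Load the fast tokenizer definition directly; no protobuf or sentencepiece needed.
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("footsteps on gravel, birds chirping")
print(enc.ids)     # token ids
print(enc.tokens)  # token strings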
If the node needs only one but not both inputs, then currently the best move would be to make two separate nodes.
I am also interested in this implementation. I was searching yesterday for video-to-audio models, found this Hunyuan-Foley model, and used it with this custom node: https://github.com/phazei/ComfyUI-HunyuanVideo-Foley.git. It works, but it downloads some additional files from Hugging Face (besides the model's safetensors) into a temporary cache folder, which I later deleted by mistake, and now I have to download them again. If this PR avoids that, it would be nice.
@yousef-rafat Could you also link all the models that are needed to run this? The first time I went to review this, it took over an hour trying to hunt down the models, so having the links would make things smooth, especially if I move between different computers. Having a link -> comfy folder to put it in would be great!
@yousef-rafat I played the video, but it seems it has no audio? For a video-to-audio model, the audio part is the most interesting.
@Kosinkadink This is the code for getting the checkpoints: one for the model and one for the SigLIP.

!pip install huggingface_hub==0.34.0
from huggingface_hub import hf_hub_download

REPO_ID = "yousefg/comfyui-hunyuan-foley"
FILENAME = "merged_dac_model.pth"

try:
    file_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        repo_type="model",
        local_dir=".",
    )
    print("✅ Downloaded successfully to:", file_path)
except Exception as e:
    print("❌ Download failed:", e)

import torch
from transformers import AutoModel

siglip_sd = AutoModel.from_pretrained("google/siglip2-base-patch16-512").vision_model.state_dict()
new_sd = dict()
for k, v in siglip_sd.items():
    new_sd[f"vision_model.{k}"] = v
torch.save(new_sd, "siglip_model.pth")
@jovan2009 The model is made to generate the audio given a video and a text input.
I know what it does; I am already using it with the custom node I mentioned above. I don't like that custom node because it downloads files outside the ComfyUI models folder - files that my file-cleaning application deleted as "temporary". Looking at your previous comment, I come to the conclusion that your implementation does the same, sort of, as far as I can figure out; I am not a software developer. Edit:
@jovan2009 This is a native implementation, so there's no automatic download of files like there may be in custom nodes. You download the main model checkpoint and add it in /checkpoints, and the siglip model in /clip_vision. No automatic or repeated downloads should happen.
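For anyone checking placement, a small sketch of how a native node typically resolves those files from the ComfyUI models folders, assuming the standard folder_paths helpers; the exact loader names in this PR may differ:

import folder_paths

# Files dropped into models/checkpoints and models/clip_vision show up here;
# nothing is downloaded at runtime.
print(folder_paths.get_filename_list("checkpoints"))   # e.g. ["merged_dac_model.pth", ...]
print(folder_paths.get_filename_list("clip_vision"))   # e.g. ["siglip_model.pth", ...]

# A loader node would resolve the absolute path like this:
ckpt_path = folder_paths.get_full_path("checkpoints", "merged_dac_model.pth")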
@Kosinkadink I have split the nodes into Encode Video VAE and Encode Video CLIP.
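For readers following along, a rough sketch of what such a split can look like, written against ComfyUI's classic node API for brevity; the PR itself uses the newer io schema, and these class and field names are illustrative only:

class EncodeVideoVAESketch:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"frames": ("IMAGE",), "vae": ("VAE",)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "encode"
    CATEGORY = "conditioning/video"

    def encode(self, frames, vae):
        # VAE path: per-frame latents for the Foley conditioning.
        return ({"samples": vae.encode(frames[:, :, :, :3])},)


class EncodeVideoCLIPSketch:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"frames": ("IMAGE",), "clip_vision": ("CLIP_VISION",)}}

    RETURN_TYPES = ("CLIP_VISION_OUTPUT",)
    FUNCTION = "encode"
    CATEGORY = "conditioning/video"

    def encode(self, frames, clip_vision):
        # CLIP-vision path: image embeddings (SigLIP here) for the conditioning.
        return (clip_vision.encode_image(frames),)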
Also includes support for SynchFormer and Clap encoders, as well as three new video nodes.
