
Conversation

@yousef-rafat (Contributor) commented Sep 27, 2025

Also includes support for Synchformer and CLAP encoders, as well as three new video nodes.
(screenshot attached: 2025-11-20 180422)

@yousef-rafat marked this pull request as draft on September 28, 2025 at 16:37
@Kosinkadink added the Core (Core team dependency) label on Sep 30, 2025
@yousef-rafat marked this pull request as ready for review on October 13, 2025 at 18:11
@Kosinkadink (Collaborator) left a comment


Sorry for the delay on the review!

Comfy and I had trouble getting the model to run, so it would be great if you could provide links to the checkpoints/models plus an input video so we can replicate your results.

The main feedback here is:

  1. There are some odd nodes that were added: the Resample Video node returns an improper Video output, and the Video To Image node duplicates what the Get Video Components node already does (provides images, audio, and fps).
  2. The Encode Video node should probably be split in two; it is unclear whether only one of the optional inputs should be filled at a time before being plugged into HunyuanFoleyConditioning. It might be better for all of the video encoding to go into one node, but I did notice in the workflow image that two different resamples happen before being plugged into Encode Video. Is this some oddity from the source code?
  3. According to comfy, tokenizer.json alone is all you need; the repo the files are pulled from has both the original .json/.txt files and their merged versions. You can reference comfy/sd1_tokenizer to see what you'd actually need here.
  4. See if there can be more code reuse between the ClipTextModel and ClapTextModel code, if possible.

frames.append(arr)
next_time += step

return io.NodeOutput(torch.stack(frames))
Collaborator:


The output is declared as io.Video, but here it is not using any Video wrapper around the torch frames, so it breaks compatibility with the Video type.
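A minimal sketch of the fix being asked for, assuming the VideoFromComponents / VideoComponents helpers from the current comfy_api (the import paths and field names are assumptions; adjust them to the ComfyUI version being targeted):

# Wrap the resampled frames in a proper Video object instead of returning a bare tensor,
# so the output actually matches the node's declared io.Video type.
from fractions import Fraction
import torch
from comfy_api.latest import io                       # assumed import path for the io schema
from comfy_api.input_impl import VideoFromComponents  # assumed import path
from comfy_api.util import VideoComponents            # assumed import path

def resampled_video_output(frames: list[torch.Tensor], fps: float, audio=None) -> io.NodeOutput:
    images = torch.stack(frames)  # [T, H, W, C] image batch, same stacking as in the diff above
    video = VideoFromComponents(VideoComponents(images=images, audio=audio, frame_rate=Fraction(fps)))
    return io.NodeOutput(video)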

std[std < 1.0] = 1.0
audio /= std
return ({"waveform": audio, "sample_rate": 44100}, )
sample_rate = vae.first_stage_model.decode_sample_rate or 44100
Collaborator:


Comfy says there should be a different way of doing this; for maintainability, he'd prefer that this be turned into a function on the VAE's first_stage_model object instead of how it is handled here.
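Something along these lines is presumably what is meant; a rough sketch with hypothetical names (decode_to_audio is made up, decode_sample_rate is the attribute already referenced in the diff, and the std computation is only illustrative), not the actual comfy VAE API:

# Hypothetical sketch: keep the sample rate and output normalization inside the
# first_stage_model, so node code calls one method instead of poking at internals.
import torch

class FoleyFirstStageModel(torch.nn.Module):
    decode_sample_rate = 44100  # fallback value used in the PR

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError  # stands in for the model's existing latent -> waveform decode

    def decode_to_audio(self, latents: torch.Tensor) -> dict:
        audio = self.decode(latents)
        std = audio.std(dim=-1, keepdim=True)  # illustrative; the PR computes std elsewhere
        std[std < 1.0] = 1.0                   # avoid amplifying near-silent output
        audio = audio / std
        return {"waveform": audio, "sample_rate": self.decode_sample_rate}

# The node code would then reduce to something like:
#     return (vae.first_stage_model.decode_to_audio(latents),)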

embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim = 0)

x = siglip_encoding_1
positive_tensor = CpuLockedTensor(torch.cat([torch.zeros_like(embeds), text_encoding_negative]))
Collaborator:


The use of a custom tensor class should not be necessary. What was the reason for doing it this way? If there is some underlying issue, it should be handled in a different way.
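The kind of alternative the reviewer seems to be suggesting, as a sketch (variable names mirror the diff above; CpuLockedTensor is simply dropped):

# Build the conditioning with ordinary tensors and move them to CPU explicitly,
# instead of subclassing Tensor to pin them there; downstream code can move them
# to the right device when it actually needs them.
import torch

def build_conditioning(siglip_encoding: torch.Tensor,
                       synchformer_encoding: torch.Tensor,
                       text_encoding_negative: torch.Tensor) -> torch.Tensor:
    embeds = torch.cat([siglip_encoding.cpu(), synchformer_encoding.cpu()], dim=0)
    return torch.cat([torch.zeros_like(embeds), text_encoding_negative.cpu()], dim=0)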

@yousef-rafat (Contributor, Author)

@Kosinkadink The EncodeVideo node should have either the VAE or a CLIP vision model as an input, but not both. The two different resamples come from the source code: the Synchformer and SigLIP encoders use different frame rates.
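For context, a minimal sketch of the kind of nearest-frame resampling involved, run once per target rate (the actual node logic lives in the PR; the frame rates in the usage comment are placeholders, not the model's real values):

# Illustrative sketch: pick the nearest source frame for each target timestamp, so the
# same clip can be resampled once for each encoder's expected frame rate.
import torch

def resample_frames(frames: torch.Tensor, src_fps: float, dst_fps: float) -> torch.Tensor:
    # frames: [T, H, W, C]; returns frames re-timed to dst_fps by nearest-neighbour selection
    duration = frames.shape[0] / src_fps
    n_out = max(1, int(round(duration * dst_fps)))
    idx = torch.clamp((torch.arange(n_out) / dst_fps * src_fps).round().long(), 0, frames.shape[0] - 1)
    return frames[idx]

# e.g. one resample per encoder (placeholder rates):
# synchformer_frames = resample_frames(video_frames, src_fps=30.0, dst_fps=25.0)
# siglip_frames      = resample_frames(video_frames, src_fps=30.0, dst_fps=8.0)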

I have tried with tokenizer.json only, but AutoTokenizer fell back to the slow tokenizer format, which requires protobuf and raised an error in the process.
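For reference, one way to sidestep the slow-tokenizer/protobuf fallback is to build the fast tokenizer from the file directly instead of going through AutoTokenizer (a sketch; the local path is hypothetical, and whether this covers everything the CLAP text encoder needs is not verified here):

# Load tokenizer.json as a fast tokenizer directly, avoiding the slow
# (sentencepiece/protobuf) code path that AutoTokenizer can fall back to.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")  # hypothetical path
tokens = tokenizer("a dog barking in the rain", return_tensors="pt")
print(tokens["input_ids"].shape)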

The CLAP encoder has parts from ALIGN, Swin, and BERT. I reused the BERT classes from the comfy implementation.

I've attached the input and the workflow I used below:

_-.1.mp4

new_foly_workflow.json

@Kosinkadink (Collaborator)

If the node needs only one of the inputs but not both, then currently the best move would be to make two separate nodes.
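For illustration, a rough sketch of that split using the classic INPUT_TYPES-style node API (the PR itself uses the newer io schema; the class names, socket types, and return types below are hypothetical):

# Hypothetical sketch: one required input per node, rather than one node with two
# mutually exclusive optional inputs.
class EncodeVideoVAE:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"video": ("VIDEO",), "vae": ("VAE",)}}
    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"

    def encode(self, video, vae):
        ...  # VAE encoding path only

class EncodeVideoCLIP:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"video": ("VIDEO",), "clip_vision": ("CLIP_VISION",)}}
    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"

    def encode(self, video, clip_vision):
        ...  # CLIP-vision encoding path only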

@jovan2009

I am also interested in this implementation. I was searching yesterday for video-to-audio models for my own needs and found this Hunyuan-Foley model, which I used with this custom node: https://github.com/phazei/ComfyUI-HunyuanVideo-Foley.git. It works, but besides the model's safetensors it downloads some other files from Hugging Face into a temporary cache folder, which I later deleted by mistake and now have to download again. It would be nice if this PR avoids that.

@Kosinkadink (Collaborator) commented Nov 19, 2025

@yousef-rafat Could you also link all the models that are needed to run this? The first time I went to review this, it took over an hour to hunt down the models, so links would make things smoother, especially if I move between different computers.

Having a link -> comfy folder mapping for each model would be great!

@jovan2009

> @Kosinkadink The EncodeVideo node should have either the VAE or a CLIP vision model as an input, but not both. The two different resamples come from the source code: the Synchformer and SigLIP encoders use different frame rates.
>
> I have tried with tokenizer.json only, but AutoTokenizer fell back to the slow tokenizer format, which requires protobuf and raised an error in the process.
>
> The CLAP encoder has parts from ALIGN, Swin, and BERT. I reused the BERT classes from the comfy implementation.
>
> I've attached the input and the workflow I used below:
>
> _-.1.mp4
> new_foly_workflow.json

@yousef-rafat I played the video, but it seems to have no audio? For a video-to-audio model, the audio part is the most interesting.

@yousef-rafat (Contributor, Author)

@Kosinkadink This is the code for getting the checkpoints: one for the model and one for the SigLIP encoder:

# Download the main Hunyuan-Foley checkpoint from the Hugging Face Hub
!pip install huggingface_hub==0.34.0
from huggingface_hub import hf_hub_download

REPO_ID = "yousefg/comfyui-hunyuan-foley"
FILENAME = "merged_dac_model.pth"

try:
    file_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        repo_type="model",
        local_dir=".",
    )
    print("✅ Downloaded successfully to:", file_path)
except Exception as e:
    print("❌ Download failed:", e)

# Extract the SigLIP vision tower and prefix its keys with "vision_model." before saving
import torch
from transformers import AutoModel

siglip_sd = AutoModel.from_pretrained("google/siglip2-base-patch16-512").vision_model.state_dict()
new_sd = dict()
for k, v in siglip_sd.items():
    new_sd[f"vision_model.{k}"] = v
torch.save(new_sd, "siglip_model.pth")

@yousef-rafat (Contributor, Author)

@jovan2009 The model is made to generate audio given a video and a text input.

@jovan2009 commented Nov 19, 2025

> @jovan2009 The model is made to generate audio given a video and a text input.

I know what it does; I am already using it with the custom node I mentioned above. I don't like that custom node because it downloads files outside the ComfyUI models folder, files that my file-cleaning application deleted because it considered them "temporary". Looking at your previous comment, I come to the conclusion that your implementation does the same, more or less, as far as I can figure out; I am not a software developer.

Edit:
Not to mention that I already have the Google SigLIP model inside the ComfyUI folder. What I don't have is enough disk space to download it again and again for every model that uses it.

@yousef-rafat (Contributor, Author)

@jovan2009 This is a native implementation, so there is no automatic download of files like there may be in custom nodes. You download the main model checkpoint and put it in /checkpoints, and the SigLIP model in /clip_vision. No automatic or repeated downloads should happen.
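For reference, the resulting layout would look something like this (standard ComfyUI models directory assumed; the filenames come from the download snippet above):

ComfyUI/
└── models/
    ├── checkpoints/
    │   └── merged_dac_model.pth   # main Hunyuan-Foley checkpoint
    └── clip_vision/
        └── siglip_model.pth       # converted SigLIP vision weights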

@yousef-rafat (Contributor, Author)

@Kosinkadink I have split the node into Encode Video VAE and Encode Video CLIP.

