
Conversation

@yousef-rafat (Contributor) commented Sep 27, 2025

Also includes support for Synchformer and CLAP encoders, as well as three new video nodes.
(screenshot attached: 2025-11-20 180422)

@yousef-rafat marked this pull request as draft on September 28, 2025 at 16:37
@Kosinkadink added the Core (Core team dependency) label on Sep 30, 2025
@yousef-rafat marked this pull request as ready for review on October 13, 2025 at 18:11
@Kosinkadink (Collaborator) left a comment


Sorry for the delay on the review!

Comfy and I had trouble getting the model to run, so it would be great if you could provide links to the checkpoints/models plus an input video so we can replicate your results.

The main feedback here is:

  1. There are some odd nodes that were added: the Resample Video node returns an improper Video output, and the Video To Image node duplicates what the Get Video Components node already does (provides images, audio, and fps).
  2. The Encode Video node should probably be split in two; it is unclear whether only one of the optional inputs should be filled at a time before being plugged into HunyuanFoleyConditioning. It might be better for all of the video encoding to go into one node, but I did notice in the workflow image that two different resamples happen before being plugged into Encode Video. Is this some oddity from the source code?
  3. According to comfy, tokenizer.json alone is all you need; the repo the files are pulled from has both the original .json/.txt files and their merged versions. You can reference comfy/sd1_tokenizer to see what you'd actually need here.
  4. See if there can be more code reuse between the ClipTextModel and ClapTextModel code, if possible.

frames.append(arr)
next_time += step

return io.NodeOutput(torch.stack(frames))
Collaborator:


The output is declared as io.Video, but here it is not using any Video wrapper around the torch frames, so it breaks compatibility with the Video type.
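A minimal sketch of the fix being asked for, assuming the VideoFromComponents / VideoComponents helpers from the current comfy_api (the import paths and field names are assumptions; adjust them to the ComfyUI version being targeted):

# Wrap the resampled frames in a proper Video object instead of returning a bare tensor,
# so the output actually matches the node's declared io.Video type.
from fractions import Fraction
import torch
from comfy_api.latest import io                       # assumed import path for the io schema
from comfy_api.input_impl import VideoFromComponents  # assumed import path
from comfy_api.util import VideoComponents            # assumed import path

def resampled_video_output(frames: list[torch.Tensor], fps: float, audio=None) -> io.NodeOutput:
    images = torch.stack(frames)  # [T, H, W, C] image batch, same stacking as in the diff above
    video = VideoFromComponents(VideoComponents(images=images, audio=audio, frame_rate=Fraction(fps)))
    return io.NodeOutput(video)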

std[std < 1.0] = 1.0
audio /= std
return ({"waveform": audio, "sample_rate": 44100}, )
sample_rate = vae.first_stage_model.decode_sample_rate or 44100
Collaborator:


Comfy says there should be a different way of doing this; for maintainability, he'd prefer that this be turned into a function on the VAE's first_stage_model object instead of how it is handled here.
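Something along these lines is presumably what is meant; a rough sketch with hypothetical names (decode_to_audio is made up, decode_sample_rate is the attribute already referenced in the diff, and the std computation is only illustrative), not the actual comfy VAE API:

# Hypothetical sketch: keep the sample rate and output normalization inside the
# first_stage_model, so node code calls one method instead of poking at internals.
import torch

class FoleyFirstStageModel(torch.nn.Module):
    decode_sample_rate = 44100  # fallback value used in the PR

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError  # stands in for the model's existing latent -> waveform decode

    def decode_to_audio(self, latents: torch.Tensor) -> dict:
        audio = self.decode(latents)
        std = audio.std(dim=-1, keepdim=True)  # illustrative; the PR computes std elsewhere
        std[std < 1.0] = 1.0                   # avoid amplifying near-silent output
        audio = audio / std
        return {"waveform": audio, "sample_rate": self.decode_sample_rate}

# The node code would then reduce to something like:
#     return (vae.first_stage_model.decode_to_audio(latents),)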

embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim = 0)

x = siglip_encoding_1
positive_tensor = CpuLockedTensor(torch.cat([torch.zeros_like(embeds), text_encoding_negative]))
Collaborator:


The use of a custom tensor class should not be necessary. What was the reason for doing it this way? If there is some underlying issue, it should be handled in a different way.
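The kind of alternative the reviewer seems to be suggesting, as a sketch (variable names mirror the diff above; CpuLockedTensor is simply dropped):

# Build the conditioning with ordinary tensors and move them to CPU explicitly,
# instead of subclassing Tensor to pin them there; downstream code can move them
# to the right device when it actually needs them.
import torch

def build_conditioning(siglip_encoding: torch.Tensor,
                       synchformer_encoding: torch.Tensor,
                       text_encoding_negative: torch.Tensor) -> torch.Tensor:
    embeds = torch.cat([siglip_encoding.cpu(), synchformer_encoding.cpu()], dim=0)
    return torch.cat([torch.zeros_like(embeds), text_encoding_negative.cpu()], dim=0)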

@yousef-rafat (Contributor, Author)

@Kosinkadink The EncodeVideo node should have either the VAE or a CLIP vision model as an input, but not both. The two different resamples come from the source code: the Synchformer and SigLIP encoders use different frame rates.
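For context, a minimal sketch of the kind of nearest-frame resampling involved, run once per target rate (the actual node logic lives in the PR; the frame rates in the usage comment are placeholders, not the model's real values):

# Illustrative sketch: pick the nearest source frame for each target timestamp, so the
# same clip can be resampled once for each encoder's expected frame rate.
import torch

def resample_frames(frames: torch.Tensor, src_fps: float, dst_fps: float) -> torch.Tensor:
    # frames: [T, H, W, C]; returns frames re-timed to dst_fps by nearest-neighbour selection
    duration = frames.shape[0] / src_fps
    n_out = max(1, int(round(duration * dst_fps)))
    idx = torch.clamp((torch.arange(n_out) / dst_fps * src_fps).round().long(), 0, frames.shape[0] - 1)
    return frames[idx]

# e.g. one resample per encoder (placeholder rates):
# synchformer_frames = resample_frames(video_frames, src_fps=30.0, dst_fps=25.0)
# siglip_frames      = resample_frames(video_frames, src_fps=30.0, dst_fps=8.0)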

I have tried with tokenizer.json only, but AutoTokenizer fell back to the slow tokenizer format, which requires protobuf and raised an error in the process.
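For reference, one way to sidestep the slow-tokenizer/protobuf fallback is to build the fast tokenizer from the file directly instead of going through AutoTokenizer (a sketch; the local path is hypothetical, and whether this covers everything the CLAP text encoder needs is not verified here):

# Load tokenizer.json as a fast tokenizer directly, avoiding the slow
# (sentencepiece/protobuf) code path that AutoTokenizer can fall back to.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")  # hypothetical path
tokens = tokenizer("a dog barking in the rain", return_tensors="pt")
print(tokens["input_ids"].shape)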

The CLAP encoder has parts from ALIGN, Swin, and BERT. I reused the BERT classes from the comfy implementation.

I've attached the input and the workflow I used below:

_-.1.mp4

new_foly_workflow.json

@Kosinkadink (Collaborator)

If the node needs only one of the inputs but not both, then currently the best move would be to make two separate nodes.
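For illustration, a rough sketch of that split using the classic INPUT_TYPES-style node API (the PR itself uses the newer io schema; the class names, socket types, and return types below are hypothetical):

# Hypothetical sketch: one required input per node, rather than one node with two
# mutually exclusive optional inputs.
class EncodeVideoVAE:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"video": ("VIDEO",), "vae": ("VAE",)}}
    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"

    def encode(self, video, vae):
        ...  # VAE encoding path only

class EncodeVideoCLIP:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"video": ("VIDEO",), "clip_vision": ("CLIP_VISION",)}}
    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"

    def encode(self, video, clip_vision):
        ...  # CLIP-vision encoding path only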

@jovan2009

I am also interested in this implementation. I was searching yesterday for video-to-audio models for my own needs and found this Hunyuan-Foley model, which I used with this custom node: https://github.com/phazei/ComfyUI-HunyuanVideo-Foley.git. It works, but besides the model's safetensors it downloads some other files from Hugging Face into a temporary cache folder, which I later deleted by mistake and now have to download again. It would be nice if this PR avoids that.

@Kosinkadink (Collaborator) commented Nov 19, 2025

@yousef-rafat Could you also link all the models that are needed to run this? The first time I went to review this, it took over an hour to hunt down the models, so links would make things smoother, especially if I move between different computers.

Having a link -> comfy folder mapping for each model would be great!

@jovan2009

> @Kosinkadink The EncodeVideo node should have either the VAE or a CLIP vision model as an input, but not both. The two different resamples come from the source code: the Synchformer and SigLIP encoders use different frame rates.
>
> I have tried with tokenizer.json only, but AutoTokenizer fell back to the slow tokenizer format, which requires protobuf and raised an error in the process.
>
> The CLAP encoder has parts from ALIGN, Swin, and BERT. I reused the BERT classes from the comfy implementation.
>
> I've attached the input and the workflow I used below:
>
> _-.1.mp4
> new_foly_workflow.json

@yousef-rafat I played the video, but it seems to have no audio? For a video-to-audio model, the audio part is the most interesting.

@yousef-rafat (Contributor, Author)

@Kosinkadink This is the code for getting the checkpoints: one for the model and one for the SigLIP encoder:

# Download the main Hunyuan-Foley checkpoint from the Hugging Face Hub
!pip install huggingface_hub==0.34.0
from huggingface_hub import hf_hub_download

REPO_ID = "yousefg/comfyui-hunyuan-foley"
FILENAME = "merged_dac_model.pth"

try:
    file_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        repo_type="model",
        local_dir=".",
    )
    print("✅ Downloaded successfully to:", file_path)
except Exception as e:
    print("❌ Download failed:", e)

# Extract the SigLIP vision tower and prefix its keys with "vision_model." before saving
import torch
from transformers import AutoModel

siglip_sd = AutoModel.from_pretrained("google/siglip2-base-patch16-512").vision_model.state_dict()
new_sd = dict()
for k, v in siglip_sd.items():
    new_sd[f"vision_model.{k}"] = v
torch.save(new_sd, "siglip_model.pth")

@yousef-rafat (Contributor, Author)

@jovan2009 The model is made to generate audio given a video and a text input.

@jovan2009 commented Nov 19, 2025

> @jovan2009 The model is made to generate audio given a video and a text input.

I know what it does; I am already using it with the custom node I mentioned above. I don't like that custom node because it downloads files outside the ComfyUI models folder, files that my file-cleaning application deleted because it considered them "temporary". Looking at your previous comment, I come to the conclusion that your implementation does the same, more or less, as far as I can figure out; I am not a software developer.

Edit:
Not to mention that I already have the Google SigLIP model inside the ComfyUI folder. What I don't have is enough disk space to download it again and again for every model that uses it.

@yousef-rafat (Contributor, Author)

@jovan2009 This is a native implementation, so there is no automatic download of files like there may be in custom nodes. You download the main model checkpoint and put it in /checkpoints, and the SigLIP model in /clip_vision. No automatic or repeated downloads should happen.
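For reference, the resulting layout would look something like this (standard ComfyUI models directory assumed; the filenames come from the download snippet above):

ComfyUI/
└── models/
    ├── checkpoints/
    │   └── merged_dac_model.pth   # main Hunyuan-Foley checkpoint
    └── clip_vision/
        └── siglip_model.pth       # converted SigLIP vision weights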

@yousef-rafat (Contributor, Author)

@Kosinkadink I have split the node into Encode Video VAE and Encode Video CLIP.

