This repository was archived by the owner on Jul 9, 2025. It is now read-only.

Description
Hello,
I have a problem where I need to match entities with mixed data (text, numerical values, and images), with multiple image inputs per entity.
I was wondering whether it is possible to use a custom architecture to create the representation, so that I can plug in a multi-input multimodal architecture.
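To make the question concrete, here is a minimal sketch of the kind of representation I mean. It is plain Python and does not use this library's API. All names (`Entity`, `encode_text`, `pool_images`, `encode_entity`) are hypothetical. The character-hashing text encoder is only a toy stand-in for a real model, and each image is assumed to be already reduced to a fixed-length feature vector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Hypothetical mixed-data record: one text field, numeric fields, several images."""
    text: str
    numbers: List[float]
    images: List[List[float]]  # assumption: one precomputed feature vector per image


def encode_text(text: str, dim: int = 4) -> List[float]:
    # Toy stand-in for a text encoder: hash each character into one of `dim` buckets.
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def pool_images(images: List[List[float]], dim: int) -> List[float]:
    # Mean-pool a variable number of image vectors into one fixed-size vector.
    if not images:
        return [0.0] * dim
    return [sum(img[i] for img in images) / len(images) for i in range(dim)]


def encode_entity(entity: Entity, text_dim: int = 4, image_dim: int = 2) -> List[float]:
    # Concatenate the per-modality vectors into a single entity representation.
    return (
        encode_text(entity.text, text_dim)
        + list(entity.numbers)
        + pool_images(entity.images, image_dim)
    )


example = Entity(text="ab", numbers=[1.5], images=[[1.0, 3.0], [3.0, 5.0]])
representation = encode_entity(example)
```

The point is that the number of images varies per entity, so the encoder has to pool them before the matching step sees a fixed-size vector. In practice each of these functions would be a learned sub-network rather than a hand-written one.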
Thank you