Support text/image multi-modal embeddings (starting with OpenCLIP)

**Describe the feature**
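- Support OpenCLIP embeddings so that text and images share a single vector embedding space, and support cross-modal search over that space (e.g. retrieving images with a text query, or text with an image query)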
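
A rough sketch of what cross-modal search over a shared embedding space could look like, to make the request concrete. Everything named here (`MultiModalEmbedder`, `ToyEmbedder`, `search_images_by_text`, the file names and vectors) is hypothetical and not an existing API of this project. The toy embedder returns hand-written vectors; a real implementation would presumably delegate to an OpenCLIP model's text and image encoders instead.

```python
import math
from typing import Protocol, Sequence

Vector = list[float]


class MultiModalEmbedder(Protocol):
    """Hypothetical interface: both methods return vectors in the SAME space."""

    def embed_text(self, text: str) -> Vector: ...

    def embed_image(self, image_path: str) -> Vector: ...


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class ToyEmbedder:
    """Stand-in for a real model: hand-written 3-d vectors, for illustration only."""

    _TEXT = {
        "a photo of a dog": [0.9, 0.1, 0.0],
        "a photo of a cat": [0.1, 0.9, 0.0],
    }
    _IMAGES = {
        "dog.jpg": [0.8, 0.2, 0.1],
        "cat.jpg": [0.2, 0.8, 0.1],
        "car.jpg": [0.0, 0.1, 0.9],
    }

    def embed_text(self, text: str) -> Vector:
        return self._TEXT[text]

    def embed_image(self, image_path: str) -> Vector:
        return self._IMAGES[image_path]


def search_images_by_text(
    embedder: MultiModalEmbedder,
    query: str,
    image_paths: Sequence[str],
    top_k: int = 1,
) -> list[str]:
    """Rank images by similarity to a text query in the shared space."""
    query_vec = embedder.embed_text(query)
    scored = [
        (cosine_similarity(query_vec, embedder.embed_image(path)), path)
        for path in image_paths
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored[:top_k]]


if __name__ == "__main__":
    images = ["cat.jpg", "dog.jpg", "car.jpg"]
    print(search_images_by_text(ToyEmbedder(), "a photo of a dog", images, top_k=2))
```

The point of the interface is that text and image vectors are directly comparable, so the same index and similarity function can serve text-to-image, image-to-text, and image-to-image queries.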

**Motivation and use case**
- Multi-modal search: query images with text (and vice versa) using a single shared embedding space

**Additional context**
- to be filled