This week's project focuses on building a search engine that processes queries and returns a ranked list of relevant documents. The project uses the Two Tower architecture, with the model trained on the Microsoft Machine Reading Comprehension (MS MARCO) dataset.
The goal is to implement a search engine using a deep learning approach. The architecture will encode queries and documents separately, compare their representations, and rank documents based on relevance.
We will use the MS MARCO dataset, a collection of datasets designed for deep learning tasks related to search. You can find more details about the dataset on its official website and download version 1.1 from HuggingFace.
- Download the dataset from MS MARCO.
- Split the dataset into training, validation, and test sets.
- Extract queries and documents from the dataset.
- Generate triples of:
  - Queries
  - Relevant (positive) documents
  - Irrelevant (negative) documents (selected via negative sampling).
- Tokenize the data for input into the model (a sketch of this pipeline follows this list).
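As a concrete starting point, the sketch below shows how these steps might be wired together with the HuggingFace Datasets library. It assumes the dataset is published as `microsoft/ms_marco` with a `v1.1` configuration and that each example exposes a `query` string plus a `passages` dict containing `passage_text` and `is_selected` lists; verify the field names against the actual schema before relying on them.

```python
# Minimal sketch of triple generation; field names ("query", "passages",
# "passage_text", "is_selected") are assumptions about the MS MARCO schema.
import random
from datasets import load_dataset

dataset = load_dataset("microsoft/ms_marco", "v1.1", split="train")

def build_triples(examples, num_negatives=1):
    """Return (query, positive passage, negative passage) triples."""
    # Pool of all passages, used for naive random negative sampling.
    passage_pool = []
    for passages in examples["passages"]:
        passage_pool.extend(passages["passage_text"])

    triples = []
    for query, passages in zip(examples["query"], examples["passages"]):
        positives = [
            text
            for text, selected in zip(passages["passage_text"], passages["is_selected"])
            if selected
        ]
        for positive in positives:
            for _ in range(num_negatives):
                negative = random.choice(passage_pool)  # naive negative sampling
                triples.append((query, positive, negative))
    return triples

# Build triples from a small slice as a quick sanity check.
triples = build_triples(dataset[:1000])
print(len(triples), triples[0][0])
```

Random sampling from the full passage pool is the simplest form of negative sampling; harder negatives (for example, BM25-retrieved but unselected passages) usually improve ranking quality.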
The Two Tower architecture consists of two parallel encoders, one for queries and one for documents, designed so that query and document representations can be compared effectively. A minimal code sketch follows the component list below.
- Token Embedding Layer:
  - Pretrained word embeddings (e.g., Word2Vec).
  - Optionally fine-tuned or frozen during training.
- Encoding Layers:
  - Two separate RNN-based layers:
    - One for encoding queries.
    - One for encoding documents (both positive and negative).
  - RNNs are used to preserve sequential information, such as word order.
- Distance Function:
  - Measures similarity between query and document encodings.
  - Higher similarity indicates a more relevant document.
- Triplet Loss Function:
  - Trains the model to minimize the distance between queries and relevant documents.
  - Maximizes the distance between queries and irrelevant documents.
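A minimal PyTorch sketch of these components is shown below. The class and hyperparameter names (`TwoTowerModel`, `embedding_dim`, `hidden_dim`, GRU encoders, cosine distance) are illustrative assumptions rather than fixed requirements; pretrained Word2Vec vectors (e.g., loaded via gensim) can be copied into the embedding layer and frozen.

```python
# Sketch of the Two Tower model; names and dimensions are illustrative.
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256,
                 pretrained_embeddings=None, freeze_embeddings=True):
        super().__init__()
        # Shared token embedding layer, optionally initialised from
        # pretrained Word2Vec vectors and frozen during training.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(pretrained_embeddings)
            self.embedding.weight.requires_grad = not freeze_embeddings

        # Two separate RNN encoders: one tower for queries, one for documents.
        self.query_rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.doc_rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def encode_query(self, query_ids):
        _, hidden = self.query_rnn(self.embedding(query_ids))
        return hidden[-1]  # (batch, hidden_dim)

    def encode_doc(self, doc_ids):
        _, hidden = self.doc_rnn(self.embedding(doc_ids))
        return hidden[-1]  # (batch, hidden_dim)

    def forward(self, query_ids, pos_ids, neg_ids):
        return (self.encode_query(query_ids),
                self.encode_doc(pos_ids),
                self.encode_doc(neg_ids))

def cosine_distance(a, b):
    """Distance function: 1 - cosine similarity, so smaller means more relevant."""
    return 1.0 - nn.functional.cosine_similarity(a, b, dim=-1)
```

Using `1 - cosine similarity` as the distance keeps the triplet loss bounded and scale-invariant; a dot product or Euclidean distance would fit the same interface.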
- Input: Tokenized triples of queries, positive documents, and negative documents.
- Forward Pass:
  - Encode queries and documents using the Two Tower architecture.
  - Compute similarity scores using the distance function.
- Loss Calculation:
  - Use the triplet loss function to optimize the model.
- Backpropagation:
  - Update model weights to improve performance (a training loop sketch follows this list).
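The loop below sketches one training epoch under the assumptions above: a margin-based triplet loss over the cosine distance defined in the model sketch, and a `DataLoader` yielding padded token-ID tensors for the query, positive document, and negative document. Treat it as a starting point rather than a definitive implementation.

```python
# One training epoch with a margin-based triplet loss; `cosine_distance`
# and `model` come from the Two Tower sketch above.
import torch

def train_epoch(model, loader, optimizer, margin=0.3, device="cpu"):
    model.train()
    for query_ids, pos_ids, neg_ids in loader:
        query_ids, pos_ids, neg_ids = (
            query_ids.to(device), pos_ids.to(device), neg_ids.to(device))

        # Forward pass: encode the query and both documents.
        q, pos, neg = model(query_ids, pos_ids, neg_ids)

        # Triplet loss: pull relevant documents closer than irrelevant ones
        # by at least `margin`.
        loss = torch.clamp(
            cosine_distance(q, pos) - cosine_distance(q, neg) + margin,
            min=0.0,
        ).mean()

        # Backpropagation and weight update.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```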
- Pre-cache Document Encodings:
  - Encode and store representations of all documents in advance.
- Query Encoding:
  - Encode the incoming query using the query encoder.
- Similarity Scoring:
  - Use the distance function to compute similarity between the query and all document encodings.
- Ranking:
  - Rank documents based on similarity scores.
  - Return the top 5 most relevant documents (see the inference sketch below).
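A sketch of this inference flow is below; `tokenize` (text to a tensor of token IDs) and the trained `model` are assumed from the earlier steps, and the document encodings are computed once and reused across queries.

```python
# Pre-cache document encodings, then score and rank documents for a query.
import torch

@torch.no_grad()
def precompute_doc_encodings(model, documents, tokenize):
    """Encode and store representations of all documents in advance."""
    model.eval()
    return torch.stack([
        model.encode_doc(tokenize(doc).unsqueeze(0)).squeeze(0)
        for doc in documents
    ])  # (num_docs, hidden_dim)

@torch.no_grad()
def search(model, query, doc_encodings, documents, tokenize, top_k=5):
    """Return the top_k most relevant documents with their similarity scores."""
    model.eval()
    q = model.encode_query(tokenize(query).unsqueeze(0))               # (1, hidden_dim)
    scores = torch.nn.functional.cosine_similarity(q, doc_encodings)   # (num_docs,)
    top = torch.topk(scores, k=min(top_k, len(documents))).indices
    return [(documents[i], scores[i].item()) for i in top]
```

For example, `search(model, "your search query", doc_encodings, documents, tokenize)` returns the five highest-scoring passages with their scores. For large collections, an approximate nearest-neighbour index (e.g., FAISS) can replace this brute-force scoring.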
Below is a visual representation of the target model architecture:
- Python 3.8+
- TensorFlow or PyTorch
- HuggingFace Datasets library
- Pretrained Word2Vec embeddings
- Clone the repository:
git clone https://github.com/your-repo/ml-institute-week-2-two-towers.git
cd ml-institute-week-2-two-towers
- Install dependencies:
pip install -r requirements.txt
- Prepare the dataset:
python prepare_dataset.py
- Train the model:
python train.py
- Run inference:
python inference.py --query "Your search query here"
This project is licensed under the MIT License. See the LICENSE file for details.
