This week's project focuses on building a search engine that processes queries and returns a ranked list of relevant documents. The project uses the Two Tower architecture, with the model trained on the Microsoft Machine Reading Comprehension (MS MARCO) dataset.
The goal is to implement a search engine using a deep learning approach. The architecture will encode queries and documents separately, compare their representations, and rank documents based on relevance.
We will use the MS MARCO dataset, a collection of datasets designed for deep learning tasks related to search. You can find more details about the dataset on its official website and download version 1.1 from HuggingFace.
- Download the dataset from MS MARCO.
- Split the dataset into training, validation, and test sets.
- Extract queries and documents from the dataset.
- Generate triples of:
  - Queries
  - Relevant (positive) documents
  - Irrelevant (negative) documents (selected via negative sampling).
- Tokenize the data for input into the model (a sketch of this pipeline follows this list).
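As a concrete starting point, the sketch below shows how these steps might be wired together with the HuggingFace Datasets library. It assumes the dataset is published as `microsoft/ms_marco` with a `v1.1` configuration and that each example exposes a `query` string plus a `passages` dict containing `passage_text` and `is_selected` lists; verify the field names against the actual schema before relying on them.

```python
# Minimal sketch of triple generation; field names ("query", "passages",
# "passage_text", "is_selected") are assumptions about the MS MARCO schema.
import random
from datasets import load_dataset

dataset = load_dataset("microsoft/ms_marco", "v1.1", split="train")

def build_triples(examples, num_negatives=1):
    """Return (query, positive passage, negative passage) triples."""
    # Pool of all passages, used for naive random negative sampling.
    passage_pool = []
    for passages in examples["passages"]:
        passage_pool.extend(passages["passage_text"])

    triples = []
    for query, passages in zip(examples["query"], examples["passages"]):
        positives = [
            text
            for text, selected in zip(passages["passage_text"], passages["is_selected"])
            if selected
        ]
        for positive in positives:
            for _ in range(num_negatives):
                negative = random.choice(passage_pool)  # naive negative sampling
                triples.append((query, positive, negative))
    return triples

# Build triples from a small slice as a quick sanity check.
triples = build_triples(dataset[:1000])
print(len(triples), triples[0][0])
```

Random sampling from the full passage pool is the simplest form of negative sampling; harder negatives (for example, BM25-retrieved but unselected passages) usually improve ranking quality.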
The Two Tower architecture consists of two parallel encoders, one for queries and one for documents, designed so that query and document representations can be compared effectively. A minimal code sketch follows the component list below.
- Token Embedding Layer:
  - Pretrained word embeddings (e.g., Word2Vec).
  - Optionally fine-tuned or frozen during training.
- Encoding Layers:
  - Two separate RNN-based layers:
    - One for encoding queries.
    - One for encoding documents (both positive and negative).
  - RNNs are used to preserve sequential information, such as word order.
- Distance Function:
  - Measures similarity between query and document encodings.
  - Higher similarity indicates a more relevant document.
- Triplet Loss Function:
  - Trains the model to minimize the distance between queries and relevant documents.
  - Maximizes the distance between queries and irrelevant documents.
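A minimal PyTorch sketch of these components is shown below. The class and hyperparameter names (`TwoTowerModel`, `embedding_dim`, `hidden_dim`, GRU encoders, cosine distance) are illustrative assumptions rather than fixed requirements; pretrained Word2Vec vectors (e.g., loaded via gensim) can be copied into the embedding layer and frozen.

```python
# Sketch of the Two Tower model; names and dimensions are illustrative.
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=256,
                 pretrained_embeddings=None, freeze_embeddings=True):
        super().__init__()
        # Shared token embedding layer, optionally initialised from
        # pretrained Word2Vec vectors and frozen during training.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if pretrained_embeddings is not None:
            self.embedding.weight.data.copy_(pretrained_embeddings)
            self.embedding.weight.requires_grad = not freeze_embeddings

        # Two separate RNN encoders: one tower for queries, one for documents.
        self.query_rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.doc_rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def encode_query(self, query_ids):
        _, hidden = self.query_rnn(self.embedding(query_ids))
        return hidden[-1]  # (batch, hidden_dim)

    def encode_doc(self, doc_ids):
        _, hidden = self.doc_rnn(self.embedding(doc_ids))
        return hidden[-1]  # (batch, hidden_dim)

    def forward(self, query_ids, pos_ids, neg_ids):
        return (self.encode_query(query_ids),
                self.encode_doc(pos_ids),
                self.encode_doc(neg_ids))

def cosine_distance(a, b):
    """Distance function: 1 - cosine similarity, so smaller means more relevant."""
    return 1.0 - nn.functional.cosine_similarity(a, b, dim=-1)
```

Using `1 - cosine similarity` as the distance keeps the triplet loss bounded and scale-invariant; a dot product or Euclidean distance would fit the same interface.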
- Input: Tokenized triples of queries, positive documents, and negative documents.
- Forward Pass:
  - Encode queries and documents using the Two Tower architecture.
  - Compute similarity scores using the distance function.
- Loss Calculation:
  - Use the triplet loss function to optimize the model.
- Backpropagation:
  - Update model weights to improve performance (a training loop sketch follows this list).
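The loop below sketches one training epoch under the assumptions above: a margin-based triplet loss over the cosine distance defined in the model sketch, and a `DataLoader` yielding padded token-ID tensors for the query, positive document, and negative document. Treat it as a starting point rather than a definitive implementation.

```python
# One training epoch with a margin-based triplet loss; `cosine_distance`
# and `model` come from the Two Tower sketch above.
import torch

def train_epoch(model, loader, optimizer, margin=0.3, device="cpu"):
    model.train()
    for query_ids, pos_ids, neg_ids in loader:
        query_ids, pos_ids, neg_ids = (
            query_ids.to(device), pos_ids.to(device), neg_ids.to(device))

        # Forward pass: encode the query and both documents.
        q, pos, neg = model(query_ids, pos_ids, neg_ids)

        # Triplet loss: pull relevant documents closer than irrelevant ones
        # by at least `margin`.
        loss = torch.clamp(
            cosine_distance(q, pos) - cosine_distance(q, neg) + margin,
            min=0.0,
        ).mean()

        # Backpropagation and weight update.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```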
- Pre-cache Document Encodings:
  - Encode and store representations of all documents in advance.
- Query Encoding:
  - Encode the incoming query using the query encoder.
- Similarity Scoring:
  - Use the distance function to compute similarity between the query and all document encodings.
- Ranking:
  - Rank documents based on similarity scores.
  - Return the top 5 most relevant documents (see the inference sketch below).
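A sketch of this inference flow is below; `tokenize` (text to a tensor of token IDs) and the trained `model` are assumed from the earlier steps, and the document encodings are computed once and reused across queries.

```python
# Pre-cache document encodings, then score and rank documents for a query.
import torch

@torch.no_grad()
def precompute_doc_encodings(model, documents, tokenize):
    """Encode and store representations of all documents in advance."""
    model.eval()
    return torch.stack([
        model.encode_doc(tokenize(doc).unsqueeze(0)).squeeze(0)
        for doc in documents
    ])  # (num_docs, hidden_dim)

@torch.no_grad()
def search(model, query, doc_encodings, documents, tokenize, top_k=5):
    """Return the top_k most relevant documents with their similarity scores."""
    model.eval()
    q = model.encode_query(tokenize(query).unsqueeze(0))               # (1, hidden_dim)
    scores = torch.nn.functional.cosine_similarity(q, doc_encodings)   # (num_docs,)
    top = torch.topk(scores, k=min(top_k, len(documents))).indices
    return [(documents[i], scores[i].item()) for i in top]
```

For example, `search(model, "your search query", doc_encodings, documents, tokenize)` returns the five highest-scoring passages with their scores. For large collections, an approximate nearest-neighbour index (e.g., FAISS) can replace this brute-force scoring.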
Below is a visual representation of the target model architecture:
- Python 3.8+
- TensorFlow or PyTorch
- HuggingFace Datasets library
- Pretrained Word2Vec embeddings
- Clone the repository:
git clone https://github.com/your-repo/ml-institute-week-2-two-towers.git
cd ml-institute-week-2-two-towers
- Install dependencies:
pip install -r requirements.txt
- Prepare the dataset:
python prepare_dataset.py
- Train the model:
python train.py
- Run inference:
python inference.py --query "Your search query here"
This project is licensed under the MIT License. See the LICENSE file for details.
