#conf-2024-workshop-ai-recommendation-engine
git clone https://github.com/jcoyne/code4lib-2024-ai-workshop.git
Using a pretrained model to generate embedding vectors and perform a similarity search
We will select a model that can transform a sentence or short snippet of text into a vector that represents the semantics of the text. This vector is called an embedding. We can use these embeddings to compare semantic meanings.
Sentence Transformers is based on the paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (https://arxiv.org/abs/1908.10084).
Hugging Face is the GitHub of AI things. 🤗 https://huggingface.co/
The MTEB leaderboard compares embedding models: https://huggingface.co/blog/mteb
The tokenizer breaks text down into pieces called tokens and assigns each token a unique id number.
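To see tokenization in action, one option (purely illustrative, not necessarily the workshop's tokenizer) is a Hugging Face tokenizer; bert-base-uncased here is an assumed model choice:

```python
from transformers import AutoTokenizer

# Assumption: bert-base-uncased is just an illustrative model choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Embeddings capture semantics."
tokens = tokenizer.tokenize(text)              # word pieces; rare words split into several tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # each token's unique id number
print(tokens)
print(ids)
```

Different models ship different tokenizers, so the same text can split into different tokens (and different counts) depending on the model.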
https://platform.openai.com/tokenizer
pgvector (https://github.com/pgvector/pgvector) is a Postgres extension that adds a vector column type and similarity-search operators.
CREATE EXTENSION vector;
CREATE TABLE courses (
  id SERIAL PRIMARY KEY,
  title varchar(200),
  description text,
  embedding vector(384)
);
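Once embeddings are stored, pgvector can order rows by distance to a query vector. A sketch of a nearest-neighbor query (`<->` is Euclidean distance; pgvector also provides `<#>` for negative inner product and `<=>` for cosine distance):

```sql
-- Find the 5 courses whose embeddings are closest to a query vector.
-- The literal below is a placeholder; a real query vector has 384 numbers.
SELECT id, title
FROM courses
ORDER BY embedding <-> '[0.1, 0.2, ...]'
LIMIT 5;
```

In practice the query vector is produced by running the user's search text through the same embedding model used to populate the table.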
https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
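The linked article covers several similarity measures; cosine similarity is a common one for embeddings. A small self-contained sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 0]))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Cosine similarity depends only on direction, not magnitude, which is why embeddings are often normalized before comparison.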
Problem: Comparing vectors is expensive, especially as the number of vectors grows.
Solution: Index with Approximate Nearest Neighbor (ANN) search. The trade-off is that an approximate index may return slightly different results than an exact search.
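With pgvector, an ANN index can be created on the embedding column. A sketch (HNSW support requires pgvector 0.5.0 or later; IVFFlat is the older alternative):

```sql
-- Approximate index for cosine-distance queries (the <=> operator).
CREATE INDEX ON courses USING hnsw (embedding vector_cosine_ops);
```

Queries that order by the matching distance operator can then use the index instead of scanning every row.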
Problem: Comparing vectors is expensive, especially as the dimensionality of vectors grows.
Solution: Opt for a smaller embedding dimension.
Problem: Embedding models have a limit on the number of input tokens; longer text gets truncated.
Solution: Split long text into chunks and store multiple embeddings per item.
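A minimal sketch of splitting a long description into overlapping chunks so each stays under a size budget. This uses word count as a stand-in for token count; a real implementation would measure length with the embedding model's own tokenizer.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping chunks of at most max_words words.

    Word count only approximates token count; real code should use
    the embedding model's tokenizer to measure length.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Each chunk gets its own embedding; at query time, an item matches if any
# of its chunk embeddings is close to the query embedding.
long_text = "word " * 450
print(len(chunk_words(long_text)))  # 3
```

The overlap keeps a sentence that straddles a chunk boundary from being split across two embeddings with neither capturing it well.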