Build a recommendation engine using AI

Welcome

Is there anyone without a computer?

Skill check?

Python?

SQL Query?

Maths?

Pairing?

#conf-2024-workshop-ai-recommendation-engine

Dependencies

git clone https://github.com/jcoyne/code4lib-2024-ai-workshop.git

Justin Coyne

Stanford University Library

Who are you?

Have you used AI tools like ChatGPT or GitHub Copilot before?

What was good? What didn't work?

What is AI?

Make computers learn, reason, and act like the human brain?

The problem?

  • Human brains are actually not that much like computers as far as we can tell
  • Humans have biases and so will the models they train
  • Computers cannot be held accountable for mistakes
  • Sometimes they hallucinate
  • Not everyone wants their content used to train AIs

What are we going to do today?

Use a pretrained model to generate embedding vectors, then run a similarity search over them.

Whazzat?

We will select a model that can transform a sentence or short snippet of text into a vector that represents the semantics of the text. This vector is called an embedding. We can use these embeddings to compare semantic meanings.

Figure: a two-dimensional plot that illustrates how concepts have a vector representation

The model

Sentence Transformers, based on the Sentence-BERT paper: https://arxiv.org/abs/1908.10084

Hugging Face

The GitHub of AI things. 🤗

https://huggingface.co/

Massive Text Embedding Benchmark (MTEB)

https://huggingface.co/blog/mteb

  • Embedding dimensions - How big is the resulting vector?
  • Max Tokens - How much input can it accept?

Examine a model

sentence-transformers/all-MiniLM-L6-v2

Tokenizer

Breaks a sentence down into pieces called tokens and maps each token to an ID number from the model's vocabulary.

https://platform.openai.com/tokenizer
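
To make the idea concrete, here is a toy word-level tokenizer. It is only a sketch: real models such as the one above use subword tokenizers with a fixed, pretrained vocabulary, and the names `build_vocab`, `tokenize`, and the `[UNK]` id are assumptions made for this illustration.

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, or single punctuation marks


def build_vocab(corpus):
    """Give every distinct lowercase token an integer id; id 0 means 'unknown'."""
    vocab = {"[UNK]": 0}
    for text in corpus:
        for token in re.findall(TOKEN_PATTERN, text.lower()):
            vocab.setdefault(token, len(vocab))
    return vocab


def tokenize(text, vocab):
    """Split text into tokens and look up the id of each one."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


vocab = build_vocab(["Intro to Python.", "Intro to SQL."])
tokens, ids = tokenize("Intro to Maths.", vocab)
```

Here "maths" is not in the toy vocabulary, so it falls back to the unknown id. Subword tokenizers avoid most unknowns by splitting rare words into smaller known pieces.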

Using Pgvector

https://github.com/pgvector/pgvector

Pgvector is an extension for PostgreSQL that adds a vector column type and several similarity search operators.

CREATE EXTENSION vector;

-- 384 matches the output dimension of all-MiniLM-L6-v2
CREATE TABLE courses(
    id          SERIAL PRIMARY KEY,
    title       varchar(200),
    description text,
    embedding   vector(384));
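
A sketch of how the table above might be filled and queried from Python. It assumes a driver that uses `%s` placeholders (psycopg style); the helper names and query constants are hypothetical. The `[x,y,z]` text form and the `<=>` cosine-distance operator come from pgvector itself.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Hypothetical query templates for the courses table defined above.
INSERT_COURSE = (
    "INSERT INTO courses (title, description, embedding) "
    "VALUES (%s, %s, %s::vector)"
)
# <=> is cosine distance, so ascending order returns the most similar rows first.
NEAREST_COURSES = (
    "SELECT id, title FROM courses "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)


def nearest_courses_params(query_embedding, limit=5):
    """Build the parameter tuple that goes with NEAREST_COURSES."""
    return (to_pgvector_literal(query_embedding), limit)
```

With an open cursor, usage would look like `cursor.execute(NEAREST_COURSES, nearest_courses_params(query_vector))`, where `query_vector` is the embedding of the search text.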
					

Produce an embedding

https://github.com/jcoyne/code4lib-2024-ai-workshop/tree/explore-embeddings
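
The branch above holds the workshop's own code. Separately from it, here is a minimal sketch of the step, assuming the `sentence-transformers` package is installed and the model weights can be downloaded. The function names `load_encoder` and `embed` are made up for this example.

```python
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def load_encoder(model_name=MODEL_NAME):
    """Load the pretrained model (weights are downloaded on first use)."""
    # Third-party import kept local so the helper below has no hard dependency.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed(texts, encoder):
    """Return one plain list of floats per input text.

    `encoder` is anything with an encode(list_of_strings) method that
    returns one vector per string, such as a SentenceTransformer.
    """
    return [[float(x) for x in vector] for vector in encoder.encode(texts)]
```

Usage would be along the lines of `embed(["Introduction to cataloging"], load_encoder())`. With this model each returned vector has 384 numbers, which is why the table column is declared as `vector(384)`.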

Comparing vectors

https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
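
Two common ways to compare vectors, sketched in plain Python. The three-dimensional vectors are made up for illustration; real embeddings from the model above have 384 dimensions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: closer to 1 means more similar."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Invented toy "embeddings", not output from any model.
cat = [0.9, 0.8, 0.1]
kitten = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

cat_vs_kitten = cosine_similarity(cat, kitten)
cat_vs_car = cosine_similarity(cat, car)
```

With these toy values `cat_vs_kitten` is larger than `cat_vs_car`, which is the behaviour we want from real embeddings of related and unrelated text.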

Problem: Comparing vectors is expensive, especially as the number of vectors grows.

Solution: Index the vectors and use Approximate Nearest Neighbor search. The trade-off is that results are approximate, so a query can return different results than an exact search would.
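
A toy illustration of why an approximate index can disagree with an exact search. It is loosely modelled on inverted-file indexes (pgvector's `ivfflat` is one): vectors are grouped into cells around centroids, and a query scans only the closest cell. The centroids, points, and function names are all invented for this sketch.

```python
import math


def nearest(query, items):
    """Exact search: compare the query with every (name, vector) pair."""
    return min(items, key=lambda item: math.dist(query, item[1]))[0]


def closest_centroid(vector, centroids):
    """Index of the centroid nearest to the given vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def build_cells(items, centroids):
    """Assign each item to the cell of its closest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for item in items:
        cells[closest_centroid(item[1], centroids)].append(item)
    return cells


def approximate_nearest(query, cells, centroids):
    """Approximate search: scan only the cell whose centroid is closest."""
    return nearest(query, cells[closest_centroid(query, centroids)])


centroids = [(0.0, 0.0), (10.0, 0.0)]
items = [("a", (4.9, 0.0)), ("b", (1.0, 0.0)), ("c", (5.4, 0.0)), ("d", (9.0, 0.0))]
cells = build_cells(items, centroids)

query = (5.1, 0.0)
exact = nearest(query, items)
approx = approximate_nearest(query, cells, centroids)
```

The true nearest neighbour of the query is "a", but "a" sits just across the cell boundary, so the approximate search scans the other cell and returns "c". Real indexes reduce this effect by probing more than one cell, at the cost of speed.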

Problem: Comparing vectors is expensive, especially as the dimensionality of vectors grows.

Solution: Opt for a smaller embedding dimension.

Problem: Embedding models have a limit on the number of tokens they accept as input.

Solution: Split long items into chunks and store multiple embeddings per item.
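
One way to get multiple embeddings per item is to split the token list into windows that fit the model's limit, with a little overlap so text at a boundary keeps some context. This is a sketch only; `chunk_tokens` and its parameters are hypothetical names.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split tokens into windows of at most max_tokens items.

    Consecutive windows share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Ten stand-in tokens, windows of four, one token shared between neighbours.
chunks = chunk_tokens(list(range(10)), max_tokens=4, overlap=1)
```

Each chunk would then be embedded on its own and stored as a separate row that points back to the same item.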

Next Steps

  • Make a CLI or webapp for this
  • Put the embeddings in a search engine like Solr
  • Use another model from the Hugging Face APIs

Questions?