In the span of a few weeks in March, I built three features that would normally take months of effort by multiple engineers: search, categorization, and personalization. Big problems, shared by many products, all solved easily using a foundational piece of AI technology: embeddings.
The magic of embeddings comes from the fact that they're easy to generate and can be efficiently compared to each other to find those that are similar in meaning. That makes them ideal for semantic search, i.e. search based on meaning, as opposed to traditional search, which relies on the co-occurrence of specific words between the query and the target text. Semantic search surfaces non-obvious but highly relevant matches that traditional search misses, and, as we'll see, it enables much more.
So what are embeddings? Embeddings are numerical representations of text that capture the text’s meaning in a way that can be analyzed and compared. The meaning of the text is represented as a list or vector of numbers, where the items or dimensions of the vector represent various latent aspects of the text’s meaning, such as its context, relationships to other words, uniqueness, etc.
Embedding vectors are usually hundreds of dimensions long and can capture a lot of meaning, so an embedding can be "close" to multiple concepts in embedding space, perhaps corresponding to multiple meanings of a word, or to different parts of the input text. As we'll see, this property allows us to use embeddings to generate complex and varied result sets without writing complex code.
While exploring embeddings and prototyping features, I had to deal with two sets of concerns: (1) how to convert the objects I wanted to search (courses, topics, users) into embeddings, and (2) how to store and search these embeddings, i.e. which libraries, models, and databases to use.
Below is a quick rundown of both these concerns. I’ll start with what problems I solved, and then describe some technical details of how I set up the system and did the searches.
Please send any questions, corrections, and ideas to email@example.com. I'm new to this, and while these embedding-based features are working well in production, they're early, unoptimized, and can certainly be improved and extended.
First, some context on Maven and our problems
Maven is a marketplace for cohort-based courses. We've got 500+ courses, around 150,000 users with accounts, and many more users come to our site to find a course for the first time. These courses are time-bound, on specific topics, and cost an average of $800 each. Most people will take 1 or 2 courses in a year, so finding the right course matters quite a lot, and it's our job to connect students with the right course for them.
Finding a course typically happens via a search feature, using traditional search tools like Algolia and Elasticsearch. These tools are relatively quick to set up, but have many deficiencies: they take a lot of effort to fine-tune, can cost a lot in the case of Algolia, and take additional effort to use for recommendations and personalization. Most importantly, they miss non-obvious but highly relevant results because they don’t understand meaning and instead rely on the exact text content of the query string.
When OpenAI released their newest embeddings model, I thought to use it to tackle the search and recommendations problem. It worked surprisingly well surprisingly quickly, and I expanded the use cases from there.
How I thought about the various applications of embeddings
Since any one course on Maven may not have an upcoming cohort for a while but a similar course may, my first goal was to find similar courses to recommend to viewers of a particular course, which meant doing the following three steps:
turning each course in the catalog into a string of text that described it, i.e.:
`course_to_text(course: Course) -> string`
generating and saving embeddings for every one of these strings, i.e.:
`get_embedding(text: string) -> List[float]`
doing a k nearest neighbor (kNN) search across all embeddings to find the closest embeddings and their associated objects, i.e.:
`get_related_courses(query_vector: List[float], k: int) -> List[Course]`
To turn a course into a string of text, I concatenated text from various parts of the course object. First the name of the course, then the short description, then the copy written by the instructor describing who the course is for, etc. The more detail that describes the course, the better, up to the input limit of your model.
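As a minimal sketch of that concatenation (the `Course` fields below are hypothetical stand-ins for whatever your course object actually has):

```python
from dataclasses import dataclass


@dataclass
class Course:
    # Hypothetical fields; use whatever descriptive text your model has.
    name: str
    short_description: str
    who_is_it_for: str


def course_to_text(course: Course) -> str:
    # Concatenate the most descriptive fields; more detail is better,
    # up to the input limit of your embedding model.
    parts = [course.name, course.short_description, course.who_is_it_for]
    return "\n".join(p for p in parts if p)
```

The ordering of the fields matters less than including as much meaningful text as the model will accept.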
This worked quite well right off the bat. Not only would this technique find the same results as a traditional search setup, it would also find similar courses where the words did not overlap, but the meaning or intent did.
For example, “Anyone can invest now” is a personal finance and portfolio building course, and this system finds the following to be similar: “Invest in your future” by the same instructor, “How to angel invest 1-10k checks”, “Level up your trading game”, “Why you should invest in startups”, and “How to model venture funds”, some of which are close matches, some of which are further away but still related.
One important insight about semantic search with embeddings is that we’re just comparing strings of text to each other based on meaning. The sentences can represent any kind of object, and the objects do not need to be of the same type. While the first use case compared courses to courses, the next three use cases compare different kinds of objects.
With our traditional search setup, there were many cases where a query could have returned a course or two, but because there was no overlap in words between the query and our catalog, we returned nothing. Embeddings fix this.
Given a query, we generate the embedding for the query and find the closest course embeddings to it, using the same `get_related_courses(...)` function used above. As with “similar courses”, this produced the expected and obvious results, and also generated non-obvious, surprising, insightful, “correct” results.
The query for “python finance” previously only returned the “Python and Finance” course, but with embeddings it also returns “Excel for Finance” and “Finance for founders”: not direct matches, but good, relevant results.
The query for “nocode websites” previously only returned “Intro to coding for absolute beginners”, but with embeddings it also returns “Level up with Figma”, “Notion masterclass”, and “Frugal MVP”.
Similarly, the query for “matlab” now returns “Intro to R for UX research”, and the query for “stoicism” now returns “Aristotle’s Politics” and “Can we survive technology”.
If a query embedding can match course embeddings, we can also go the other way, from a course to other related text, such as topics. For this I pulled a list of a few dozen topics that we could categorize our courses into. With the help of ChatGPT, I expanded each topic into a sentence describing the topic (the more words and meaning in the input text, the better), and generated embeddings for each topic description.
Now by using the course embedding as the input and the topic embeddings as the targets, we can automatically suggest topics for a course, and can also find courses by topic. These topics can be traditional topics like “product management” or “personal finance”, but also other more interesting and specific kinds of topics, like “technical topics for non-engineers”.
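A sketch of that topic-suggestion step might look like the following. It assumes the embeddings are unit-normalized (as OpenAI's are), so cosine similarity reduces to a plain dot product; the function and variable names are mine, not Maven's.

```python
def dot(a: list[float], b: list[float]) -> float:
    # For unit-length vectors, the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def suggest_topics(course_embedding: list[float],
                   topic_embeddings: dict[str, list[float]],
                   k: int = 3) -> list[str]:
    # Rank every topic by similarity to the course and keep the top k.
    ranked = sorted(topic_embeddings,
                    key=lambda t: dot(course_embedding, topic_embeddings[t]),
                    reverse=True)
    return ranked[:k]
```

Going the other direction (finding courses for a topic) is the same computation with the roles of the inputs swapped.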
This is where it got particularly surprising and exciting for me. I wondered how I could improve our weekly course roundup emails: what if, instead of suggesting the same courses to every person on our email list, I could recommend courses tailored to each of them?
This worked the same way as the rest: turn the user into a string of text, get the embedding, and find the closest courses and topics.
A user could be represented as: “My job title is [the job title from their profile, e.g. ‘head of marketing’]. I have taken this course: [course name and description]. I have expressed interest in this course: [course name and description]”.
If their title is “head of marketing” and they’ve taken a course about personal finance, the nearby embeddings will contain a mix of courses about marketing and personal finance!
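A sketch of that user-to-text step, with hypothetical field names standing in for the real user model:

```python
def user_to_text(job_title: str,
                 taken: list[str],
                 interested: list[str]) -> str:
    # Build the sentence template described above from profile data.
    parts = [f"My job title is {job_title}."]
    parts += [f"I have taken this course: {c}." for c in taken]
    parts += [f"I have expressed interest in this course: {c}." for c in interested]
    return " ".join(parts)
```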
We used these recommendations to run an email test and got 3x the clickthrough rate as compared to the non-personalized course roundup email sent at the same time.
How I stored and searched embeddings
To get these features to work, I had to solve three technical problems: how to get embeddings, where to store them, and how to search for them.
How to get embeddings
To get embeddings, you’ll need a function or API endpoint that takes text and returns the embedding vector. The two directions I explored were OpenAI’s embeddings API and a number of models from HuggingFace. OpenAI was where I started and ended. I found it to be better than the alternatives and fast enough, but it has the downsides that (1) it’s not private, i.e. you need to send your text to OpenAI, and (2) it returns embeddings with 1536 dimensions, which was a problem for some databases (more on that below).
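As a sketch, calling OpenAI's embeddings endpoint needs nothing beyond the standard library; the model name below is the 1536-dimension model available at the time of writing, and the helper names are mine:

```python
import json
import os
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_request(text: str,
                  model: str = "text-embedding-ada-002") -> urllib.request.Request:
    # The endpoint takes a JSON body with the model name and the input text.
    body = json.dumps({"model": model, "input": text}).encode()
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
    )


def get_embedding(text: str) -> list[float]:
    # One network round-trip per call; batch inputs in production.
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)["data"][0]["embedding"]
```

In practice you'd add retries and batching, but the request shape is this simple.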
Where to store embeddings
While prototyping, embeddings can live and be searched in memory, and you don’t need a database. OpenAI’s embeddings example uses a pandas dataframe and Python’s pickle library for persistence. This was good enough for my prototype, but isn’t ideal once you’re sure of a feature, because it’s duplicative and less flexible.
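A prototype along those lines can skip pandas entirely; a plain dict pickled to disk (file name hypothetical) is enough to survive restarts:

```python
import pickle
from pathlib import Path

# Hypothetical file name; maps course id -> embedding vector.
EMBEDDINGS_FILE = Path("course_embeddings.pkl")


def save_embeddings(embeddings: dict[int, list[float]]) -> None:
    EMBEDDINGS_FILE.write_bytes(pickle.dumps(embeddings))


def load_embeddings() -> dict[int, list[float]]:
    if EMBEDDINGS_FILE.exists():
        return pickle.loads(EMBEDDINGS_FILE.read_bytes())
    return {}
```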
Once I was ready for a persistent and shared datastore, I explored a few options before settling on OpenSearch:
We use Postgres as our primary database, and it has the new `pg_vector` extension which enables vector search, but this isn’t supported on AWS RDS, so it wasn’t an option for us, though I did request the extension. For a new project, Supabase’s Postgres might be a good option, and they support `pg_vector`.
Pinecone seems to be the leading vector search service, but their pricing model isn’t great if you want to separate data by type or environment, though you can make that work using metadata.
Chroma is a simple new open-source embeddings database which defaults to in-memory storage, but has persistence options.
Redis can be a simple store, either with the embedding as the entire value, as a value in a hash along with other metadata, or via their newer vector search functions. Overall this works, but is more work than necessary, and not ideal for this use case.
Elasticsearch supports vector similarity search over a `dense_vector` field type, but doesn’t index vectors with over 1024 dimensions, an unfortunate limit since OpenAI’s embeddings have 1536 dimensions.
OpenSearch is Amazon’s fork of Elasticsearch v7.10, supports vector search with up to 10,000 dimensions, and is offered as a managed service on AWS. I chose this route since it also gives us the rest of Elasticsearch, which I thought might be useful for other kinds of search and filtering applications, and it’s well documented.
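As a sketch of what an OpenSearch kNN-enabled index looks like, here is an index body you could pass to `client.indices.create(...)` with the opensearch-py client. The index and field names are hypothetical; the `knn_vector` type and `hnsw` method come from OpenSearch's k-NN plugin.

```python
# Hypothetical index body for storing course embeddings plus the
# metadata fields used later for filtering and sorting.
COURSE_INDEX_BODY = {
    "settings": {"index": {"knn": True}},  # enable the k-NN plugin
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,  # matches OpenAI's embedding size
                "method": {
                    "name": "hnsw",           # approximate kNN index
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                },
            },
            "average_rating": {"type": "float"},
            "next_cohort_date": {"type": "date"},
        }
    },
}
```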
How to search embeddings
Once you’ve got a number of embeddings stored in a database, and a new query embedding you’re trying to find matches for, you would typically do a k nearest neighbor search. Each database listed above has a function to do this.
Pinecone and Chroma have a simply named `query` function, `pg_vector` hooks a `<->` distance operator into the SQL syntax, and OpenSearch and Elasticsearch have similar ways of doing faster approximate as well as slower exact kNN searches within their query DSLs.
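For OpenSearch specifically, the approximate kNN query body is small; a sketch (with a hypothetical `embedding` field name) might be built like this:

```python
def knn_query(query_vector: list[float], k: int) -> dict:
    # Body for OpenSearch's approximate kNN search against a
    # knn_vector field named "embedding" (name is hypothetical).
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                }
            }
        },
    }
```

This dict is what you'd pass as the `body` of a search call against the index holding the embeddings.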
My understanding is elementary here, but here are a few notes and thoughts on kNN searches:
To find similar embeddings, we need some way to tell which embeddings are “close” to other embeddings in their dense multi-dimensional space, i.e. we need a distance function between embeddings. There are a few of these, and cosine similarity seems to be a popular choice.
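Cosine similarity is just the angle between two vectors, computed from the dot product and the vector lengths; a direct implementation:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction,
    # 0.0 means orthogonal (unrelated), -1.0 means opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that for unit-length embeddings the denominators are 1, so the similarity is just the dot product, which is why databases can compute it so cheaply.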
Calculating distances between a large number of embeddings is a demanding operation, and there are a few different algorithms for doing fast distance calculations for nearest neighbor searches. HNSW is one of the more popular algorithms for indexing vectors for fast approximate comparisons. Faiss, a similarity search library by Facebook AI Research, includes a popular implementation of HNSW, and it’s available in OpenSearch and in other databases and libraries.
For my use case, I wanted to find the nearest neighbors but also (1) optionally filter the results for courses that have an upcoming cohort and (2) sort the results based on rating. To do this, I added metadata to the course object, like average rating, next cohort date, etc.
The kNN function typically returns a score for each match, and it can return bad results too, since the closest embedding might still be far away. I played around with my results until I found a minimum score cutoff that filtered out poor matches. I’d like to understand this better.
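The cutoff step itself is trivial; a sketch, with a hypothetical threshold value found by trial and error:

```python
# 0.8 is a made-up cutoff; the right value depends on your model and data.
MIN_SCORE = 0.8


def filter_matches(matches: list[tuple[str, float]],
                   min_score: float = MIN_SCORE) -> list[str]:
    # Each match is (course_name, similarity_score); drop weak matches
    # so "closest" results that are still far away don't get shown.
    return [name for name, score in matches if score >= min_score]
```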
What can you do when computers can understand what things mean?
All the experimentation leading to the features above took about a week, and getting the features into production took another couple of weeks of effort. The tens of thousands of embeddings API calls I made to get to this point cost me just about $10 in total.
It’s incredible to have such a multi-purpose tool, easy to use and cheap to try, and I can imagine a number of other ways to use it at Maven and beyond:
Instead of going from a user to courses, we can also go the other way. When a new course launches, we can find the set of users who are most likely to enjoy it, and email them about it
Within a large cohort, we can find students who are similar to other students, and introduce them to each other
Within a course, we can create embeddings for each piece of content, and use that to search content and answer questions (by passing matching documents to the ChatGPT API as context)
A friend recently did this last idea, and can now ask ChatGPT questions where the answers come from his years of journals, saved articles, and favorite books, powered by embeddings and vector search.
Semantic search quickly went from being a long running dream to a reality, and I'm excited to see the big and small things that we're all able to do with it.
As mentioned above, please send any questions, corrections, and ideas to firstname.lastname@example.org. I'm new to this, and while these embedding-based features are working well in production, they're early, unoptimized, and can certainly be improved and extended.