
Under the Hood: Collections of Similar Articles
Author: Denis Bolkovskis

TLDR

This blog post explores the technical challenges and solutions involved in implementing 'collections' at daily.dev, which aggregate multiple articles on similar topics into a unified post. The focus is on defining and detecting content similarity using machine learning, specifically through vector embeddings and OpenAI's embedding models. The article discusses the complexities of content similarity, the technologies utilized (like Redis and pgvector), and the process of creating, updating, and publishing collections. It also highlights the issues faced with distributed content processing and how these were addressed to ensure data consistency and avoid redundancies.

Preface

We collect various development-related content, including news and other hot discussion topics. Sometimes a single event triggers reactions from multiple sources: a shiny new release, a big data leak, a controversy, and so on. In such cases we end up with multiple pieces of content on the same topic.

The product team decided it would be nice to focus users' attention and reactions in a single place. Hence, collections were introduced: a list of closely related articles combined into a single post.

The idea of collections is quite simple, but implementing it required solving some challenging technical problems.

Similar posts search

Determining content similarity isn't a straightforward task. One challenge lies in defining "similarity" itself. Do we solely focus on matching words, or should we delve deeper into the meaning and intent behind the content? Additionally, the vast amount of data can make comparisons computationally expensive while also introducing issues like data bias and skewed distributions, potentially hindering the accuracy of similarity detection.

These complexities gave rise to various solutions, each addressing different contexts and use cases. Luckily, the recent rise of LLMs has provided a relatively straightforward and easy-to-use technology.

Embeddings and vector similarity

In the captivating world of machine learning and data representation, vector embeddings take center stage. They act as a numerical translation of objects or concepts, condensing them into multidimensional spaces. Imagine them as compact packets of information, holding the essence and key features of the original data.

Object representation in vector space

Since embeddings are just multidimensional vectors, we can perform a wide range of mathematical operations on them, including distance calculation. That essentially reduces the aforementioned similarity search problem to some basic number crunching: we calculate the distance between the embedding vectors representing two pieces of content, and if the distance is below some threshold, the content is similar.
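To make the idea concrete, here is a minimal sketch in Python; the threshold value is a made-up illustration, not our production setting:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine distance = 1 - cosine similarity; 0 means identical direction.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative threshold; in practice it is tuned empirically.
SIMILARITY_THRESHOLD = 0.15

def are_similar(emb_a: np.ndarray, emb_b: np.ndarray) -> bool:
    return cosine_distance(emb_a, emb_b) < SIMILARITY_THRESHOLD
```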

Technologies used

To generate the embeddings themselves, we use OpenAI's embeddings model via its API. Of course, they can be generated in a number of other ways, including with self-hosted open-source LLMs.
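As a rough sketch, the call looks like this with the official Python client; the post does not pin down the exact model, so the model name below is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # Model name is an assumption for illustration; any embeddings model works.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```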

Embedding vectors can be stored in any conventional way; performing calculations on them is the main challenge, since a naive, unoptimized approach becomes unusable at real-world data volumes. For example, we run the similarity search against articles from the past two days. That means more than 1,000 comparisons for a single similarity check, and with vectors of roughly 1,500 dimensions (typical for OpenAI embeddings), each comparison costs about 1,500 multiplications plus as many additions, so a single check boils down to over 3,000,000 operations. We did not run benchmarks, so we cannot give you exact performance difference numbers.

There are many solutions for dealing with vectors, but we preferred to use something we were familiar with, so the choice narrowed down to the Postgres pgvector extension and Redis. At the time of the collections development we were still experimenting with pgvector, so we ended up choosing Redis.
So, a Redis vector index and search is used for collections, but we use pgvector in some of our other projects, which were started later.
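For a sense of what that looks like, here is a hedged sketch of a Redis vector index and KNN query using redis-py; the index name, key prefix, and dimension are illustrative assumptions, not our actual configuration:

```python
import numpy as np
import redis
from redis.commands.search.field import VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis()

# Index name, key prefix, and dimension (1536 fits OpenAI embeddings)
# are assumptions for illustration.
r.ft("posts_idx").create_index(
    (
        VectorField(
            "embedding",
            "HNSW",
            {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"},
        ),
    ),
    definition=IndexDefinition(prefix=["post:"], index_type=IndexType.HASH),
)

def store_post(post_id: str, embedding: list[float]) -> None:
    # Vectors are stored as raw float32 bytes in a hash field.
    r.hset(f"post:{post_id}", mapping={
        "embedding": np.asarray(embedding, dtype=np.float32).tobytes(),
    })

def find_similar(embedding: list[float], k: int = 10):
    # KNN search: the k closest posts by cosine distance.
    query = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("score")
        .dialect(2)
    )
    vec = np.asarray(embedding, dtype=np.float32).tobytes()
    return r.ft("posts_idx").search(query, query_params={"vec": vec})
```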

Collection creation, update and publish

The collection creation process is relatively straightforward and follows this logic (a simplified sketch of the flow follows the list):

  • When a new post is registered in the system, its embedding vector is generated and stored.
  • A similarity search between the new record and previously saved embeddings is started. It is a computationally intensive task, so we search not the whole content database, but only recent entries, up to a few days old.
  • If a similar collection is found, the post is added to it, and the collection entry is updated with the newly added post.
  • If only similar posts are found, a new collection is created, and the included entries are marked as belonging to it.
  • The new or updated collection is then processed, including generation of its summary and embedding.
  • The collection content is published for display to users.
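Here is that flow as a hedged Python sketch; the helper names are made up for illustration and do not mirror our actual services:

```python
def process_new_post(post):
    # Generate and store the embedding for the new post.
    embedding = embed(post.content)
    store_embedding(post.id, embedding)

    # Search only recent entries (up to a few days old) to keep this cheap.
    matches = find_similar_recent(embedding, max_age_days=2)

    # Prefer an existing collection over creating a new one.
    collection = next((m.collection for m in matches if m.collection), None)
    if collection is not None:
        add_post_to_collection(collection, post)
    elif matches:
        collection = create_collection([post, *matches])

    if collection is not None:
        refresh_collection(collection)  # regenerate summary and embedding
        publish(collection)
```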

As we can see, some of the steps above update multiple related content entries at the same time. If several parallel processes need to update the same entries, we have to handle that gracefully to avoid conflicts and data inconsistency.

Collection creation flow

Challenges of distributed content processing

One of the main challenges in the collections feature came from the distributed nature of the content processing system. When a new piece of content is added to the system, it is processed in multiple steps by different smaller services, and multiple content pieces can be processed in parallel at the same time.

Before collections were introduced, this caused no issues, because each content piece was processed in isolation, independently of the others. But with collections, content records need to be updated both when a post is added to a collection and when a collection is updated with new content. This caused inconsistencies and data overwrites in some edge cases. Below are some of the cases we had to solve.

Parallel collection creation

One of the relatively frequent issues we ran into was multiple collections being created from the same content. It happened as follows: let's say three similar content pieces A, B, and C are added for processing at almost the same time.

  • A is processed, and no similarities are found.
  • B finds a similarity with A and ends up creating a collection AB.
  • C is processed before collection AB is done processing, and creates another collection ABC.

This is undesirable, because we want to concentrate all similar posts in the same collection, not spread them across multiple ones.

The solution was relatively simple: when creating a collection record, all related entries are marked as part of it within the same transaction. When another entry then finds a similarity with an entry that is already part of a collection, it is added to that collection instead of creating a new one.
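In sketch form, with psycopg and hypothetical table and column names, the transactional part looks like this:

```python
import psycopg

def create_collection_with_members(conn: psycopg.Connection,
                                   post_ids: list[str]) -> str:
    # One transaction: the collection row and the membership marks land
    # together, so a concurrent process sees either both or neither.
    with conn.transaction():
        with conn.cursor() as cur:
            cur.execute("INSERT INTO collections DEFAULT VALUES RETURNING id")
            collection_id = cur.fetchone()[0]
            cur.execute(
                "UPDATE posts SET collection_id = %s WHERE id = ANY(%s)",
                (collection_id, post_ids),
            )
    return collection_id
```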

Content entry metadata overwrite

Another issue was related to potential data overwrites by concurrent entry processing. It happened because when processing starts, the current entry record is read and goes through cleanup and enrichment steps, after which the record is updated. If something is written by a parallel process between the read and the update, that data could be lost. Here is an example of a situation where that can happen:

  • There is a collection AB.
  • Entries C and D start processing.
  • Entry C finds a similarity with the collection and is added to it.
  • Collection AB(C) starts its update process.
  • Entry D also finds a similarity with the collection and updates the collection record. The collection update then completes and overwrites the addition of D.

We used a SELECT FOR UPDATE lock with some additional data merge logic to fix that issue. Before writing, the entry is read again and locked. If its update timestamp is newer than the one we started with, some metadata fields are refreshed before writing, and arrays are merged. The latter is especially important for collections, which can be updated at any time with additional entries.
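A hedged sketch of that pattern, again with illustrative schema and merge rules:

```python
import psycopg

def update_collection(conn: psycopg.Connection, collection_id: str,
                      new_post_ids: list[str]) -> None:
    with conn.transaction():
        with conn.cursor() as cur:
            # Re-read and lock the row so concurrent writers wait behind us.
            cur.execute(
                "SELECT post_ids FROM collections WHERE id = %s FOR UPDATE",
                (collection_id,),
            )
            (current_post_ids,) = cur.fetchone()

            # Merge arrays instead of overwriting, so entries added by a
            # parallel process since our original read are preserved.
            merged = sorted(set(current_post_ids) | set(new_post_ids))

            cur.execute(
                "UPDATE collections SET post_ids = %s, updated_at = now() "
                "WHERE id = %s",
                (merged, collection_id),
            )
```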

Diagram illustrating some of the aforementioned techniques in action

Parallel collection updates

Results and conclusions

Engineering-wise, collections were a nice challenge. They allowed us to get hands-on with a new technology and opened up more possibilities for its use in the future. They also helped us find some weak spots in our content processing architecture and patch them up.

In general, embeddings are a hot topic in technology right now, riding the broader AI wave, and they open up many exciting possibilities. We can use them to offer users more closely matching content, apply RAG when generating search answers, perform content clustering by various criteria, and a lot more.
