SSARE: Semantic Search Article Recommendation Engine

Open Source Political Intelligence needs a news brain!

On a mission to find the #NeedleInTheHayStack

SSARE stands for Semantic Search Article Recommendation Engine.

It is an open-source service that orchestrates:

  • Scraping via arbitrary sourcing scripts
  • Processing into vector representations
  • Named Entity Recognition (e.g., locations, persons, organisations, geopolitical entities)
  • Geocoding of recognized locations
  • Storing and querying of news articles

Delivering:

  • An up-to-date Vector Search news retrieval endpoint for RAG/LLM applications
  • An up-to-date news SQL database for various applications
  • A resource to track entities (such as affiliations or organisations) across arbitrary sources with simple sorting scripts
  • GeoJSON for article locations or related entities on a map

Spin up your own news brain!


Semantic search retrieves related articles by the contextual similarity of the situations they describe in natural language. This retrieval technique opens up new ways of ranking and retrieving information, and it is a popular choice for giving Large Language Models a "memory" of relevant articles. This project aims to combine the amenities of a classical SQL database with those of a vector database and to deliver a useful, scalable and collectively engineered data stream.
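To make "contextual similarity" concrete, here is a minimal sketch of ranking articles by cosine similarity between embedding vectors. It is an illustration only, not SSARE's actual retrieval code: the three-dimensional vectors, the `TOY_ARTICLES` mapping and the function names are made up for this example, and a real setup would use model-generated embeddings stored in the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat as unrelated
    return dot / (norm_a * norm_b)


def rank_articles(query_vector, article_vectors, top_k=3):
    """Return the top_k (headline, score) pairs, most similar first."""
    scored = [
        (headline, cosine_similarity(query_vector, vector))
        for headline, vector in article_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
TOY_ARTICLES = {
    "Parliament passes new budget": [0.9, 0.1, 0.0],
    "Local team wins championship": [0.0, 0.2, 0.9],
    "Finance minister defends spending plan": [0.8, 0.3, 0.1],
}

if __name__ == "__main__":
    # A query vector pointing in the "politics/finance" direction.
    for headline, score in rank_articles([1.0, 0.0, 0.0], TOY_ARTICLES, top_k=2):
        print(f"{score:.3f}  {headline}")
```

The two politics-related headlines rank above the sports headline because their vectors point in nearly the same direction as the query, regardless of shared keywords.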


Introduction

SSARE serves as an efficient and scalable resource for semantic search and article recommendations, catering primarily to political news data.

The engine adapts to any kind of article or document. It requires only a sourcing script that outputs the data as a dataframe with the following columns:

url | headline | paragraphs | source -- This is all your script needs to produce
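As a rough sketch of what such a sourcing script might assemble, the example below builds rows with exactly these four columns and checks that none is missing. The helper names (`build_rows`, `validate_rows`), the input shape, and the choice to join paragraphs into a single string are assumptions for illustration; SSARE's own scripts may differ.

```python
REQUIRED_COLUMNS = ["url", "headline", "paragraphs", "source"]


def build_rows(scraped_articles, source_name):
    """Turn scraped items into rows with the four required columns.

    `scraped_articles` is assumed to be a list of dicts with the keys
    "url", "headline" and "paragraphs" (a list of strings).
    """
    rows = []
    for article in scraped_articles:
        rows.append({
            "url": article["url"],
            "headline": article["headline"].strip(),
            # Assumption: paragraphs are stored as one joined string.
            "paragraphs": " ".join(article["paragraphs"]),
            "source": source_name,
        })
    return rows


def validate_rows(rows):
    """Raise ValueError if any row lacks a required column."""
    for index, row in enumerate(rows):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing:
            raise ValueError(f"row {index} is missing columns: {missing}")
    return True
```

With pandas installed, the rows can then be turned into the expected dataframe with `pandas.DataFrame(rows, columns=REQUIRED_COLUMNS)`.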

Once integrated, SSARE processes these articles with an embedding model (currently hardcoded; support for a model of your choice is upcoming), stores their vector representations in a Qdrant vector database, and maintains a full copy in a PostgreSQL database.

Furthermore, the text of every article undergoes Named Entity Recognition (NER), which extracts entities such as geopolitical entities, affiliations, persons and organisation names.

The GPE (Geopolitical Entity) tags are then geocoded. For the recognised location "Berlin", for example, the geocoding step returns the latitude and longitude and passes the result on as a GeoJSON file.
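The geocoding step can be pictured with the following minimal sketch, which turns recognised location names into a GeoJSON FeatureCollection. The `TOY_GAZETTEER` lookup is a hypothetical stand-in for a real geocoding service, the Berlin coordinates are approximate, and the function names are not taken from SSARE. Note that GeoJSON stores coordinates in longitude, latitude order.

```python
import json

# Hypothetical stand-in for a geocoding service: name -> (latitude, longitude).
TOY_GAZETTEER = {
    "Berlin": (52.52, 13.405),  # approximate coordinates
}


def geocode(name):
    """Return (latitude, longitude) for a known place name, else None."""
    return TOY_GAZETTEER.get(name)


def to_feature_collection(location_names):
    """Build a GeoJSON FeatureCollection from recognised location names."""
    features = []
    for name in location_names:
        coordinates = geocode(name)
        if coordinates is None:
            continue  # skip locations the geocoder cannot resolve
        latitude, longitude = coordinates
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
            "properties": {"name": name},
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    print(json.dumps(to_feature_collection(["Berlin", "Atlantis"]), indent=2))
```

Unresolvable names such as "Atlantis" are simply skipped here; a production pipeline might log them or retry with a different geocoder.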

The final result is a live PostgreSQL database in which the articles are saved with the following data schema (shown as a pydantic model):