Semantic Search Revolutionizes R Package Discovery in CRAN/E

Hybrid full-text metadata and semantic search in a redesigned user interface.


CRAN/E 2.7 introduces semantic search, changing how users discover R packages and authors. Rather than relying on keyword matching alone, search now also takes the meaning of a query into account, which improves the precision and relevance of results.

You can now describe what you are looking for in plain language, so it is possible to find packages without knowing their name or author.

The revamped search interface now offers both lexical and semantic search options. Lexical search provides the familiar keyword-based results, while the new semantic search delves deeper, identifying packages and authors based on conceptual similarity, even if they don't share exact keywords. This allows users to discover relevant packages they might have otherwise missed.

Screenshot of the CRAN/E 2.0 start page
Screenshot of the new search view with lexical and semantic search options
Close-up of new references in updated UI

Embeddings

To power this semantic search, CRAN/E has generated approximately 290,000 embeddings for around 22,000 R packages, or roughly 13 per package. These embeddings, created by Gemini's text-embedding-004 model, represent the semantic meaning of package information such as titles, descriptions, synopses (generated by Gemini 1.5 Flash), linked HTML content, PDFs, and Markdown files. Covering this range of sources gives the search a rich picture of each package's purpose and functionality.

All embeddings are efficiently stored and queried within a self-hosted Postgres database using the pgvector extension, allowing for fast and scalable semantic search operations.
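As a rough illustration of the kind of lookup pgvector supports, the sketch below ranks stored vectors by cosine distance, which is the quantity pgvector's `<=>` operator computes. The table and column names in the SQL comment, the package labels, and the tiny three-dimensional vectors are all hypothetical. CRAN/E's actual schema, queries, and distance metric are not described in this post.

```python
import math

# Hypothetical shape of a pgvector nearest-neighbour query
# (table and column names are assumptions, not CRAN/E's schema):
#   SELECT package, source FROM embeddings
#   ORDER BY embedding <=> %(query)s LIMIT 2;


def cosine_distance(a, b):
    """Return 1 - cosine similarity, as pgvector's <=> operator does."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, limit=3):
    """Return the `limit` rows closest to `query`, nearest first."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row["embedding"]))
    return ranked[:limit]


# Toy data: made-up 3-dimensional vectors standing in for real embeddings.
ROWS = [
    {"package": "pkgA", "source": "description", "embedding": [0.9, 0.1, 0.0]},
    {"package": "pkgB", "source": "pdf", "embedding": [0.0, 1.0, 0.2]},
    {"package": "pkgC", "source": "html", "embedding": [0.7, 0.3, 0.1]},
]

if __name__ == "__main__":
    for row in nearest([1.0, 0.0, 0.0], ROWS, limit=2):
        print(row["package"], row["source"])
```

In a real deployment the sorting would happen inside Postgres, typically backed by a vector index, rather than in application code.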

Enhanced Search Features

Beyond semantic search, CRAN/E 2.7 also introduces full-text search for package metadata, providing another powerful tool for discovery. The updated search interface clearly indicates the source of each search result, linking directly to the relevant PDF, HTML page, or other reference. This transparency allows users to quickly understand the context of each match and assess its relevance.
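Because each package has several embeddings (one per title, PDF, HTML page, and so on), a search can match the same package more than once. One plausible way to show a single result per package while still pointing at the matching reference is sketched below. The field names, file names, and distance values are hypothetical, and this is not a description of how CRAN/E itself groups results.

```python
def best_hit_per_package(hits):
    """Collapse source-level hits to one per package, keeping the closest match."""
    best = {}
    for hit in hits:
        current = best.get(hit["package"])
        if current is None or hit["distance"] < current["distance"]:
            best[hit["package"]] = hit
    # Smallest distance first, so the most relevant package leads the list.
    return sorted(best.values(), key=lambda hit: hit["distance"])


# Made-up hits: two sources matched for pkgA, one for pkgB.
HITS = [
    {"package": "pkgA", "source": "vignette.pdf", "distance": 0.21},
    {"package": "pkgB", "source": "README.md", "distance": 0.18},
    {"package": "pkgA", "source": "index.html", "distance": 0.12},
]

if __name__ == "__main__":
    for hit in best_hit_per_package(HITS):
        print(hit["package"], "matched via", hit["source"])
```

Keeping the `source` field on the surviving hit is what lets the interface link straight to the PDF, HTML page, or other reference that produced the match.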

Screenshot of search results showing linked references
Screenshot of search results showing author and packages