Selectors

Name: html2rss
Author: html2rss

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a title or a description.

Basic Configuration

At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.

channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"

Automatic Item Enhancement

To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

selectors:
  items:
    selector: ".article"
    enhance: true # default: true

Item Ordering

You can control the order of items in your feed:

selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order of items (newest first)

Available options:

"reverse": Reverses the order of items (useful when the website shows oldest items first)
Default: Items appear in the order they are found on the page

RSS 2.0 Selectors

While you can define any named selector, only the following are used in the final RSS feed:

RSS 2.0 Tag	`html2rss` Name	Notes
`title`	`title`
`description`	`description`
`link`	`url`
`author`	`author`
`category`	`categories`
`guid`	`guid`
`enclosure`	`enclosure`
`pubDate`	`published_at`
`comments`	`comments`	⚠️ Not currently implemented

Selector Options

Each selector can be configured with the following options:

Name	Description
`selector`	The CSS selector for the target element.
`extractor`	The extractor to use for this selector.
`attribute`	The attribute name (required for `attribute` extractor).
`static`	The static value (required for `static` extractor).
`post_process`	A list of post-processors to apply to the value.

Extractors

Extractors define how to get the value from a selected element.

text: The inner text of the element (default).
html: The outer HTML of the element.
href: The value of the href attribute.
attribute: The value of a specified attribute.
static: A static value.

Post-Processors

Post-processors manipulate the extracted value.

gsub: Performs a global substitution on a string.
html_to_markdown: Converts HTML to Markdown.
markdown_to_html: Converts Markdown to HTML.
parse_time: Parses a string into a Time object.
parse_uri: Parses a string into a URI object.
sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
substring: Extracts a substring from a string.
template: Creates a new string from a template and other selector values.

Always use the sanitize_html post-processor for any HTML content to prevent security risks.

Advanced Usage

Custom GUID

To create a custom GUID for an item, provide a list of selector names to the guid selector.

selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url

Enclosures

To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"

For detailed documentation on the Ruby API, see the official YARD documentation.