Selectors

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a title or a description.

At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.

channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"

To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

selectors:
  items:
    selector: ".article"
    enhance: true # default: true

You can control the order of items in your feed:

selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order of items (newest first)

Available options:

  • "reverse": Reverses the order of items (useful when the website shows oldest items first)
  • Default: Items appear in the order they are found on the page

html2rss can follow a single rel="next" pagination chain when you configure selectors.items.pagination.max_pages.

channel:
  url: "https://example.com/news"
selectors:
  items:
    selector: "article"
    pagination:
      max_pages: 3
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"

Behavior:

  • max_pages is the total page budget for the item selector chain, including the initial page.
  • max_pages is capped by the system request ceiling of 10 pages per feed build.
  • Pagination follows strict link[rel~="next"] or a[rel~="next"] targets only.
  • Follow-up pages use the current page's effective origin after redirects.
  • Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
  • The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.

While you can define any named selector, only the following are used in the final RSS feed:

| RSS 2.0 Tag | html2rss Name | Notes                          |
| ----------- | ------------- | ------------------------------ |
| title       | title         |                                |
| description | description   |                                |
| link        | url           |                                |
| author      | author        |                                |
| category    | categories    |                                |
| guid        | guid          |                                |
| enclosure   | enclosure     |                                |
| pubDate     | published_at  |                                |
| comments    | comments      | ⚠️ Not currently implemented   |

Each selector can be configured with the following options:

| Name         | Description                                            |
| ------------ | ------------------------------------------------------ |
| selector     | The CSS selector for the target element.               |
| extractor    | The extractor to use for this selector.                |
| attribute    | The attribute name (required for attribute extractor). |
| static       | The static value (required for static extractor).      |
| post_process | A list of post-processors to apply to the value.       |

Extractors define how to get the value from a selected element.

  • text: The inner text of the element (default).
  • html: The outer HTML of the element.
  • href: The value of the href attribute.
  • attribute: The value of a specified attribute.
  • static: A static value.
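
As a minimal sketch of how these extractors combine in one configuration, the fragment below uses only the options listed in the table above. The class names, the `data-genre` attribute, and the static author value are hypothetical placeholders, not taken from a real site.

```yaml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1" # no extractor given, so the default (text) applies
  url:
    selector: "a"
    extractor: "href" # reads the link target instead of the link text
  genre:
    selector: ".teaser" # hypothetical element carrying a data attribute
    extractor: "attribute"
    attribute: "data-genre" # required by the attribute extractor
  author:
    extractor: "static"
    static: "Example Newsroom" # hypothetical fixed value used for every item
```

Here `genre` is a custom named selector; on its own it does not appear in the feed, but other selectors can refer to it by name.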

Post-processors manipulate the extracted value.

  • gsub: Performs a global substitution on a string.
  • html_to_markdown: Converts HTML to Markdown.
  • markdown_to_html: Converts Markdown to HTML.
  • parse_time: Parses a string into a Time object.
  • parse_uri: Resolves a relative URL against channel.url and returns the normalized URL string.
  • sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
  • substring: Extracts a substring from a string.
  • template: Creates a new string from a template and other selector values. Use %{self} for the current selector value.
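
The following sketch shows how a `template` post-processor might combine the current value with another selector's value. It assumes that each `post_process` entry is a mapping with a `name` key and that the template text goes under a `string` key; neither key name is stated on this page, so check them against the post-processor reference before relying on them. The `.price` selector is hypothetical.

```yaml
selectors:
  items:
    selector: ".article"
  price:
    selector: ".price" # hypothetical helper selector, referenced by name below
  title:
    selector: "h1"
    post_process:
      - name: "template" # assumed key: `name`
        string: "%{self} (%{price})" # assumed key: `string`
```

In this sketch, `%{self}` stands for the extracted title and `%{price}` for the value of the `price` selector.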

Always use the sanitize_html post-processor for any HTML content to prevent security risks.
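
In practice, the sanitizing advice might look like the sketch below. It assumes each `post_process` entry is a mapping with a `name` key, which this page does not state explicitly; the `.content` selector is hypothetical.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
  description:
    selector: ".content" # hypothetical container for the article body
    extractor: "html" # keeps markup, so the value should be sanitized
    post_process:
      - name: "sanitize_html" # assumed key: `name`
```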

To add categories to an item, provide a list of selector names to the categories selector.

selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch

To create a custom GUID for an item, provide a list of selector names to the guid selector.

selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url

To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"

For detailed documentation on the Ruby API, see the official YARD documentation.