Selectors

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a title or a description.

At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.

channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"

To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

selectors:
  items:
    selector: ".article"
    enhance: true # default: true

You can control the order of items in your feed:

selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order of items (newest first)

Available options:

  • "reverse": Reverses the order of items (useful when the website shows oldest items first)
  • Default: Items appear in the order they are found on the page

While you can define any named selector, only the following are used in the final RSS feed:

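| RSS 2.0 Tag | html2rss Name | Notes                        |
|-------------|---------------|------------------------------|
| title       | title         |                              |
| description | description   |                              |
| link        | url           |                              |
| author      | author        |                              |
| category    | categories    |                              |
| guid        | guid          |                              |
| enclosure   | enclosure     |                              |
| pubDate     | published_at  |                              |
| comments    | comments      | ⚠️ Not currently implemented |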

Each selector can be configured with the following options:

| Name         | Description                                               |
|--------------|-----------------------------------------------------------|
| selector     | The CSS selector for the target element.                  |
| extractor    | The extractor to use for this selector.                   |
| attribute    | The attribute name (required for the attribute extractor). |
| static       | The static value (required for the static extractor).     |
| post_process | A list of post-processors to apply to the value.          |

Extractors define how to get the value from a selected element.

  • text: The inner text of the element (default).
  • html: The outer HTML of the element.
  • href: The value of the href attribute.
  • attribute: The value of a specified attribute.
  • static: A static value.
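
As an illustration, the fragment below uses each extractor once. It is a minimal sketch: the CSS selectors and the static author value are hypothetical, and only the option names (selector, extractor, attribute, static) come from the table above.

```yaml
selectors:
  items:
    selector: ".article"       # hypothetical item container
  title:
    selector: "h2"             # no extractor given, so text (the default) is used
  description:
    selector: ".summary"       # hypothetical class name
    extractor: "html"          # outer HTML of the matched element
  url:
    selector: "a"
    extractor: "href"          # value of the href attribute
  guid:
    selector: "a"
    extractor: "attribute"
    attribute: "data-id"       # hypothetical attribute holding a stable ID
  author:
    extractor: "static"
    static: "Example Author"   # same value for every item
```

A value taken with the html extractor is raw markup, so it is usually paired with the sanitize_html post-processor.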

Post-processors manipulate the extracted value.

  • gsub: Performs a global substitution on a string.
  • html_to_markdown: Converts HTML to Markdown.
  • markdown_to_html: Converts Markdown to HTML.
  • parse_time: Parses a string into a Time object.
  • parse_uri: Parses a string into a URI object.
  • sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
  • substring: Extracts a substring from a string.
  • template: Creates a new string from a template and other selector values.

Always use the sanitize_html post-processor for any HTML content to prevent security risks.
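
As a rough sketch of how post-processors might be attached to a selector, the fragment below sanitizes an HTML description and parses a timestamp. It assumes each post_process entry is a mapping with a name key, a shape that is not spelled out in this section, and the CSS selectors are hypothetical; check the post-processor reference for the exact options each one accepts.

```yaml
selectors:
  items:
    selector: ".article"        # hypothetical item container
  description:
    selector: ".content"        # hypothetical class name
    extractor: "html"
    post_process:
      - name: "sanitize_html"   # assumed entry shape: a mapping with a `name` key
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time"      # turns the extracted string into a Time object
```

Because post_process is a list, several post-processors can be listed for one selector.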

To add categories to an item, provide a list of selector names to the categories selector.

selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch

To create a custom GUID for an item, provide a list of selector names to the guid selector.

selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url

To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"

For detailed documentation on the Ruby API, see the official YARD documentation.