Selectors
The selectors scraper gives you fine-grained control over content extraction using CSS selectors.
A valid RSS item requires at least a title or a description.
Basic Configuration
At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.

    channel:
      url: "https://example.com"
    selectors:
      items:
        selector: ".article"
      title:
        selector: "h1"

Automatic Item Enhancement
To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

    selectors:
      items:
        selector: ".article"
        enhance: true # default: true

Item Ordering
You can control the order of items in your feed:

    selectors:
      items:
        selector: ".article"
        order: "reverse" # Reverse the order of items (newest first)

Available options:
- "reverse": Reverses the order of items (useful when the website shows oldest items first)
- Default: Items appear in the order they are found on the page
Paginated Feeds
html2rss can follow a single rel="next" pagination chain when you configure selectors.items.pagination.max_pages.

    channel:
      url: "https://example.com/news"
    selectors:
      items:
        selector: "article"
        pagination:
          max_pages: 3
      title:
        selector: "h1"
      url:
        selector: "a"
        extractor: "href"

Behavior:
- max_pages is the total page budget for the item selector chain, including the initial page.
- max_pages is capped by the system request ceiling of 10 pages per feed build.
- Pagination follows strict link[rel~="next"] or a[rel~="next"] targets only.
- Follow-up pages use the current page’s effective origin after redirects.
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
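
As a rough illustration of the page budget described above, the sketch below annotates how max_pages: 3 would play out. The URL and CSS selectors are placeholders, and the comments only restate the documented rules; they are not observed output.

```yaml
channel:
  url: "https://example.com/news" # placeholder URL
selectors:
  items:
    selector: "article"
    pagination:
      # Total budget, counting the initial page:
      # page 1 is channel.url, so at most 2 further pages
      # are fetched by following rel="next" links.
      max_pages: 3
  title:
    selector: "h1"
```

If the site exposes a next link on the first page only, the feed would be built from two pages, because pagination stops as soon as no next link is found.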
RSS 2.0 Selectors
While you can define any named selector, only the following are used in the final RSS feed:
| RSS 2.0 Tag | html2rss Name | Notes |
|---|---|---|
| title | title | |
| description | description | |
| link | url | |
| author | author | |
| category | categories | |
| guid | guid | |
| enclosure | enclosure | |
| pubDate | published_at | |
| comments | comments | ⚠️ Not currently implemented |
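
To make the mapping concrete, here is a minimal sketch that fills several of the tags listed above. The CSS class names (.article, .summary, .byline) are hypothetical and would need to match the target page.

```yaml
selectors:
  items:
    selector: ".article"   # hypothetical class name
  title:                   # becomes <title>
    selector: "h2"
  description:             # becomes <description>
    selector: ".summary"   # hypothetical class name
  url:                     # becomes <link>
    selector: "a"
    extractor: "href"
  author:                  # becomes <author>
    selector: ".byline"    # hypothetical class name
```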
Selector Options
Each selector can be configured with the following options:
| Name | Description |
|---|---|
| selector | The CSS selector for the target element. |
| extractor | The extractor to use for this selector. |
| attribute | The attribute name (required for attribute extractor). |
| static | The static value (required for static extractor). |
| post_process | A list of post-processors to apply to the value. |
Extractors
Extractors define how to get the value from a selected element.
- text: The inner text of the element (default).
- html: The outer HTML of the element.
- href: The value of the href attribute.
- attribute: The value of a specified attribute.
- static: A static value.
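
The following sketch puts each extractor side by side in one configuration. The class names, the image_url selector name, and the static author value are made up for illustration. The `- name:` shape of the post_process entry is an assumption about how post-processors are written; this section does not show it.

```yaml
selectors:
  items:
    selector: ".article"        # hypothetical class name
  title:
    selector: "h2"              # no extractor given: text is the default
  description:
    selector: ".body"           # hypothetical class name
    extractor: "html"           # outer HTML of the element
    post_process:
      - name: "sanitize_html"   # assumed entry shape; sanitizes the extracted HTML
  url:
    selector: "a"
    extractor: "href"
  image_url:                    # hypothetical custom selector name
    selector: "img"
    extractor: "attribute"
    attribute: "src"            # required by the attribute extractor
  author:
    extractor: "static"
    static: "Example Newsroom"  # required by the static extractor
```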
Post-Processors
Post-processors manipulate the extracted value.
- gsub: Performs a global substitution on a string.
- html_to_markdown: Converts HTML to Markdown.
- markdown_to_html: Converts Markdown to HTML.
- parse_time: Parses a string into a Time object.
- parse_uri: Resolves a relative URL against channel.url and returns the normalized URL string.
- sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
- substring: Extracts a substring from a string.
- template: Creates a new string from a template and other selector values. Use %{self} for the current selector value.
Always use the sanitize_html post-processor for any HTML content to prevent security risks.
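
A minimal sketch of post-processors in use might look like the following. The `name` and `string` keys and the `%{price}` reference to another selector are assumptions about the option names, which this section does not spell out. The price selector and the class names are hypothetical.

```yaml
selectors:
  items:
    selector: ".article"             # hypothetical class name
  price:                             # hypothetical helper selector
    selector: ".price"
  title:
    selector: "h2"
    post_process:
      - name: "template"             # assumed entry shape
        # %{self} is this selector's own value;
        # %{price} is assumed to pull in the price selector's value.
        string: "%{self} (%{price})"
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time"           # turns the attribute string into a Time object
```

Because post_process takes a list, several processors can be listed for the same selector.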
Advanced Usage
Categories
To add categories to an item, provide a list of selector names to the categories selector.

    selectors:
      genre:
        selector: ".genre"
      branch:
        selector: ".branch"
      categories:
        - genre
        - branch

Custom GUID
To create a custom GUID for an item, provide a list of selector names to the guid selector.

    selectors:
      title:
        selector: "h1"
      url:
        selector: "a"
        extractor: "href"
      guid:
        - url

Enclosures
To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

    selectors:
      items:
        selector: ".post"
      title:
        selector: "h2"
      enclosure:
        selector: "audio"
        extractor: "attribute"
        attribute: "src"
        content_type: "audio/mp3"

For detailed documentation on the Ruby API, see the official YARD documentation.