Selectors
The selectors scraper gives you fine-grained control over content extraction using CSS selectors.
A valid RSS item requires at least a `title` or a `description`.
Basic Configuration
At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles.

```yaml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
```

Automatic Item Enhancement
To simplify configuration, html2rss can automatically extract the title, URL, and image from each item. This feature is enabled by default.

```yaml
selectors:
  items:
    selector: ".article"
    enhance: true # default: true
```

Item Ordering
You can control the order of items in your feed:

```yaml
selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order of items (newest first)
```

Available options:

- `"reverse"`: Reverses the order of items (useful when the website shows oldest items first)
- Default: Items appear in the order they are found on the page
RSS 2.0 Selectors
While you can define any named selector, only the following are used in the final RSS feed:
| RSS 2.0 Tag | html2rss Name | Notes |
|---|---|---|
| `title` | `title` | |
| `description` | `description` | |
| `link` | `url` | |
| `author` | `author` | |
| `category` | `categories` | |
| `guid` | `guid` | |
| `enclosure` | `enclosure` | |
| `pubDate` | `published_at` | |
| `comments` | `comments` | ⚠️ Not currently implemented |
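
As a rough illustration of how the names in the table map onto a configuration, the sketch below defines one selector per tag for a few of the supported names. The CSS selectors (`.article`, `.summary`, `.byline`) are hypothetical and would need to match the markup of the page you are scraping.

```yaml
# Hypothetical page structure; adjust every CSS selector to the real markup.
selectors:
  items:
    selector: ".article"
  title:          # becomes <title>
    selector: "h2"
  description:    # becomes <description>
    selector: ".summary"
  url:            # becomes <link>
    selector: "a"
    extractor: "href"
  author:         # becomes <author>
    selector: ".byline"
```

A selector defined under any other name (for example `genre`) does not produce a tag by itself, but it can still be referenced from selectors such as `categories` or `guid`.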
Selector Options
Each selector can be configured with the following options:
| Name | Description |
|---|---|
| `selector` | The CSS selector for the target element. |
| `extractor` | The extractor to use for this selector. |
| `attribute` | The attribute name (required for `attribute` extractor). |
| `static` | The static value (required for `static` extractor). |
| `post_process` | A list of post-processors to apply to the value. |
Extractors
Extractors define how to get the value from a selected element.

- `text`: The inner text of the element (default).
- `html`: The outer HTML of the element.
- `href`: The value of the `href` attribute.
- `attribute`: The value of a specified attribute.
- `static`: A static value.
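
The following sketch shows one selector per extractor, assuming a page whose items contain an `h2`, a `.body` element, a link, and a `time` element with a `datetime` attribute. All CSS selectors and the static author value are made up for illustration.

```yaml
# Hypothetical selectors; each one demonstrates a different extractor.
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"           # no extractor given, so "text" is used
  description:
    selector: ".body"
    extractor: "html"        # outer HTML; pair with sanitize_html (see Post-Processors)
  url:
    selector: "a"
    extractor: "href"        # value of the href attribute
  published_at:
    selector: "time"
    extractor: "attribute"   # requires the attribute option
    attribute: "datetime"
  author:
    extractor: "static"      # requires the static option
    static: "Example Author"
```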
Post-Processors
Post-processors manipulate the extracted value.

- `gsub`: Performs a global substitution on a string.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Parses a string into a `URI` object.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values.
Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
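
A minimal sketch of attaching post-processors to selectors is shown below. It assumes that each entry in the `post_process` list names its processor with a `name` key, and that `gsub` takes `pattern` and `replacement` options; these key names are not stated on this page, so verify them against the post-processor reference before relying on them. The CSS selectors and the `"Breaking: "` prefix are hypothetical.

```yaml
# Assumed option names: name, pattern, replacement (check the reference docs).
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"            # strip a hypothetical prefix from every title
        pattern: "Breaking: "
        replacement: ""
  description:
    selector: ".body"
    extractor: "html"
    post_process:
      - name: "sanitize_html"   # sanitize any extracted HTML
```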
Advanced Usage
Categories
To add categories to an item, provide a list of selector names to the `categories` selector.

```yaml
selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
```

Custom GUID
To create a custom GUID for an item, provide a list of selector names to the `guid` selector.

```yaml
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
```

Enclosures
To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file.

```yaml
selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"
```

For detailed documentation on the Ruby API, see the official YARD documentation.