Selectors
The `selectors` scraper gives you fine-grained control over content extraction using CSS selectors.
A valid RSS item requires at least a `title` or a `description`.
Basic Configuration
At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles.
```yaml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
```
Automatic Item Enhancement
To simplify configuration, html2rss can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default.
```yaml
selectors:
  items:
    selector: ".article"
    enhance: true # default: true
```
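If the automatic extraction picks up the wrong element, one option is to switch it off and define the fields yourself. The following is a minimal sketch under that assumption: setting `enhance: false` is inferred from the documented default of `true`, and the `h2` and `a` selectors are hypothetical placeholders for your page's markup.

```yaml
selectors:
  items:
    selector: ".article"
    enhance: false # assumption: opt out of automatic title/url/image extraction
  title:
    selector: "h2" # hypothetical: heading inside each .article
  url:
    selector: "a" # hypothetical: first link inside each .article
    extractor: "href"
```

With enhancement disabled, each item still needs at least a `title` or a `description` to be a valid RSS item.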
Item Ordering
You can control the order of items in your feed:
```yaml
selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order of items (newest first)
```
Available options:
- `"reverse"`: Reverses the order of items (useful when the website shows oldest items first)
- Default: Items appear in the order they are found on the page
RSS 2.0 Selectors
While you can define any named selector, only the following are used in the final RSS feed:
RSS 2.0 Tag | html2rss Name | Notes |
---|---|---|
title | title | |
description | description | |
link | url | |
author | author | |
category | categories | |
guid | guid | |
enclosure | enclosure | |
pubDate | published_at | |
comments | comments | ⚠️ Not currently implemented |
Selector Options
Each selector can be configured with the following options:
Name | Description |
---|---|
selector | The CSS selector for the target element. |
extractor | The extractor to use for this selector. |
attribute | The attribute name (required for attribute extractor). |
static | The static value (required for static extractor). |
post_process | A list of post-processors to apply to the value. |
Extractors
Extractors define how to get the value from a selected element.
- `text`: The inner text of the element (default).
- `html`: The outer HTML of the element.
- `href`: The value of the `href` attribute.
- `attribute`: The value of a specified attribute.
- `static`: A static value.
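To see the extractors side by side, here is a sketch that uses each of them once. It relies only on the options listed under Selector Options; the CSS selectors, the `data-genre` attribute, and the author name are hypothetical and need to be adapted to the target page.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2" # no extractor given, so the default (text) applies
  description:
    selector: ".summary" # hypothetical class
    extractor: "html" # keeps the element's outer HTML
  url:
    selector: "a"
    extractor: "href"
  genre:
    selector: ".article-meta" # hypothetical element carrying a data attribute
    extractor: "attribute"
    attribute: "data-genre" # required for the attribute extractor
  author:
    extractor: "static"
    static: "Example Author" # required for the static extractor
  categories:
    - genre
```

The custom `genre` selector does not appear in the feed on its own; it only takes effect because `categories` refers to it by name.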
Post-Processors
Post-processors manipulate the extracted value.
- `gsub`: Performs a global substitution on a string.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Parses a string into a `URI` object.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values.
Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
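As a rough sketch of that advice, the configuration below extracts a description as HTML and passes it through `sanitize_html`. This section only states that `post_process` takes a list of post-processors; the `name` key used for each list entry is an assumption about the entry format, and the `.content` selector is hypothetical. Check the post-processor documentation for the exact syntax and any per-processor options.

```yaml
selectors:
  items:
    selector: ".article"
  description:
    selector: ".content" # hypothetical class
    extractor: "html"
    post_process:
      - name: "sanitize_html" # assumption: entries are identified by a name key
```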
Advanced Usage
Categories
To add categories to an item, provide a list of selector names to the `categories` selector.
```yaml
selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
```
Custom GUID
To create a custom GUID for an item, provide a list of selector names to the `guid` selector.
```yaml
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
```
Enclosures
To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file.
```yaml
selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"
```
For detailed documentation on the Ruby API, see the official YARD documentation.