Selectors
The selectors scraper gives you fine-grained control over content extraction using CSS selectors.
A valid RSS item requires at least a title or a description.
Basic Configuration
At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
Automatic Item Enhancement
To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.
selectors:
  items:
    selector: ".article"
    enhance: true # default: true
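If the automatically extracted values are not what you want, the fields can be defined by hand. The following is a minimal sketch, assuming that setting enhance to false switches the feature off; the `.article`, `h2`, and `a` selectors are placeholders for a hypothetical page.

```yaml
selectors:
  items:
    selector: ".article"
    enhance: false     # assumption: false disables automatic extraction
  title:
    selector: "h2"     # hypothetical markup
  url:
    selector: "a"      # hypothetical markup
    extractor: "href"  # take the link target instead of the link text
```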
RSS 2.0 Selectors
While you can define any named selector, only the following are used in the final RSS feed:
RSS 2.0 Tag | html2rss Name |
---|---|
title | title |
description | description |
link | url |
author | author |
category | categories |
guid | guid |
enclosure | enclosure |
pubDate | published_at |
comments | comments |
Selector Options
Each selector can be configured with the following options:
Name | Description |
---|---|
selector | The CSS selector for the target element. |
extractor | The extractor to use for this selector. |
post_process | A list of post-processors to apply to the value. |
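The three options can be combined on a single selector. The sketch below is illustrative only: the `.content` class is hypothetical, and it assumes each entry in the post_process list names its post-processor through a `name` key.

```yaml
selectors:
  description:
    selector: ".content"       # hypothetical class name
    extractor: "html"          # keep the markup instead of plain text
    post_process:
      - name: "sanitize_html"  # assumption: entries are identified by `name`
```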
Extractors
Extractors define how to get the value from a selected element.
- text: The inner text of the element (default).
- html: The outer HTML of the element.
- href: The value of the href attribute.
- attribute: The value of a specified attribute.
- static: A static value.
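As a rough illustration of the non-default extractors, the fragment below reads a link target, an arbitrary attribute, and a fixed value. The `time` element and the author name are hypothetical, and the `static` option key is an assumption modeled on how the attribute extractor takes its `attribute` option.

```yaml
selectors:
  url:
    selector: "a"
    extractor: "href"          # value of the href attribute
  published_at:
    selector: "time"           # hypothetical element
    extractor: "attribute"
    attribute: "datetime"      # which attribute to read
  author:
    extractor: "static"
    static: "Editorial Team"   # assumption: option key is `static`
```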
Post-Processors
Post-processors manipulate the extracted value.
- gsub: Performs a global substitution on a string.
- html_to_markdown: Converts HTML to Markdown.
- markdown_to_html: Converts Markdown to HTML.
- parse_time: Parses a string into a Time object.
- parse_uri: Parses a string into a URI object.
- sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
- substring: Extracts a substring from a string.
- template: Creates a new string from a template and other selector values.
Always use the sanitize_html post-processor for any HTML content to prevent security risks.
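A sketch of how post-processors might be attached to selectors follows. The option names `pattern`, `replacement`, `start`, and `end` are assumptions not stated above, as are the sample selectors and values; check them against the version of html2rss you are running.

```yaml
selectors:
  title:
    selector: "h1"
    post_process:
      - name: "gsub"
        pattern: "Old Name"      # assumption: text to search for
        replacement: "New Name"  # assumption: text to substitute
  published_at:
    selector: "time"             # hypothetical element
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time"       # turn the attribute string into a Time
  description:
    selector: ".content"         # hypothetical class name
    extractor: "html"
    post_process:
      - name: "sanitize_html"    # clean the HTML first
      - name: "substring"
        start: 0                 # assumption: character offsets
        end: 500
```

Post-processors listed under one selector are expected to run in order, so sanitizing comes before any step that trims or rewrites the value.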
Advanced Usage
Categories
To add categories to an item, provide a list of selector names to the categories selector.
selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
Custom GUID
To create a custom GUID for an item, provide a list of selector names to the guid selector.
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
Enclosures
To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.
selectors:
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"