Strategy

The strategy key defines how html2rss fetches a website’s content.

  • auto (default): Tries concrete strategies in order: faraday -> botasaurus -> browserless.
  • faraday: Makes a direct HTTP request. It is fast but does not execute JavaScript.
  • browserless: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
  • botasaurus: Delegates fetching to a Botasaurus scrape API. This is opt-in and requires BOTASAURUS_SCRAPER_URL.

strategy is a top-level config key. Request-specific controls live under request.

auto falls back to the next strategy when the current attempt raises an error or extracts zero items. Pass an explicit --strategy ... only when you need to force a specific transport, for example for troubleshooting or reproducibility.

The default strategy chain is:

faraday -> botasaurus -> browserless
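The fallback rule can be pictured as a simple loop over the chain. The following is a minimal sketch in Ruby, not html2rss's actual implementation: `fetch_with_fallback`, the `fetchers` hash, and the lambdas are hypothetical names used only to illustrate "move on when an attempt errors or yields zero items".

```ruby
# Illustrative only: a hypothetical fallback loop, not html2rss internals.
CHAIN = %i[faraday botasaurus browserless].freeze

# fetchers: Hash mapping a strategy name to a callable that returns an Array of items.
# Returns [strategy_name, items] for the first strategy that yields at least one item.
def fetch_with_fallback(fetchers, chain: CHAIN)
  failures = {}

  chain.each do |name|
    fetcher = fetchers[name]
    next unless fetcher # strategy not available in this sketch

    begin
      items = fetcher.call
      return [name, items] unless items.empty?

      failures[name] = 'extracted zero items'
    rescue StandardError => e
      failures[name] = e.message
    end
  end

  raise "all strategies failed: #{failures}"
end

# Usage with stand-in fetchers: the static request finds nothing,
# so the loop moves on to the rendered page.
fetchers = {
  faraday: -> { [] },
  browserless: -> { ['Rendered article'] }
}
strategy, items = fetch_with_fallback(fetchers)
puts "#{strategy}: #{items.inspect}"
```

Collecting the failure reason per strategy keeps the final error message useful when every attempt fails.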

To use the browserless strategy, you need a running instance of Browserless.io.

You can run a local Browserless.io instance using Docker:

docker run --rm -p 3000:3000 -e "CONCURRENT=10" -e "TOKEN=6R0W53R135510" ghcr.io/browserless/chromium

Set the strategy at the top level of your feed configuration and put request controls under request:

strategy: browserless
request:
  max_redirects: 5
  max_requests: 6
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"

Use this split consistently:

  • strategy: selects auto, faraday, browserless, or botasaurus
  • headers: top-level headers shared by all strategies
  • request.max_redirects: redirect limit for the request session
  • request.max_requests: total request budget for the whole feed build
  • request.browserless.*: Browserless-only options
  • request.botasaurus.*: Botasaurus-only options

Example:

strategy: browserless
headers:
  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:
  max_redirects: 5
  max_requests: 6
  browserless:
    preload:
      wait_after_ms: 5000
channel:
  url: "https://example.com/app"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"
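A quick way to catch a misplaced key is to lint the parsed config before building the feed. The sketch below is a hypothetical helper, not part of html2rss: `misplaced_request_keys` and `REQUEST_KEYS` are assumed names, and the key list simply mirrors the split described above.

```ruby
require 'yaml'

# Hypothetical lint helper, not part of html2rss.
# Keys that, per the split above, belong under `request` rather than at the top level.
REQUEST_KEYS = %w[max_redirects max_requests browserless botasaurus].freeze

# Returns the request-level keys that were placed at the top level by mistake.
def misplaced_request_keys(config)
  config.keys.map(&:to_s) & REQUEST_KEYS
end

config = YAML.safe_load(<<~YAML)
  strategy: browserless
  max_requests: 6
  request:
    max_redirects: 5
  channel:
    url: "https://example.com/app"
YAML

puts misplaced_request_keys(config).inspect
```

Here `max_requests` is reported because it sits next to `strategy` instead of under `request`.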

Browserless can interact with the page before html2rss captures the final HTML. Configure preload steps under request.browserless.preload.

strategy: browserless
request:
  browserless:
    preload:
      wait_after_ms: 5000
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          wait_after_ms: 250
      scroll_down:
        iterations: 5
        wait_after_ms: 200

  • wait_after_ms: inserts a fixed wait before or after preload steps
  • click_selectors: clicks matching elements until they disappear or max_clicks is reached
  • scroll_down: scrolls until the page height stops growing or iterations is reached
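Both click_selectors and scroll_down have two stop conditions: the page stops changing, or the configured limit is reached. The following Ruby sketch simulates those stop conditions without a real browser. The method names and the way the page is modelled (a counter and a list of heights) are assumptions made for illustration, not Browserless or html2rss behavior.

```ruby
# Hypothetical simulation of the preload stop conditions; no browser is involved.

# visible_for: how many clicks the element survives before it disappears.
# Returns the number of clicks performed.
def click_until_gone(visible_for, max_clicks:)
  clicks = 0
  while visible_for.positive? && clicks < max_clicks
    clicks += 1
    visible_for -= 1
  end
  clicks
end

# heights: the initial page height followed by the height observed after each scroll.
# Returns the number of scrolls performed.
def scroll_until_stable(heights, iterations:)
  current = heights.first
  scrolls = 0

  heights.drop(1).first(iterations).each do |height|
    scrolls += 1
    break if height <= current # page height stopped growing

    current = height
  end

  scrolls
end

puts click_until_gone(10, max_clicks: 3)                            # limit reached first
puts click_until_gone(2, max_clicks: 3)                             # element disappeared first
puts scroll_until_stable([1000, 2000, 3000, 3000], iterations: 5)   # height stopped growing
```

The limits matter on pages with endless "load more" buttons or infinite scroll, where the first condition may never become true.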

If preload triggers a real navigation or redirect, html2rss keeps the final document metadata. Relative links and follow-up pagination therefore resolve against the page that was actually rendered after preload completed.
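To see why the final document matters, compare how the same relative href resolves against two different base URLs. This is a small sketch using Ruby's standard `URI.join`. Both URLs are made up for the example, and `absolute_link` is a hypothetical helper rather than an html2rss method.

```ruby
require 'uri'

# Hypothetical URLs: the configured channel URL and the page reached after preload.
configured_url = 'https://example.com/app'
final_url = 'https://example.com/news/latest'

# Resolves an extracted href against the given base URL.
def absolute_link(base_url, href)
  URI.join(base_url, href).to_s
end

puts absolute_link(final_url, 'article-1')      # https://example.com/news/article-1
puts absolute_link(configured_url, 'article-1') # https://example.com/article-1
```

Resolving against the configured URL would produce a link that does not match the page that was actually rendered.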

You can also specify the strategy on the command line:

# Set environment variables for your Browserless.io instance
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" html2rss feed my_config.yml --strategy browserless

html2rss feed my_config.yml --max-redirects 5 --max-requests 6

html2rss feed my_config.yml

If Browserless cannot connect, html2rss surfaces a Browserless connection failed (...) error with endpoint/token hints.

Check these first:

  • BROWSERLESS_IO_WEBSOCKET_URL is reachable from where html2rss runs
  • BROWSERLESS_IO_API_TOKEN matches your Browserless TOKEN
  • your Browserless service is running and accepting connections

For custom Browserless websocket endpoints, BROWSERLESS_IO_API_TOKEN is mandatory. The local default endpoint (ws://127.0.0.1:3000) can use the default local token 6R0W53R135510.
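The token rule can be expressed as a small preflight check. The sketch below is an assumption about how such a check could look, not html2rss's own code: `browserless_settings` is a hypothetical helper, and it takes a plain Hash so it can be tried without touching the real environment (pass `ENV.to_h` to check your shell).

```ruby
# Hypothetical preflight helper, not part of html2rss.
DEFAULT_ENDPOINT = 'ws://127.0.0.1:3000'
DEFAULT_TOKEN = '6R0W53R135510'

# env: a Hash of environment variables.
# Returns the endpoint and token to use, or raises when a custom endpoint has no token.
def browserless_settings(env)
  endpoint = env.fetch('BROWSERLESS_IO_WEBSOCKET_URL', DEFAULT_ENDPOINT)
  token = env['BROWSERLESS_IO_API_TOKEN']

  if token.nil? || token.empty?
    unless endpoint == DEFAULT_ENDPOINT
      raise ArgumentError, 'BROWSERLESS_IO_API_TOKEN is required for custom endpoints'
    end

    token = DEFAULT_TOKEN # the local default endpoint may use the default local token
  end

  { endpoint: endpoint, token: token }
end

puts browserless_settings({}).inspect
puts browserless_settings(
  'BROWSERLESS_IO_WEBSOCKET_URL' => 'wss://browserless.example.com',
  'BROWSERLESS_IO_API_TOKEN' => 'secret'
).inspect
```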

botasaurus delegates page fetching to a Botasaurus scrape API endpoint. This strategy is opt-in and requires:

  • strategy: botasaurus
  • BOTASAURUS_SCRAPER_URL set to your Botasaurus scrape API base URL (for example http://localhost:4010)

strategy: botasaurus
request:
  max_redirects: 5
  max_requests: 6
  botasaurus:
    navigation_mode: auto
    max_retries: 2
    headless: false
channel:
  url: "https://example.com/protected-listing"
auto_source: {}

Supported request.botasaurus options:

  • navigation_mode (auto, get, google_get, google_get_bypass)
  • max_retries (0..3)
  • wait_for_selector
  • wait_timeout_seconds
  • block_images
  • block_images_and_css
  • wait_for_complete_page_load
  • headless
  • proxy
  • user_agent
  • window_size (two integers, for example [1920, 1080])
  • lang

BOTASAURUS_SCRAPER_URL="http://localhost:4010" html2rss auto https://example.com/updates --strategy botasaurus

html2rss feed my_config.yml --strategy botasaurus
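The value constraints listed for request.botasaurus (allowed navigation_mode values, max_retries between 0 and 3, window_size as two integers) are easy to check before a run. The following is a minimal sketch of such a check. `botasaurus_option_errors` is a hypothetical helper, and html2rss's own validation may differ in rules and messages.

```ruby
# Hypothetical validation helper, not part of html2rss.
NAVIGATION_MODES = %w[auto get google_get google_get_bypass].freeze

# options: the Hash found under request.botasaurus.
# Returns an Array of human-readable problems (empty when the checked options look valid).
def botasaurus_option_errors(options)
  errors = []

  mode = options['navigation_mode']
  unless mode.nil? || NAVIGATION_MODES.include?(mode)
    errors << "navigation_mode must be one of: #{NAVIGATION_MODES.join(', ')}"
  end

  retries = options['max_retries']
  unless retries.nil? || (retries.is_a?(Integer) && (0..3).cover?(retries))
    errors << 'max_retries must be an integer from 0 to 3'
  end

  size = options['window_size']
  unless size.nil? || (size.is_a?(Array) && size.length == 2 && size.all?(Integer))
    errors << 'window_size must be two integers, for example [1920, 1080]'
  end

  errors
end

puts botasaurus_option_errors('navigation_mode' => 'auto', 'max_retries' => 2).inspect
puts botasaurus_option_errors('max_retries' => 5, 'window_size' => [1920]).inspect
```

Only the three options with documented value constraints are checked here; the remaining options are passed through untouched.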

For detailed documentation on the Ruby API, see the official YARD documentation.