Auto Source

The auto_source scraper automatically finds items on a page, so you don’t have to specify CSS selectors.

To enable it, add auto_source: {} to your configuration:

channel:
  url: https://example.com
auto_source: {}

How It Works

auto_source uses the following strategies to find content:

schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
json_state: Single-page applications often stash pre-rendered article data in <script type="application/json"> tags or global variables such as window.__NEXT_DATA__, window.__NUXT__, or window.STATE. The JSON-state scraper walks those blobs, finds arrays with title/url pairs, and converts them into the same hashes produced by HtmlExtractor.
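As an illustration of the schema strategy, here is a minimal Ruby sketch that pulls a JSON-LD block out of a page and maps it to a title/url hash. It is not the library's implementation: the sample HTML, the regular expression, and the choice of the headline and url properties are assumptions made for this example, and a real scraper would use an HTML parser rather than a regex.

```ruby
require "json"

# Invented sample page containing one JSON-LD block.
html = <<~HTML
  <html><head>
    <script type="application/ld+json">
      {"@context": "https://schema.org", "@type": "NewsArticle",
       "headline": "Example headline", "url": "https://example.com/news/1"}
    </script>
  </head><body></body></html>
HTML

# Assumption: a regex is good enough for this sketch.
ld_json_pattern = %r{<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>}m

items = html.scan(ld_json_pattern).flatten.filter_map do |raw|
  data = JSON.parse(raw)
  # Keep only objects whose @type ends in "Article" (Article, NewsArticle, ...).
  next unless data.is_a?(Hash) && data["@type"].to_s.end_with?("Article")

  { title: data["headline"], url: data["url"] }
rescue JSON::ParserError
  nil # skip blocks that are not valid JSON
end
```

Here items ends up holding a single hash with the headline as title and the article URL as url.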

json_state limitations: The scraper requires discoverable arrays of hashes with clear title and url fields. It ignores minified or obfuscated state objects, heavily encoded values, and blobs that require executing embedded functions.
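The walk that json_state performs can be pictured with the following Ruby sketch. It is a simplified illustration under stated assumptions, not the gem's code: find_article_arrays and article_like? are hypothetical helpers, and the sample blob is invented to resemble a Next.js-style state object.

```ruby
require "json"

# Hypothetical check: a hash counts as an article if it has non-empty
# string values under both "title" and "url".
def article_like?(hash)
  %w[title url].all? { |key| hash[key].is_a?(String) && !hash[key].strip.empty? }
end

# Hypothetical helper: recursively walk parsed JSON and collect every array
# whose elements are all article-like hashes.
def find_article_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |el| el.is_a?(Hash) && article_like?(el) }
      found << node
    else
      node.each { |child| find_article_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_article_arrays(child, found) }
  end
  found
end

# Invented sample state blob.
state = JSON.parse(<<~JSON)
  {
    "props": {
      "pageProps": {
        "posts": [
          { "title": "First post", "url": "/posts/1" },
          { "title": "Second post", "url": "/posts/2" }
        ],
        "tags": ["ruby", "rss"]
      }
    }
  }
JSON

articles = find_article_arrays(state).flatten(1).map do |item|
  { title: item["title"], url: item["url"] }
end
```

In this sample the posts array is picked up because every element has a title and a url, while the tags array is skipped because it holds plain strings. This also shows why obfuscated keys defeat the approach: without recognizable title and url fields, nothing matches.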

Fine-Tuning

You can customize auto_source to improve its accuracy.

Scraper Options

Enable or disable specific scrapers and adjust their settings:

auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    json_state:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
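
To make the two html options more concrete, the Ruby sketch below shows one plausible reading of them: count how often each selector path occurs, discard paths seen fewer than minimum_selector_frequency times, and keep the use_top_selectors most frequent ones. The top_selectors helper and the sample paths are hypothetical, and the gem's actual ranking logic may differ.

```ruby
# Invented input: one simplified selector path per candidate element.
paths = [
  "main > ul > li", "main > ul > li", "main > ul > li", "main > ul > li",
  "aside > div", "aside > div", "aside > div",
  "footer > p", "footer > p",
  "header > nav"
]

# Hypothetical helper illustrating the assumed meaning of both options.
def top_selectors(paths, minimum_selector_frequency:, use_top_selectors:)
  paths.tally                                                  # path => count
       .select { |_path, count| count >= minimum_selector_frequency }
       .sort_by { |_path, count| -count }                      # most frequent first
       .first(use_top_selectors)
       .map(&:first)                                           # keep only the paths
end

strict = top_selectors(paths, minimum_selector_frequency: 3, use_top_selectors: 3)
loose  = top_selectors(paths, minimum_selector_frequency: 2, use_top_selectors: 5)
```

With the stricter settings from the configuration above, only the two paths that occur at least three times survive. With the defaults, "footer > p" is included as well. Raising minimum_selector_frequency trades recall for less noise on pages with many small repeated blocks.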

Cleanup Options

Remove unwanted items from the results:

auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
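
The effect of these two options can be sketched in Ruby as a simple filter. This assumes that min_words_title drops items whose title has fewer words than the threshold and that keep_different_domain: false drops items whose host differs from the channel URL's host. The cleanup helper and the sample items are hypothetical, and the gem may handle details such as relative URLs differently.

```ruby
require "uri"

# Hypothetical helper mirroring the assumed semantics of the cleanup options.
def cleanup(items, channel_url:, keep_different_domain:, min_words_title:)
  channel_host = URI.parse(channel_url).host

  items.select do |item|
    # Drop items whose title has too few whitespace-separated words.
    next false if item[:title].to_s.split.size < min_words_title

    # Assumption: relative URLs have no host and count as same-domain.
    host = URI.parse(item[:url]).host
    keep_different_domain || host.nil? || host == channel_host
  end
end

# Invented sample items.
items = [
  { title: "A long enough article title", url: "https://example.com/a" },
  { title: "Too short", url: "https://example.com/b" },
  { title: "An article hosted somewhere else", url: "https://other.example.org/c" }
]

kept = cleanup(items, channel_url: "https://example.com",
                      keep_different_domain: false, min_words_title: 4)
```

Only the first item remains: the second has a two-word title, and the third points to a different host.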

For detailed documentation on the Ruby API, see the official YARD documentation.