Auto Source

The auto_source scraper automatically finds items on a page, so you don’t have to specify CSS selectors.

To enable it, add auto_source: {} to your configuration:

channel:
  url: https://example.com
auto_source: {}

auto_source uses the following strategies to find content:

  1. schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
  2. semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
  3. html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
  4. json_state: Single-page applications often stash pre-rendered article data in <script type="application/json"> tags or global variables such as window.__NEXT_DATA__, window.__NUXT__, or window.STATE. The JSON-state scraper walks those blobs, finds arrays with title/url pairs, and converts them into the same hashes produced by HtmlExtractor.

json_state Limitations: the scraper requires discoverable arrays of hashes containing clear title and url fields. Minified or obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
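
The json_state idea of walking a blob for arrays with title/url pairs can be illustrated with a minimal Ruby sketch. It is not the library's implementation: the helper names find_item_arrays and item_like? are hypothetical, the sample blob is invented, and the sketch assumes the keys are literally "title" and "url".

```ruby
require "json"

# Hypothetical check: a candidate item is a hash with non-empty
# string values under "title" and "url".
def item_like?(value)
  value.is_a?(Hash) &&
    value["title"].is_a?(String) && !value["title"].empty? &&
    value["url"].is_a?(String) && !value["url"].empty?
end

# Hypothetical helper: recursively walks parsed JSON and collects every
# non-empty array whose elements all look like items.
def find_item_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |element| item_like?(element) }
      found << node
    else
      node.each { |child| find_item_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_item_arrays(child, found) }
  end
  found
end

# Invented sample resembling a state blob embedded in a page.
state = JSON.parse(<<~STATE)
  {
    "props": {
      "pageProps": {
        "articles": [
          { "title": "First post", "url": "/posts/1" },
          { "title": "Second post", "url": "/posts/2" }
        ],
        "tags": ["ruby", "rss"]
      }
    }
  }
STATE

# Normalize the matches into simple hashes.
items = find_item_arrays(state).flatten(1).map do |entry|
  { title: entry["title"], url: entry["url"] }
end
```

Here items holds the two article entries, while the tags array is skipped because its elements are plain strings. A blob without such an array yields no items, which matches the limitation described above.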

You can customize auto_source to improve its accuracy.

Enable or disable specific scrapers and adjust their settings:

auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    json_state:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
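
To build intuition for the two html settings, here is a rough Ruby sketch. It rests on assumptions that the text above does not confirm: that minimum_selector_frequency is the minimum number of times a selector must occur, and that use_top_selectors caps how many of the most frequent selectors are kept. The helper pick_selectors and the sample selectors are hypothetical.

```ruby
# Hypothetical helper, not the library's code: counts how often each
# selector occurs, drops rare ones, and keeps the most frequent few.
def pick_selectors(selectors, minimum_selector_frequency: 2, use_top_selectors: 5)
  selectors.tally
           .select { |_selector, count| count >= minimum_selector_frequency }
           .sort_by { |selector, count| [-count, selector] }
           .first(use_top_selectors)
           .map(&:first)
end

# Invented list of selectors observed for the links on a page.
observed = %w[
  div.card>a div.card>a div.card>a div.card>a
  li.teaser>a li.teaser>a li.teaser>a
  nav>a nav>a
  footer>a
]

with_defaults = pick_selectors(observed)
stricter = pick_selectors(observed, minimum_selector_frequency: 3, use_top_selectors: 3)
```

Under these assumptions, with_defaults keeps div.card>a, li.teaser>a, and nav>a, while stricter drops nav>a because it occurs only twice. Raising the frequency threshold trades recall for fewer false positives such as navigation links.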

Remove unwanted items from the results:

auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
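
The effect of the cleanup options can be sketched in Ruby as follows. This is an illustration under stated assumptions, not the library's implementation: the helper cleanup is hypothetical, "different domain" is approximated by comparing URL hosts, and title words are counted by splitting on whitespace.

```ruby
require "uri"

# Hypothetical helper: keeps items whose title has enough words and,
# unless keep_different_domain is true, whose URL is on the page's host.
def cleanup(items, page_url:, keep_different_domain: true, min_words_title: 3)
  page_host = URI.parse(page_url).host
  items.select do |item|
    enough_words = item[:title].to_s.split.size >= min_words_title
    # Resolve relative URLs against the page URL before comparing hosts.
    same_host = URI.join(page_url, item[:url]).host == page_host
    enough_words && (keep_different_domain || same_host)
  end
end

# Invented sample items.
items = [
  { title: "Ruby 3.4 released with new features", url: "/news/ruby-3-4" },
  { title: "Read more", url: "/news/teaser" },
  { title: "Partner story about feeds and readers", url: "https://other.example.org/story" }
]

lenient = cleanup(items, page_url: "https://example.com")
strict = cleanup(items, page_url: "https://example.com",
                        keep_different_domain: false, min_words_title: 4)
```

With the defaults, lenient drops only the two-word "Read more" item. With the settings from the configuration above, strict also drops the item hosted on another domain and keeps just the first one.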

For detailed documentation on the Ruby API, see the official YARD documentation.