Auto Source

The auto_source scraper automatically finds items on a page, so you don’t have to specify CSS selectors.

To enable it, add auto_source: {} to your configuration:

channel:
  url: https://example.com
auto_source: {}

auto_source uses the following strategies to find content:

  1. schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
  2. semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
  3. html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
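To make the first strategy concrete, here is a minimal sketch of pulling structured data out of JSON-LD script tags. It is not the library's implementation: extract_json_ld and JSON_LD_PATTERN are hypothetical names, and a regular expression stands in for a real HTML parser so the example needs only Ruby's standard library.

```ruby
require 'json'

# Hypothetical pattern and helper for illustration only; not part of the library.
# A regex keeps the sketch dependency-free. A real scraper would use an HTML parser.
JSON_LD_PATTERN = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi

# Returns every JSON-LD object found in the given HTML string.
def extract_json_ld(html)
  html.scan(JSON_LD_PATTERN).flat_map do |(body)|
    begin
      parsed = JSON.parse(body)
      parsed.is_a?(Array) ? parsed : [parsed]
    rescue JSON::ParserError
      [] # skip malformed blocks instead of failing the whole page
    end
  end
end

html = <<~HTML
  <html><head>
  <script type="application/ld+json">
  {"@type": "NewsArticle", "headline": "Example headline", "url": "https://example.com/a"}
  </script>
  </head></html>
HTML

items = extract_json_ld(html)
puts items.first['headline'] # prints "Example headline"
```

Each parsed object carries fields such as headline and url, which is why structured data is the most reliable source when a page provides it.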

You can customize auto_source to improve its accuracy.

Enable or disable specific scrapers and adjust their settings:

auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
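One plausible reading of the two html settings, sketched in plain Ruby: count how often each selector occurs, drop those seen fewer than minimum_selector_frequency times, then keep the use_top_selectors most frequent ones. This interpretation is an assumption based on the option names, and top_selectors is a hypothetical helper, not a library method.

```ruby
# Hypothetical helper for illustration only; the library's own selection
# logic may differ. Requires Ruby 2.7+ for Enumerable#tally.
def top_selectors(selector_paths, minimum_selector_frequency: 2, use_top_selectors: 5)
  selector_paths
    .tally                                                     # selector => occurrence count
    .select { |_selector, count| count >= minimum_selector_frequency }
    .sort_by { |_selector, count| -count }                     # most frequent first
    .first(use_top_selectors)
    .map(&:first)
end

# Selectors as they might be collected while walking a page.
paths = %w[
  article.post article.post article.post article.post
  li.nav-item li.nav-item li.nav-item
  div.ad
]

selected = top_selectors(paths, minimum_selector_frequency: 3, use_top_selectors: 3)
p selected # prints ["article.post", "li.nav-item"]
```

Under this reading, raising minimum_selector_frequency filters out one-off elements, while lowering use_top_selectors narrows the result to the most repeated structures on the page.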

Remove unwanted items from the results:

auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
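The effect of the cleanup options can be sketched as a simple filter. The Item struct and the cleanup method below are hypothetical, and the exact rules (for example, whether subdomains count as the same domain) are assumptions; the keyword defaults mirror the defaults noted in the comments above.

```ruby
require 'uri'

# Hypothetical item shape for illustration; the library's internal
# representation may differ.
Item = Struct.new(:title, :url)

# Keeps items whose title has enough words and, unless
# keep_different_domain is true, whose URL host matches the page's host.
def cleanup(items, page_url:, keep_different_domain: true, min_words_title: 3)
  page_host = URI.parse(page_url).host
  items.select do |item|
    enough_words = item.title.to_s.split.size >= min_words_title
    same_domain = URI.parse(item.url).host == page_host
    enough_words && (keep_different_domain || same_domain)
  end
end

items = [
  Item.new('A long enough article title', 'https://example.com/a'),
  Item.new('Too short', 'https://example.com/b'),
  Item.new('An article hosted somewhere else', 'https://other.example.org/c')
]

kept = cleanup(items, page_url: 'https://example.com',
                      keep_different_domain: false, min_words_title: 4)
p kept.map(&:title) # prints ["A long enough article title"]
```

With the settings from the configuration above, the second item is dropped for its short title and the third for pointing at a different host.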

For detailed documentation on the Ruby API, see the official YARD documentation.