Auto Source
The auto_source scraper automatically finds items on a page, so you don’t have to specify CSS selectors.
To enable it, add auto_source: {} to your configuration:
channel:
  url: https://example.com
auto_source: {}
How It Works
auto_source uses the following strategies to find content:
- schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
- semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
- html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
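To make the schema strategy concrete, here is a minimal Python sketch of pulling structured data out of a page. It is an illustration only, not the scraper's actual implementation: the class name JsonLdExtractor and the function extract_json_ld are hypothetical, and the sketch assumes the standard JSON-LD media type, application/ld+json.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects parsed JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only script tags carrying the JSON-LD media type are of interest.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def extract_json_ld(html):
    """Return every successfully parsed JSON-LD object found in the HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items
```

Each returned object can then be inspected for fields such as a headline or URL. A real scraper would likely also handle nested @graph arrays, which this sketch leaves out.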
Fine-Tuning
You can customize auto_source to improve its accuracy.
Scraper Options
Enable or disable specific scrapers and adjust their settings:
auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
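The two html settings are easier to reason about with a small model of frequency-based selection. The Python sketch below is an assumption about what the option names imply, not a description of the scraper's internals: it treats minimum_selector_frequency as the minimum number of occurrences a selector needs and use_top_selectors as the number of most frequent selectors to keep. The function pick_selectors is hypothetical.

```python
from collections import Counter


def pick_selectors(selectors, minimum_selector_frequency=2, use_top_selectors=5):
    """Hypothetical model: keep the most frequent selectors that occur often enough."""
    counts = Counter(selectors)
    # most_common() sorts by count, highest first.
    frequent = [
        selector
        for selector, count in counts.most_common()
        if count >= minimum_selector_frequency
    ]
    return frequent[:use_top_selectors]


observed = ["div.post", "div.post", "div.post", "li.nav", "li.nav", "footer a"]
print(pick_selectors(observed))
print(pick_selectors(observed, minimum_selector_frequency=3, use_top_selectors=3))
```

Under this reading, raising minimum_selector_frequency filters out selectors that repeat only a few times (such as short navigation lists), while lowering use_top_selectors limits how many candidate selectors are considered.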
Cleanup Options
Remove unwanted items from the results:
auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
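As a rough illustration of what these cleanup options filter, the Python sketch below applies both rules to a list of items. It assumes that keep_different_domain: false drops items whose URL points to a host other than the channel URL's, and that min_words_title drops items whose title has fewer words than the threshold. The function cleanup and the item shape (a dict with title and url keys) are hypothetical.

```python
from urllib.parse import urlparse


def cleanup(items, page_url, keep_different_domain=True, min_words_title=3):
    """Hypothetical model of the cleanup step described above."""
    page_host = urlparse(page_url).hostname
    kept = []
    for item in items:
        # Drop items whose title is too short to be a meaningful headline.
        if len(item["title"].split()) < min_words_title:
            continue
        # Optionally drop items that link away from the page's own host.
        if not keep_different_domain and urlparse(item["url"]).hostname != page_host:
            continue
        kept.append(item)
    return kept


items = [
    {"title": "A long enough headline here", "url": "https://example.com/a"},
    {"title": "Too short", "url": "https://example.com/b"},
    {"title": "Another long external headline", "url": "https://other.example.org/c"},
]
print(cleanup(items, "https://example.com", keep_different_domain=False, min_words_title=4))
```

With the settings from the configuration above, only the first item survives in this model: the second title is too short and the third links to a different host.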