Advanced Features
This guide covers advanced features and performance optimizations for html2rss.
Parallel Processing
Section titled “Parallel Processing”html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn’t require any configuration.
How It Works
Section titled “How It Works”- Auto-source scraping: Multiple scrapers run in parallel to analyze the page
- Item processing: Each scraped item is processed in parallel
- Performance benefit: Significantly faster when dealing with many items
Performance Tips
Section titled “Performance Tips”- Use appropriate selectors: More specific selectors reduce processing time
- Limit items when possible: Use CSS selectors that target only the content you need
- Cache responses: The web application caches responses automatically
- Choose the right strategy: Use
faraday
for static content,browserless
only when JavaScript is required
Memory Optimization
Section titled “Memory Optimization”html2rss is designed to be memory-efficient:
- Frozen objects: Parsed content is frozen to prevent accidental modifications
- Efficient data structures: Uses
Set
instead ofArray
for lookups - Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
Large Feed Handling
Section titled “Large Feed Handling”For websites with many items:
# Use specific selectors to limit itemsselectors: items: selector: ".article:not(.advertisement)" # Exclude ads title: selector: "h2" # More specific than generic selectors
Error Recovery
Section titled “Error Recovery”html2rss includes built-in error handling:
- Graceful degradation: If one scraper fails, others continue
- Detailed logging: Set
LOG_LEVEL=debug
for detailed information - Validation: Configuration is validated before processing
Custom Headers for Performance
Section titled “Custom Headers for Performance”Optimize requests with appropriate headers:
headers: Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed Accept-Encoding: "gzip, deflate" # Enable compression
Monitoring and Debugging
Section titled “Monitoring and Debugging”Enable Debug Logging
Section titled “Enable Debug Logging”LOG_LEVEL=debug html2rss feed config.yml
Web Application Health Checks
Section titled “Web Application Health Checks”Use the health check endpoint to monitor feed generation:
curl -u username:password http://localhost:3000/health_check.txt
Article Validation
Section titled “Article Validation”html2rss includes built-in validation for articles to ensure feed quality:
Validation Rules
Section titled “Validation Rules”Articles are considered valid if they have:
- A non-empty URL
- Either a title OR description (or both)
- A unique ID
Invalid Articles
Section titled “Invalid Articles”Invalid articles are automatically filtered out to prevent empty or broken feed items.
Custom Validation
Section titled “Custom Validation”You can add custom validation by using post-processors:
selectors: title: selector: "h2" post_process: - name: "gsub" pattern: "^\\s*$" replacement: "Untitled"
Best Practices
Section titled “Best Practices”- Test configurations: Always test your configurations before deploying
- Monitor performance: Use health checks to detect issues early
- Keep selectors simple: Complex selectors are harder to maintain
- Use auto-source when possible: It’s often more reliable than manual selectors
- Handle errors gracefully: Implement proper error handling in your applications
- Validate your data: Ensure your selectors return valid content