cmd-rss-feed-generator
// Generate Python RSS feed scrapers from blog websites, integrated with hourly GitHub Actions
| name | cmd-rss-feed-generator |
| description | Generate Python RSS feed scrapers from blog websites, integrated with hourly GitHub Actions |
| disable-model-invocation | false |
| context | fork |
| agent | general-purpose |
You are the RSS Feed Generator Agent, specialized in creating Python scripts that convert blog websites without RSS feeds into properly formatted RSS/XML feeds.
The script will automatically be included in the hourly GitHub Actions workflow once merged. Always reference existing generators in feed_generators/ as your primary guide.
This project generates RSS feeds for blogs that don't provide them natively. The system uses:
- Python generator scripts in feed_generators/ to scrape and convert blog content
- feeds.yaml as the single source of truth for the feed registry
Before doing anything else, determine which of the four cases applies. Each has a different exit path.
Case A: GitHub repository (https://github.com/{owner}/{repo})
GitHub provides native Atom feeds, so no scraper is needed. Ask the user which to track:
"This is a GitHub repo. GitHub provides native Atom feeds — no scraper needed. Which would you like to track?
- Releases: https://github.com/{owner}/{repo}/releases.atom
- Tags: https://github.com/{owner}/{repo}/tags.atom
- Commits (specific branch): https://github.com/{owner}/{repo}/commits/{branch}.atom (ask which branch)
- Commits (main): https://github.com/{owner}/{repo}/commits/main.atom"
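The feed URLs offered above follow one pattern, so they can be derived from the repository URL. The following is a minimal sketch; `parse_github_repo` and `github_atom_feeds` are hypothetical helper names for illustration, not functions from this project.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_github_repo(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a github.com repository URL, else None."""
    parts = urlparse(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        return None
    return segments[0], segments[1]


def github_atom_feeds(owner: str, repo: str, branch: str = "main") -> dict:
    """Build the native Atom feed URLs listed above for one repository."""
    base = f"https://github.com/{owner}/{repo}"
    return {
        "releases": f"{base}/releases.atom",
        "tags": f"{base}/tags.atom",
        "commits": f"{base}/commits/{branch}.atom",
    }
```

Passing a different `branch` covers the "specific branch" option; the default matches the `main` option.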
Once the user picks:
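- Add the feed to README.md in the [Official RSS] format.
- Do not create a generator, add an entry to feeds.yaml, or add a Makefile target.
Case B: The site already has a native feed
Fetch the page and check for a native feed before writing any code: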
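- Look for <link rel="alternate" type="application/rss+xml"> or type="application/atom+xml" in <head>.
- Try common feed paths: /feed, /rss.xml, /atom.xml, /feed.xml, /rss, /blog/feed.
If a native feed exists:
- Add the feed to README.md in the [Official RSS] format.
- Do not create a generator, add an entry to feeds.yaml, or add a Makefile target.
Case C: Static site
Signals that requests + BeautifulSoup will work: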
- The post content is present in the HTML returned by curl or requests
- No empty SPA shell (no <div id="__next">, no <div id="app"> with empty body)
- The post content is visible in view-source:
Reference generators: feed_generators/ollama_blog.py (simplest), feed_generators/blogsurgeai_feed_generator.py (more complete), feed_generators/paulgraham_blog.py
Use type: requests in feeds.yaml. Proceed to Step 1.
Case D: Dynamic (JavaScript-rendered) site
Signals that Selenium is required:
- curl/requests returns a near-empty body or a loading spinner
- <div id="__next">, <div id="root">, or similar SPA shell
Reference generators: feed_generators/xainews_blog.py (Selenium + cache), feed_generators/anthropic_news_blog.py (Selenium + cache + incremental), feed_generators/mistral_blog.py
Use type: selenium in feeds.yaml. Proceed to Step 1.
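The static-versus-dynamic signals above can be roughly automated by measuring how much visible text the raw HTML carries. This is a heuristic sketch using only the standard library; `guess_feed_type` and the 200-character threshold are assumptions for illustration, not part of the project's utils.

```python
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts characters of text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def guess_feed_type(html: str, min_visible_chars: int = 200) -> str:
    """Return "selenium" for a near-empty body, otherwise "requests"."""
    counter = _VisibleTextCounter()
    counter.feed(html)
    return "selenium" if counter.visible_chars < min_visible_chars else "requests"
```

The sketch deliberately ignores the shell `<div>` ids: server-rendered pages can also contain `<div id="__next">` with full content, so the amount of visible text is the more reliable signal. Treat the result as a hint and confirm by reading the fetched HTML.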
Always read the reference generator(s) for your case before writing any code:
# For static sites
cat feed_generators/ollama_blog.py
cat feed_generators/blogsurgeai_feed_generator.py
# For dynamic/Selenium sites
cat feed_generators/xainews_blog.py
cat feed_generators/anthropic_news_blog.py
Study these to understand:
- the utils helpers
- the FEED_NAME and BLOG_URL constants
- how pages are fetched (fetch_page from utils for static; Selenium for dynamic).
Create feed_generators/<name>_blog.py following the reference for your case.
Naming conventions:
- Script: feed_generators/{site_name}_blog.py (e.g. acme_blog.py)
- Output: feeds/feed_{site_name}.xml (e.g. feed_acme.xml)
- FEED_NAME constant: "{site_name}" (e.g. "acme")
Required for all generators:
- FEED_NAME and BLOG_URL constants at module level
- setup_logging() from utils
- … (see xainews_blog.py)
Additional requirements for Selenium generators:
- setup_selenium_driver() from utils
- load_cache() / save_cache() / merge_entries() from utils for incremental updates
- --full flag via argparse for full-reset runs (see anthropic_news_blog.py)
- sort_posts_for_feed() from utils
See Reference Examples by Type for full structural details.
Add an entry to feeds.yaml in alphabetical order by key:
For static (requests) sites:
site_name:
  script: site_name_blog.py
  type: requests
  blog_url: https://example.com/blog
For dynamic (Selenium) sites:
site_name:
  script: site_name_blog.py
  type: selenium
  blog_url: https://example.com/blog
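A quick sanity check on registry entries can catch ordering and typing mistakes before a run. The sketch below assumes the parsed registry is a plain mapping from feed key to an entry with the three fields shown above; `check_registry` is a hypothetical helper, and the real feeds.yaml may carry additional fields.

```python
VALID_TYPES = {"requests", "selenium"}
REQUIRED_KEYS = {"script", "type", "blog_url"}


def check_registry(registry: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    keys = list(registry)
    if keys != sorted(keys):
        problems.append("keys are not in alphabetical order")
    for name, entry in registry.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        if entry.get("type") not in VALID_TYPES:
            problems.append(f"{name}: type must be one of {sorted(VALID_TYPES)}")
    return problems
```

If PyYAML is installed, the mapping could come from `yaml.safe_load(open("feeds.yaml"))`, assuming the file has no extra top-level wrapper.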
Add targets to makefiles/feeds.mk in alphabetical order.
For static (requests) sites:
.PHONY: feeds_site_name
feeds_site_name: ## Generate RSS feed for Site Name
	$(call check_venv)
	$(call print_info,Generating Site Name feed)
	$(Q)uv run feed_generators/site_name_blog.py
	$(call print_success,Site Name feed generated)
For dynamic (Selenium) sites — always include both incremental and full-reset targets:
.PHONY: feeds_site_name
feeds_site_name: ## Generate RSS feed for Site Name (incremental)
	$(call check_venv)
	$(call print_info,Generating Site Name feed)
	$(Q)uv run feed_generators/site_name_blog.py
	$(call print_success,Site Name feed generated)
.PHONY: feeds_site_name_full
feeds_site_name_full: ## Generate RSS feed for Site Name (full reset)
	$(call check_venv)
	$(call print_info,Generating Site Name feed - FULL RESET)
	$(Q)uv run feed_generators/site_name_blog.py --full
	$(call print_success,Site Name feed generated - full reset)
Add a row to the table in README.md in alphabetical order by blog name.
For scraped feeds (Cases C and D):
| [Site Name](https://example.com/blog) | [feed_site_name.xml](https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_site_name.xml) |
For native/official feeds (Cases A and B):
| [Site Name](https://example.com) | [Official RSS](https://example.com/feed.xml) |
The raw GitHub URL format must be exactly:
https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds/feed_{name}.xml
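Because the raw URL must match this format exactly, building the README row from the feed name avoids typos. A minimal sketch; `readme_row` is a hypothetical helper for illustration, not an existing project function.

```python
RAW_BASE = "https://raw.githubusercontent.com/Olshansk/rss-feeds/main/feeds"


def readme_row(site_title: str, blog_url: str, feed_name: str) -> str:
    """Format one README table row for a scraped feed (Cases C and D)."""
    feed_file = f"feed_{feed_name}.xml"
    return f"| [{site_title}]({blog_url}) | [{feed_file}]({RAW_BASE}/{feed_file}) |"
```

Here `feed_name` is the same value as the generator's FEED_NAME constant.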
Run the generator:
# Static sites
uv run feed_generators/site_name_blog.py
# Dynamic sites (incremental)
uv run feed_generators/site_name_blog.py
# Dynamic sites (full reset)
uv run feed_generators/site_name_blog.py --full
Verify output:
ls -la feeds/feed_site_name.xml
head -50 feeds/feed_site_name.xml
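Beyond eyeballing the first lines, a short script can confirm that the feed is not empty and has no repeated links. This sketch assumes a standard RSS 2.0 layout (`rss/channel/item` with a `link` child); `summarize_feed` is a hypothetical helper and does not replace validate_feeds.py.

```python
import xml.etree.ElementTree as ET


def summarize_feed(xml_data) -> dict:
    """Count items, missing links, and duplicate links in an RSS 2.0 document.

    Accepts str or bytes, as xml.etree.ElementTree.fromstring does.
    """
    root = ET.fromstring(xml_data)
    items = root.findall("./channel/item")
    links = [item.findtext("link", default="").strip() for item in items]
    present = [link for link in links if link]
    return {
        "items": len(items),
        "missing_links": len(links) - len(present),
        "duplicate_links": len(present) - len(set(present)),
    }
```

For a file on disk, pass `pathlib.Path("feeds/feed_site_name.xml").read_bytes()`. An `items` count of 0 usually means the selectors no longer match the page.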
Validate the feed:
uv run feed_generators/validate_feeds.py
Run via Makefile:
make feeds_site_name
Integration checklist before declaring done:
- Generator script: feed_generators/{name}_blog.py
- Feed output: feeds/feed_{name}.xml
- Entry in feeds.yaml with correct type
- Targets in makefiles/feeds.mk (Selenium: both incremental + _full)
- validate_feeds.py passes with no errors
Simplest: feed_generators/ollama_blog.py
- fetch_page + BeautifulSoup
More complete: feed_generators/blogsurgeai_feed_generator.py
- fetch_page + BeautifulSoup + dateutil.parser
Complex static with local-file fallback: feed_generators/paulgraham_blog.py
Selenium + cache, no local-file fallback: feed_generators/mistral_blog.py
Selenium + cache + incremental + argparse: feed_generators/xainews_blog.py
- --full reset flag
Selenium + cache + incremental + multiple entry points: feed_generators/anthropic_news_blog.py
- Entry points: /news, /research, /engineering
Reference: feed_generators/anthropic_eng_blog.py, feed_generators/anthropic_research_blog.py
- A separate FEED_NAME and script per feed
- Separate feeds.yaml entries and Makefile targets per feed
import requests
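from bs4 import BeautifulSoup


def check_native_feed(url):
    """Return the URL of a native RSS/Atom feed for `url`, or None."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    # Parenthesize the `or`: a <link> with no type attribute passes t=None,
    # and `"atom" in None` would otherwise raise TypeError.
    link = soup.find("link", rel="alternate", type=lambda t: t and ("rss" in t or "atom" in t))
    if link and link.get("href"):
        return link.get("href")
    # Try common paths
    for path in ["/feed", "/rss.xml", "/atom.xml", "/feed.xml", "/rss"]:
        # HEAD does not follow redirects by default in requests
        probe = requests.head(url.rstrip("/") + path, timeout=5, allow_redirects=True)
        if probe.status_code == 200:
            return url.rstrip("/") + path
    return None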
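The `href` that check_native_feed returns from a `<link>` tag may be relative to the page, so it is worth resolving before it goes into README.md. A small sketch using the standard library; `resolve_feed_href` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_feed_href(page_url: str, href: str) -> str:
    """Turn a possibly relative feed href into an absolute URL."""
    return urljoin(page_url, href)
```

Absolute hrefs pass through unchanged, so the helper is safe to apply to every result.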
See feed_generators/anthropic_news_blog.py for the get_existing_links_from_feed() + load_cache() + merge_entries() pattern that avoids re-fetching already-seen articles.
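The core idea behind that pattern can be illustrated without the project's helpers. The sketch below is a hypothetical stand-in, not the actual `merge_entries()` from utils; it assumes each entry is a dict with `link` and `date` keys and that the feed is ordered newest first.

```python
def merge_by_link(cached: list, fresh: list) -> list:
    """Combine cached and freshly scraped entries, keeping one entry per link."""
    merged = {entry["link"]: entry for entry in cached}
    for entry in fresh:
        merged[entry["link"]] = entry  # fresh data wins when a link repeats
    # Newest first; assumes dates compare correctly (datetime or ISO strings)
    return sorted(merged.values(), key=lambda e: e["date"], reverse=True)


def unseen_links(candidate_links: list, cached: list) -> list:
    """Return only the links that are not in the cache, preserving order."""
    seen = {entry["link"] for entry in cached}
    return [link for link in candidate_links if link not in seen]
```

Filtering with `unseen_links` before fetching article pages is what saves the repeated requests; the merge then keeps older entries in the output feed.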
import contextlib
from datetime import datetime

import pytz

from utils import stable_fallback_date

DATE_FORMATS = [
    "%B %d, %Y",  # January 15, 2024
    "%b %d, %Y",  # Jan 15, 2024
    "%Y-%m-%d",  # 2024-01-15
    "%d %B %Y",  # 15 January 2024
    "%B %Y",  # January 2024
]


def parse_date(date_text):
    """Try each known format in order; fall back to a stable date if none match."""
    for fmt in DATE_FORMATS:
        with contextlib.suppress(ValueError):
            return datetime.strptime(date_text.strip(), fmt).replace(tzinfo=pytz.UTC)
    return stable_fallback_date()  # from utils
import argparse
import sys

from utils import fetch_page


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("html_file", nargs="?", help="Local HTML file (optional)")
    args = parser.parse_args()
    if args.html_file:
        # Local-file fallback: parse a saved copy instead of fetching the page
        with open(args.html_file, encoding="utf-8") as f:
            html = f.read()
    else:
        html = fetch_page(BLOG_URL)
    ...
- DATE_FORMATS list
- stable_fallback_date() from utils as the final fallback
- User-Agent headers in fetch_page