Profiler (Sitemap Scanner)
Profiler is the first module in the Shopkeeper scanning pipeline.
Responsibility
To extract a complete list of individual page URLs from a given sitemap URL.
Implementation
File: shopkeeper/internal/profiler/profiler.go
Entry point: Discover(sitemapURL string) ([]string, error)
Features
- Recursive Processing: Handles both <urlset> (direct URLs) and <sitemapindex> (nested sitemaps). When a <sitemapindex> is found, each nested sitemap is fetched and parsed recursively.
- Max Depth: Recursion is limited to 10 levels to prevent infinite loops.
- De-duplication: Uses a map[string]bool to ensure URLs are unique before returning.
- Validation: Each URL is validated using url.ParseRequestURI; only http:// and https:// schemes are accepted.
- Fetch Retries: Sitemap requests retry transient fetch failures before giving up.
- Complete Discovery Requirement: Shopkeeper does not continue into scanning with a partial sitemap index. If any nested sitemap still fails after retries, discovery returns an error and the scan fails instead of silently scanning an incomplete URL set.
- Size Limit: Response bodies are limited to 50MB to prevent memory exhaustion.
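The features above can be sketched together as a small recursive walk. This is a minimal illustration, not the code in profiler.go: the names discover, walk, fetchFunc, mapFetcher, entry, and document are hypothetical, the fetch is abstracted so the example runs offline, and two behaviors are assumptions (exceeding the depth limit returns an error, and invalid URLs are skipped rather than failing discovery).

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

const maxDepth = 10 // recursion limit stated in the feature list

// entry and document are hypothetical; one struct decodes either root element.
type entry struct {
	Loc string `xml:"loc"`
}

type document struct {
	Sitemaps []entry `xml:"sitemap"` // populated for <sitemapindex>
	URLs     []entry `xml:"url"`     // populated for <urlset>
}

// fetchFunc is an assumed abstraction over the HTTP fetch.
type fetchFunc func(sitemapURL string) ([]byte, error)

// mapFetcher serves sitemap bodies from memory so the sketch runs offline.
func mapFetcher(pages map[string]string) fetchFunc {
	return func(sitemapURL string) ([]byte, error) {
		body, ok := pages[sitemapURL]
		if !ok {
			return nil, errors.New("not found")
		}
		return []byte(body), nil
	}
}

// validURL accepts only absolute http:// and https:// URLs.
func validURL(raw string) bool {
	u, err := url.ParseRequestURI(raw)
	return err == nil && (u.Scheme == "http" || u.Scheme == "https")
}

// discover returns unique, validated page URLs, or an error if any
// sitemap in the tree cannot be fetched or parsed.
func discover(sitemapURL string, fetch fetchFunc) ([]string, error) {
	seen := make(map[string]bool)
	var out []string
	if err := walk(sitemapURL, fetch, 0, seen, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func walk(sitemapURL string, fetch fetchFunc, depth int, seen map[string]bool, out *[]string) error {
	if depth >= maxDepth {
		// Assumption: hitting the limit is reported as an error.
		return fmt.Errorf("sitemap nesting exceeds %d levels at %s", maxDepth, sitemapURL)
	}
	body, err := fetch(sitemapURL)
	if err != nil {
		return fmt.Errorf("fetch %s: %w", sitemapURL, err)
	}
	var doc document
	if err := xml.Unmarshal(body, &doc); err != nil {
		return fmt.Errorf("parse %s: %w", sitemapURL, err)
	}
	for _, s := range doc.Sitemaps {
		// Any nested failure aborts discovery instead of yielding a partial set.
		if err := walk(strings.TrimSpace(s.Loc), fetch, depth+1, seen, out); err != nil {
			return err
		}
	}
	for _, u := range doc.URLs {
		loc := strings.TrimSpace(u.Loc)
		if validURL(loc) && !seen[loc] {
			seen[loc] = true
			*out = append(*out, loc)
		}
	}
	return nil
}

func main() {
	pages := map[string]string{
		"https://example.com/index.xml": `<sitemapindex><sitemap><loc>https://example.com/a.xml</loc></sitemap></sitemapindex>`,
		"https://example.com/a.xml":     `<urlset><url><loc>https://example.com/1</loc></url><url><loc>https://example.com/1</loc></url></urlset>`,
	}
	urls, err := discover("https://example.com/index.xml", mapFetcher(pages))
	fmt.Println(urls, err)
}
```

Passing the fetch in as a function keeps the traversal logic separate from HTTP concerns, which also makes the depth limit and the fail-on-partial behavior easy to exercise with in-memory fixtures.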
XML Structures
type sitemapIndex struct {
Sitemaps []sitemapEntry `xml:"sitemap"`
}
type urlSet struct {
URLs []urlEntry `xml:"url"`
}
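The entry types referenced by these structs are not shown here. As a rough sketch of how the two structures might be used, the example below assumes sitemapEntry and urlEntry each carry a single Loc field mapped to the <loc> element, and uses a hypothetical helper, parseLocs, that inspects the root element before decoding.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// Assumed shapes: the real entry types may carry more fields.
type sitemapEntry struct {
	Loc string `xml:"loc"`
}

type urlEntry struct {
	Loc string `xml:"loc"`
}

type sitemapIndex struct {
	Sitemaps []sitemapEntry `xml:"sitemap"`
}

type urlSet struct {
	URLs []urlEntry `xml:"url"`
}

// rootElement returns the local name of the first start element.
func rootElement(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err
		}
		if start, ok := tok.(xml.StartElement); ok {
			return start.Name.Local, nil
		}
	}
}

// parseLocs (hypothetical) reports whether the document is a sitemap index
// and returns the <loc> values it contains.
func parseLocs(data []byte) (isIndex bool, locs []string, err error) {
	root, err := rootElement(data)
	if err != nil {
		return false, nil, err
	}
	switch root {
	case "sitemapindex":
		var idx sitemapIndex
		if err := xml.Unmarshal(data, &idx); err != nil {
			return false, nil, err
		}
		for _, s := range idx.Sitemaps {
			locs = append(locs, s.Loc)
		}
		return true, locs, nil
	case "urlset":
		var set urlSet
		if err := xml.Unmarshal(data, &set); err != nil {
			return false, nil, err
		}
		for _, u := range set.URLs {
			locs = append(locs, u.Loc)
		}
		return false, locs, nil
	default:
		return false, nil, fmt.Errorf("unexpected root element <%s>", root)
	}
}

func main() {
	data := []byte(`<urlset><url><loc>https://example.com/1</loc></url></urlset>`)
	isIndex, locs, err := parseLocs(data)
	fmt.Println(isIndex, locs, err)
}
```

Checking the root element first makes an unexpected document (for example an HTML error page) fail loudly instead of decoding into an empty struct.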
HTTP Client
- Timeout: 30 seconds per request
- Retries transient sitemap fetch failures up to 3 times with short backoff
- Uses a dedicated http.Client configuration for sitemap requests
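One way the client settings above could fit together is sketched below. It is illustrative only: retry, fetchOnce, fetchSitemap, errTransient, and the 500ms base backoff are hypothetical, "up to 3 times" is read as three attempts in total, network errors and 5xx responses are assumed to be the transient cases, and a body over 50MB is assumed to be rejected rather than truncated.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

const (
	requestTimeout = 30 * time.Second // per-request timeout from the list above
	maxAttempts    = 3                // assumption: three attempts in total
	maxBodyBytes   = 50 << 20         // 50MB response body limit
)

// errTransient marks failures that are worth retrying (hypothetical sentinel).
var errTransient = errors.New("transient fetch failure")

// sitemapClient is a dedicated client used only for sitemap requests.
var sitemapClient = &http.Client{Timeout: requestTimeout}

// retry calls fn until it succeeds, returns a non-transient error,
// or the attempts are used up. The backoff grows linearly per attempt.
func retry(attempts int, backoff time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(backoff * time.Duration(i+1))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// fetchOnce performs a single GET and enforces the body size limit.
func fetchOnce(client *http.Client, sitemapURL string) ([]byte, error) {
	resp, err := client.Get(sitemapURL)
	if err != nil {
		return nil, fmt.Errorf("%w: %v", errTransient, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return nil, fmt.Errorf("%w: status %d", errTransient, resp.StatusCode)
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	// Read one byte past the limit so an oversized body can be detected.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes+1))
	if err != nil {
		return nil, fmt.Errorf("%w: %v", errTransient, err)
	}
	if len(body) > maxBodyBytes {
		return nil, errors.New("sitemap body exceeds 50MB limit")
	}
	return body, nil
}

// fetchSitemap combines the dedicated client, the retries, and the size limit.
func fetchSitemap(sitemapURL string) ([]byte, error) {
	var body []byte
	err := retry(maxAttempts, 500*time.Millisecond, func() error {
		var err error
		body, err = fetchOnce(sitemapClient, sitemapURL)
		return err
	})
	return body, err
}

func main() {
	// Simulate two transient failures followed by success, without any network.
	calls := 0
	err := retry(maxAttempts, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping io.LimitReader around the response body caps memory use even when the server sends no Content-Length header or an incorrect one.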
Output
A slice of validated, de-duplicated URL strings, ready to be consumed by the Juicer module.