Core Concepts
Website Crawling
How hej! discovers and extracts content from your website.
The crawling process is the foundation of your AI chatbot's knowledge. Our intelligent crawler visits your website, discovers pages, and extracts the content that will power your AI assistant's responses.
How Crawling Works
1. Initial Discovery
Starting from your homepage, the crawler discovers all linked pages within your domain. It builds a sitemap of your entire website structure.
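The discovery step can be sketched as a breadth-first traversal that follows links but stays within your domain. This is an illustrative sketch, not hej!'s actual crawler: `fetch_links` is an injected stub standing in for a real HTTP fetch and link parse.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def discover_pages(start_url, fetch_links, max_pages=10_000):
    """Breadth-first discovery of same-domain pages, starting from the homepage.

    `fetch_links(url)` returns the hrefs found on a page; here it is injected
    so the sketch stays self-contained and testable.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    sitemap = []
    while queue and len(sitemap) < max_pages:
        url = queue.popleft()
        sitemap.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop fragments before de-duplicating.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap

# Demo with a stubbed site graph; note the external other.com link is skipped.
site = {
    "https://example.com/": ["/pricing", "/blog", "https://other.com/x"],
    "https://example.com/pricing": ["/"],
    "https://example.com/blog": ["/blog/post-1"],
    "https://example.com/blog/post-1": [],
}
pages = discover_pages("https://example.com/", lambda u: site.get(u, []))
```

Breadth-first order means pages closest to the homepage are discovered first, which is also a reasonable proxy for importance when a page limit applies.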
2. AI Page Selection
An LLM analyzes the discovered pages and intelligently selects which ones to include in your knowledge base. This ensures quality over quantity.
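The shape of this step can be sketched as follows. Everything here is hypothetical: `ask_llm` stands in for the real model call, and the prompt wording is illustrative only.

```python
import json

def select_pages(candidate_urls, ask_llm):
    """Ask a language model which discovered pages are worth indexing.

    `ask_llm` is a hypothetical callable standing in for the real model API;
    it is expected to return a JSON array of URLs.
    """
    prompt = (
        "From the following URLs, return a JSON array of the ones likely to "
        "contain substantive content (skip login, search, and tag pages):\n"
        + "\n".join(candidate_urls)
    )
    selected = json.loads(ask_llm(prompt))
    # Guard against hallucinated URLs: keep only actual candidates.
    return [u for u in selected if u in set(candidate_urls)]

# Demo with a stubbed model response; a real deployment would call an LLM.
def fake_llm(prompt):
    return '["https://example.com/pricing", "https://example.com/ghost"]'

chosen = select_pages(
    ["https://example.com/pricing", "https://example.com/login"], fake_llm
)
```

The final filter matters in practice: model output is validated against the candidate list so the knowledge base never contains URLs that were not actually discovered.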
3. Content Extraction
For each selected page, we extract the main content, removing navigation, footers, ads, and other boilerplate elements.
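A minimal version of boilerplate removal can be built on Python's stdlib `html.parser`. This is a simplified sketch of the idea, not the extractor hej! uses; it skips text inside a fixed set of boilerplate elements and keeps the rest.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"nav", "footer", "aside", "header", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collects page text while skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside nested boilerplate elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Demo: navigation and footer text is dropped, main content survives.
page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Pricing</h1><p>Plans start at $9.</p></main>"
    "<footer>© 2024</footer></body></html>"
)
text = extract_text(page)
```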
4. Indexing & Embedding
Extracted content is processed, chunked, and converted into vector embeddings for semantic search capabilities.
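Chunking can be illustrated with a simple overlapping word window. The sizes below are illustrative, not hej!'s actual settings; the embedding call itself is not shown, since each chunk would then be passed to an embedding model.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-window chunks before embedding.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from either side during semantic search.
    """
    assert chunk_size > overlap
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Demo: 450 words with a 200-word window and 50-word overlap -> 3 chunks.
doc = " ".join(f"w{i}" for i in range(450))
chunks = chunk_text(doc)
```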
Page Limits by Plan
Each plan includes a maximum number of indexed pages. These limits help ensure optimal performance and relevance of your knowledge base.
| Plan | Max Pages | Manual Re-crawl |
|---|---|---|
| Starter | 10,000 pages | Once per day |
| Growth | 25,000 pages | Once per day |
| Business | 100,000 pages | Once per day |
| Enterprise | Unlimited | Unlimited |
Priority Pages
What are Priority Pages?
Priority pages are frequently updated pages (like news, pricing, or contact info) that are re-crawled more often than your regular content. This ensures your AI always has the latest information.
Business and Enterprise plans include priority page support.
Re-crawling Your Site
When your website content changes, you'll want to update your AI's knowledge base by triggering a re-crawl.
Manual Re-crawl
Trigger a re-crawl anytime from your Studio dashboard. Great for after major content updates.
Re-crawl Limit
Paid plans can re-crawl once per day. Enterprise plans have no limit.
Best Practices
Ensure pages are publicly accessible
Our crawler can only index pages that don't require authentication.
Use semantic HTML
Proper heading structure and semantic elements help with content extraction.
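For example, a page marked up like the illustrative fragment below makes it easy for an extractor to keep the `<main>` content and discard the rest:

```html
<body>
  <nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
  <main>
    <h1>Pricing</h1>
    <h2>Starter plan</h2>
    <p>Plans start at $9/month.</p>
  </main>
  <footer><a href="/terms">Terms</a></footer>
</body>
```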
Keep content focused
Pages with clear, focused content produce better AI responses.
Avoid blocking our crawler
Make sure your robots.txt allows our crawler (User-agent: hej-bot).
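The snippet below is an illustrative robots.txt that allows hej-bot while restricting other bots; you can verify such rules yourself with Python's stdlib `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: hej-bot may crawl everything; other bots
# are kept out of /admin/.
ROBOTS_TXT = """\
User-agent: hej-bot
Allow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

hej_ok = rp.can_fetch("hej-bot", "https://example.com/pricing")
other_blocked = rp.can_fetch("SomeOtherBot", "https://example.com/admin/")
```

Checking your live file the same way (via `rp.set_url(...)` and `rp.read()`) is a quick sanity test before triggering a re-crawl.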