Core Concepts
Website Crawling
How hej! discovers and extracts content from your website.
The crawling process is the foundation of your AI chatbot's knowledge. Our intelligent crawler visits your website, discovers pages, and extracts the content that will power your AI assistant's responses.
How Crawling Works
1. Initial Discovery
Starting from your homepage, the crawler discovers all linked pages within your domain. It builds a sitemap of your entire website structure.
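The discovery step can be sketched as a breadth-first traversal that follows links but stays within your domain. This is an illustrative sketch, not hej!'s actual crawler: `fetch_links` is an injected stub standing in for a real HTTP fetch and link parse.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def discover_pages(start_url, fetch_links, max_pages=10_000):
    """Breadth-first discovery of same-domain pages, starting from the homepage.

    `fetch_links(url)` returns the hrefs found on a page; here it is injected
    so the sketch stays self-contained and testable.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    sitemap = []
    while queue and len(sitemap) < max_pages:
        url = queue.popleft()
        sitemap.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop fragments before de-duplicating.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap

# Demo with a stubbed site graph; note the external other.com link is skipped.
site = {
    "https://example.com/": ["/pricing", "/blog", "https://other.com/x"],
    "https://example.com/pricing": ["/"],
    "https://example.com/blog": ["/blog/post-1"],
    "https://example.com/blog/post-1": [],
}
pages = discover_pages("https://example.com/", lambda u: site.get(u, []))
```

Breadth-first order means pages closest to the homepage are discovered first, which is also a reasonable proxy for importance when a page limit applies.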
2. AI Page Selection
An LLM analyzes the discovered pages and intelligently selects which ones to include in your knowledge base. This ensures quality over quantity.
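The shape of this step can be sketched as follows. Everything here is hypothetical: `ask_llm` stands in for the real model call, and the prompt wording is illustrative only.

```python
import json

def select_pages(candidate_urls, ask_llm):
    """Ask a language model which discovered pages are worth indexing.

    `ask_llm` is a hypothetical callable standing in for the real model API;
    it is expected to return a JSON array of URLs.
    """
    prompt = (
        "From the following URLs, return a JSON array of the ones likely to "
        "contain substantive content (skip login, search, and tag pages):\n"
        + "\n".join(candidate_urls)
    )
    selected = json.loads(ask_llm(prompt))
    # Guard against hallucinated URLs: keep only actual candidates.
    return [u for u in selected if u in set(candidate_urls)]

# Demo with a stubbed model response; a real deployment would call an LLM.
def fake_llm(prompt):
    return '["https://example.com/pricing", "https://example.com/ghost"]'

chosen = select_pages(
    ["https://example.com/pricing", "https://example.com/login"], fake_llm
)
```

The final filter matters in practice: model output is validated against the candidate list so the knowledge base never contains URLs that were not actually discovered.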
3. Content Extraction
For each selected page, we extract the main content, removing navigation, footers, ads, and other boilerplate elements.
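A minimal version of boilerplate removal can be built on Python's stdlib `html.parser`. This is a simplified sketch of the idea, not the extractor hej! uses; it skips text inside a fixed set of boilerplate elements and keeps the rest.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"nav", "footer", "aside", "header", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collects page text while skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside nested boilerplate elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Demo: navigation and footer text is dropped, main content survives.
page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Pricing</h1><p>Plans start at $9.</p></main>"
    "<footer>© 2024</footer></body></html>"
)
text = extract_text(page)
```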
4. Indexing & Embedding
Extracted content is processed, chunked, and converted into vector embeddings for semantic search capabilities.
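Chunking can be illustrated with a simple overlapping word window. The sizes below are illustrative, not hej!'s actual settings; the embedding call itself is not shown, since each chunk would then be passed to an embedding model.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-window chunks before embedding.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from either side during semantic search.
    """
    assert chunk_size > overlap
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Demo: 450 words with a 200-word window and 50-word overlap -> 3 chunks.
doc = " ".join(f"w{i}" for i in range(450))
chunks = chunk_text(doc)
```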
Page Limits by Plan
Each plan includes a maximum number of indexed pages. These limits help ensure optimal performance and relevance of your knowledge base.
| Plan | Max Pages | Manual Re-crawl |
|---|---|---|
| Starter | 10,000 pages | Once per day |
| Growth | 25,000 pages | Once per day |
| Business | 100,000 pages | Once per day |
| Enterprise | Unlimited | Unlimited |
Priority Pages
What are Priority Pages?
Priority pages are frequently updated pages (like news, pricing, or contact info) that are re-crawled more often than your regular content. This ensures your AI always has the latest information.
Business and Enterprise plans include priority page support.
Re-crawling Your Site
When your website content changes, you'll want to update your AI's knowledge base by triggering a re-crawl.
Manual Re-crawl
Trigger a re-crawl anytime from your Studio dashboard. Great for after major content updates.
Re-crawl Limit
Paid plans can re-crawl once per day. Enterprise plans have no limit.
Best Practices
Ensure pages are publicly accessible
Our crawler can only index pages that don't require authentication.
Use semantic HTML
Proper heading structure and semantic elements help with content extraction.
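For example, a page marked up like the illustrative fragment below makes it easy for an extractor to keep the `<main>` content and discard the rest:

```html
<body>
  <nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
  <main>
    <h1>Pricing</h1>
    <h2>Starter plan</h2>
    <p>Plans start at $9/month.</p>
  </main>
  <footer><a href="/terms">Terms</a></footer>
</body>
```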
Keep content focused
Pages with clear, focused content produce better AI responses.
Avoid blocking our crawler
Make sure your robots.txt allows our crawler (User-agent: hej-bot).
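The snippet below is an illustrative robots.txt that allows hej-bot while restricting other bots; you can verify such rules yourself with Python's stdlib `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: hej-bot may crawl everything; other bots
# are kept out of /admin/.
ROBOTS_TXT = """\
User-agent: hej-bot
Allow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

hej_ok = rp.can_fetch("hej-bot", "https://example.com/pricing")
other_blocked = rp.can_fetch("SomeOtherBot", "https://example.com/admin/")
```

Checking your live file the same way (via `rp.set_url(...)` and `rp.read()`) is a quick sanity test before triggering a re-crawl.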