Heretto Portal Web Crawlers Configuration
When a bot visits your Heretto Portal, it goes through this sequence:
- Access check: The portal checks whether the bot's user-agent passes the browser version requirements or is on the whitelist.
- Crawl rules: If access is granted, the bot reads the robots.txt file to learn what it can crawl.
- Page-level rules: As the bot accesses individual pages, it reads meta tags for page-specific instructions.
Heretto Portal gives you full control over how bots interact with content on your production portal environment through three configuration options:
- userAgentWhitelist: Controls which bots can bypass browser version checks and access your production portal
- robotsTxt: Controls which areas of your site bots can crawl
- robotsMeta: Controls how individual pages are indexed
You can configure these settings for your production portal environment through the main config.json file associated with your main portal sitemap. This configuration is exclusive to production environments, as non-production portal environments are private and not indexed.
If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.
Let's take a closer look at these settings:
userAgentWhitelist setting
The userAgentWhitelist parameter enables you to control which bots can access your production portal. By default, the whitelist is empty and all bots are blocked, which protects your content from being accessed by any bot without your explicit permission.
The whitelist uses partial string matching against the user-agent header; for example, claude matches claudeBot. If a match is found, the bot gets access to your production portal.
Note: This setting only controls whether a bot can access your production portal. To control what content bots can crawl or index once they have access, configure robotsTxt and robotsMeta, described below.
An example that whitelists a few of the most popular AI bots:
"userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"]
robotsTxt setting
The robotsTxt parameter enables you to overwrite the default robots.txt settings of a production portal environment.
When a bot visits a website, it first looks for the robots.txt file in the root directory, reads the rules in this file, and, unless it's malicious, follows the instructions. For example, when a bot visits your production portal environment, it requests https://your-portal-domain/robots.txt.
By default, for production portal environments, the robots.txt file prevents all bots from accessing the backend API and frontend assets like JavaScript or CSS files, and points them at the sitemap.xml file for indexing. You can override these defaults with the robotsTxt parameter.
An example of a robotsTxt parameter that blocks all bots from accessing everything on a portal:
"robotsTxt": "User-agent: *\nDisallow: /\nSitemap: __domain__/sitemap.xml"
where __domain__ represents your portal domain and is replaced with it automatically.
Important: The robotsTxt parameter completely replaces the default robots.txt file, so make sure its value also includes the parts you want to keep unchanged, such as the sitemap reference.
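When this setting is applied, the string becomes the content served at https://your-portal-domain/robots.txt, with each \n escape becoming a line break and __domain__ replaced by your domain. Assuming an illustrative portal domain of https://docs.example.com, the blocking example above would produce a robots.txt file like this:
User-agent: *
Disallow: /
Sitemap: https://docs.example.com/sitemap.xml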
robotsMeta setting
The robotsMeta parameter enables you to control how search engines handle individual pages, for example, whether to index them, follow links, or show snippets in search results.
Common values include index and noindex to include or exclude pages from search results, follow and nofollow to allow or prevent following links on a page, and snippet and nosnippet to show or hide preview snippets.
An example of the robotsMeta parameter that tells search engines to index a portal and follow its links:
"robotsMeta": "index, follow"
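The robotsMeta value is conventionally emitted as a robots meta tag in each page's HTML head, so the example above corresponds to markup along these lines (the exact tag the portal renders may differ slightly):
<meta name="robots" content="index, follow">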
Important Considerations
This configuration is available only for production portal environments with SEO enabled. You can choose to enable SEO in your production portal environment or keep it disabled.
If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.
Non-production environments remain private, unindexed, and have SEO disabled.
When SEO is enabled and robotsMeta is empty, the default setting is index, follow. When robotsMeta is defined, it overrides the defaults.
When SEO is disabled, the default robotsMeta setting is noindex, nofollow.
Malicious bots may ignore your instructions.
Configure Web Crawler Settings
You can configure web crawler behavior for your production Heretto Portal environment to control which bots can access your site, what content they can crawl, and how pages are indexed. You add this configuration to the main config.json file associated with your main portal sitemap.
This configuration is available only for production portal environments with SEO enabled. Non-production environments remain private, unindexed, and have SEO disabled.
If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.
Ensure that you have permission to edit the main config.json file associated with your main portal sitemap.
View the default robots.txt settings for your production portal environment at https://your-portal-domain/robots.txt.
1. Configure the userAgentWhitelist parameter.
2. Configure the robotsTxt and robotsMeta parameters.
3. Validate and publish.
You modified crawler settings for your production portal environment. The user-agent whitelist takes effect immediately for bot access. The new SEO settings are now available for the robots.txt file and meta HTML tag. Remember that malicious crawlers may ignore your instructions. For sensitive data, consider using authentication.
Complete configuration of all three parameters:
"userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"],
"seo": {
"robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml",
"robotsMeta": "index, follow"
}
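For reference, here is a minimal sketch of how this fragment could sit in config.json, assuming these keys belong at the top level of the file alongside your existing portal settings (which are omitted from this sketch):
{
  "userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"],
  "seo": {
    "robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml",
    "robotsMeta": "index, follow"
  }
}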