Heretto Portal Web Crawlers Configuration

Web crawlers (also called bots) are automated programs that systematically browse and scan the internet to collect information about websites. They are primarily used by search engines to discover and index web pages as well as by Artificial Intelligence (AI) bots to collect bulk training data from your site.

When a bot visits your Heretto Portal, it goes through this sequence:

  1. Access check: The portal checks if the bot's user-agent passes the browser version requirements or is on the whitelist

  2. Crawl rules: If access is granted, the bot reads the robots.txt file to learn what it can crawl

  3. Page-level rules: As the bot accesses individual pages, it reads meta tags for page-specific instructions

Heretto Portal gives you full control over how bots interact with content on your production portal environment through three configuration options:

  • userAgentWhitelist: Controls which bots can bypass browser version checks and access your production portal

  • robotsTxt: Controls which areas of your site bots can crawl

  • robotsMeta: Controls how individual pages are indexed

You can configure these settings for your production portal environment through the main config.json file associated with your main portal sitemap. This configuration is exclusive to production environments, as non-production portal environments are private and not indexed.
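
These settings live at the top level of config.json. As a quick orientation, here is a minimal skeleton with the values left empty; each setting is described in detail below, and a complete example appears at the end of this article:

"userAgentWhitelist": [],
"seo": {
    "robotsTxt": "",
    "robotsMeta": ""
}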

Important:

If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.

Let's take a closer look at these settings:

userAgentWhitelist setting

The userAgentWhitelist parameter enables you to control which bots can access your production portal. By default, the whitelist is empty and all bots are blocked, which protects your content from being accessed by any bot without your explicit permission.

The whitelist uses partial string matching against the user-agent header; for example, the entry claude matches claudeBot. If a match is found, the bot is granted access to your production portal.

Note:

This setting only controls whether a bot can access your production portal. To control what content bots can crawl or index once they have access, configure robotsTxt and robotsMeta described below.

An example that whitelists a few of the most popular AI bots:

"userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"]
robotsTxt setting

The robotsTxt parameter enables you to overwrite the default robots.txt file settings of a production portal environment.

When a bot visits a website, it first looks for the robots.txt file in the root directory, reads the rules in this file, and, unless it's malicious, follows the instructions. For example, when a bot visits your production portal environment, it goes to https://your-portal-domain/robots.txt.

By default, for production portal environments, the robots.txt file prevents all bots from accessing the backend API and frontend assets, such as JavaScript or CSS files, and points them to the sitemap.xml file for indexing. You can override these defaults by using the robotsTxt parameter.
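
As an illustration only, a default of this kind renders along the following lines. The Disallow paths shown here are placeholders, so check https://your-portal-domain/robots.txt for the actual defaults in your environment:

User-agent: *
Disallow: /api/*
Disallow: /assets/*
Sitemap: https://your-portal-domain/sitemap.xml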

An example of a robotsTxt parameter that blocks all bots from accessing everything on a portal:

"robotsTxt": "User-agent: *\nDisallow: /\nSitemap: __domain__/sitemap.xml"

where __domain__ is a variable that is automatically replaced with your portal domain.
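
A more selective override can block a single crawler by name while leaving everything else open. In this sketch, ExampleBot is a placeholder for whichever user-agent you want to block:

"robotsTxt": "User-agent: ExampleBot\nDisallow: /\nUser-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml"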

Important:

The robotsTxt parameter completely replaces the robots.txt file, so make sure its value also includes the parts you want to keep unchanged, such as the Sitemap line.

robotsMeta setting

The robotsMeta parameter enables you to control how search engines handle individual pages, for example, whether to index them, follow links, or show snippets in search results.

Common values include index and noindex to include or exclude a page from search results, follow and nofollow to allow or prevent crawlers from following links on a page, and nosnippet to hide preview snippets in search results.

An example of the robotsMeta parameter that tells search engines to index a portal and follow its links:

"robotsMeta": "index, follow"

Important Considerations

  • This configuration is available only for production portal environments with SEO enabled. You can choose to enable SEO in your production portal environment or keep it disabled.

  • If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.

  • Non-production environments remain private, unindexed, and have SEO disabled.

  • When SEO is enabled and robotsMeta is empty, the default setting is index, follow. When robotsMeta is defined, it overrides the defaults.

  • When SEO is disabled, the default robotsMeta setting is noindex, nofollow.

  • Malicious bots may ignore your instructions.

Configure Web Crawler Settings

You can configure web crawler behavior for your production Heretto Portal environment to control which bots can access your site, what content they can crawl, and how pages are indexed. You add this configuration to the main config.json file associated with your main portal sitemap.

This configuration is available only for production portal environments with SEO enabled. Non-production environments remain private, unindexed, and have SEO disabled.

Important:

If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.

Before you begin:

  • Ensure you have the permissions to edit the main config.json file associated with your main portal sitemap.

  • View the default robots.txt settings for your production portal environment at https://your-portal-domain/robots.txt.

Configure the userAgentWhitelist parameter

  1. In the master branch, navigate to the main config.json file associated with your main portal sitemap.
  2. Right-click the config.json file and select Edit Source.
  3. In the Source Editor, scroll to the bottom of the file and add the userAgentWhitelist element with the bots you want to whitelist:

    This setting whitelists some of the main AI bots. The trailing comma assumes another entry, such as the seo section configured next, follows it:

    "userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"],

Configure robotsTxt and robotsMeta parameters

  1. In the Source Editor, scroll to the bottom of the config.json file and, if it is not already present, add the seo section as shown here:

    "seo": {
        "robotsTxt": "",
        "robotsMeta": ""
    }

    where

    • The robotsTxt parameter defines the rules for the robots.txt file, which search engine crawlers use to determine what parts of the site they can and cannot access

    • The robotsMeta parameter defines meta directives for search engines, typically embedded in a meta tag in the HTML head

  2. Add desired values to the robotsTxt parameter.

    Here are some important guidelines for configuring crawler settings:

    • In the robotsTxt parameter, separate each value, like User-agent: * or Disallow: /api/*, with \n to indicate to the robots.txt file where a new line starts.

    • Separate multiple entries in the seo parameter with commas (,). For example, "robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml",

      Note:

      The __domain__ variable represents your portal domain and gets replaced with it automatically.

    "seo": { 
            "robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml",
    	    "robotsMeta": ""	
        }

    With \n added as in this example, the robots.txt file will be interpreted and rendered like this:

    User-agent: *
    Disallow: 
    Sitemap: __domain__/sitemap.xml
    
  3. Add desired values to the robotsMeta parameter.

    Note that you need a comma after the robotsTxt entry and no comma after "robotsMeta": "index, follow".

    "seo": { 
            "robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml", 
            "robotsMeta": "index, follow"
        }

    With robotsMeta set up like this, the robots meta tag in the HTML head will be rendered like this:

    <meta name="robots" content="index, follow">

  4. Save your changes.

Validate and publish

  1. Validate your .json file.

    JSON syntax is strict, and the file won't validate if, for example, a quotation mark or bracket is missing or you have an extra comma. One option for validating the file is https://jsonlint.com/.

  2. Push the config.json file from the master branch to your production branch and publish the changes.

You modified crawler settings for your production portal environment. The user-agent whitelist takes effect immediately for bot access, and the new SEO settings are applied to the robots.txt file and the robots meta tag. Remember that malicious crawlers may ignore your instructions; for sensitive data, consider using authentication.

Complete configuration of all three parameters:

"userAgentWhitelist": ["claude", "gpt", "gemini", "microsoft", "copilot", "grok"],
"seo": {
    "robotsTxt": "User-agent: *\nDisallow:\nSitemap: __domain__/sitemap.xml", 
    "robotsMeta": "index, follow"
}