Heretto Portal Web Crawlers Configuration

Overview

Web crawlers are automated programs (or bots) that systematically browse and scan the internet to collect information about websites. They are primarily used by search engines to discover and index web pages, enabling users to find relevant content through search results, as well as by Artificial Intelligence (AI) bots to collect bulk training data from your site.

The two primary sources of information crawlers use to learn what they can and can't do on a website are the robots.txt file and the meta tag of the HTML header.

robots.txt file

The robots.txt file is an important component of SEO (Search Engine Optimization), as it helps control how search engines crawl and index a website. While it doesn't directly influence rankings, it plays a significant role in managing crawler behavior and ensuring efficient indexing.

When a crawler visits a website, it first looks for the robots.txt file in the root directory, for example, https://www.domain.com/robots.txt. The crawler reads the rules in this file and follows the instructions.
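For illustration, a minimal robots.txt might look like the following. This is a generic sketch with placeholder paths and domain, not the defaults used by your portal:

    User-agent: *
    Disallow: /private/
    Allow: /
    Sitemap: https://www.domain.com/sitemap.xml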

Purpose of the robots.txt file:

  • Control crawling: It instructs web crawlers which sections of a website they are allowed to crawl.

  • Reduce server load: By restricting crawlers from accessing certain parts of a website, it helps reduce unnecessary server strain.

  • Prevent indexing of sensitive content: It can prevent crawlers from indexing private or irrelevant pages (though it does not secure these pages from access).

Limitations of the robots.txt file:

  • Not a security measure: It does not prevent unauthorized users or bots from accessing restricted content if they know the URL.

  • Ignored by malicious crawlers: Some crawlers may ignore robots.txt instructions.

meta tag of the HTML header

The robots meta tag is an HTML element used to control how search engine crawlers (robots) handle a webpage. It provides specific instructions about whether the page should be indexed, whether links on the page should be followed, and other crawling and indexing behaviors.
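The tag is placed inside the head element of a page. For illustration, a hypothetical page header with a robots meta tag might look like this (the directive shown is an example, not a recommended setting):

    <head>
      <title>Example page</title>
      <meta name="robots" content="noindex, nofollow">
    </head>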

Purpose of the meta tag:

  • Control indexing: Specifies whether a webpage should be included in a search engine's index. For example, this setting prevents the page from appearing in search results: <meta name="robots" content="noindex">.

  • Control link following: Determines whether search engines should follow and evaluate links on the page. For example, this setting prevents passing link equity to linked pages: <meta name="robots" content="nofollow">.

  • Restrict snippets and previews: Limits what content from the page is shown in search results, such as snippets or image previews. For example, this setting prevents snippets from being shown in search results: <meta name="robots" content="nosnippet">.

  • Page-specific control: Enables fine-grained control of crawling and indexing for individual pages, especially when global rules in robots.txt are insufficient or inappropriate.

Limitations of the meta tag:

  • Crawlers must access the page: Crawlers must download and parse the page to see the tag. If access to the page is restricted (e.g., blocked by robots.txt), the meta tag will not be read.

  • Malicious crawlers: Only search engines and bots that respect standard crawling protocols will honor the directives. Malicious bots or scrapers might ignore the tag entirely.

  • Conflicts with robots.txt: If a page is disallowed in robots.txt, the crawler won't see the meta tag, rendering it ineffective.

You can configure the settings for the robots.txt file and the meta HTML tag for your production Heretto Portal environment through the main config.json file associated with your main portal sitemap.

Important:

If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.

Configure Web Crawler Settings

You can configure crawler behavior for your Heretto Portal to optimize how your help site interacts with search engines, ensure better control over your site's visibility and accessibility, or prevent indexing of sensitive or irrelevant content.

The default crawler settings are defined in the robots.txt file, which you can view for each portal environment by opening https://www.your-portal-domain.com/robots.txt in a web browser.

You can overwrite the default robots.txt settings by adding new settings in the main config.json file associated with your main portal sitemap.

Ensure you have the permissions to edit the main config.json file associated with your main portal sitemap.

This configuration applies only to production portal environments, which have SEO features enabled by default. For non-production portal environments, SEO features, and therefore bots, are disabled.

  1. In the master branch, navigate to the main config.json file associated with your main portal sitemap.
  2. Right-click the config.json file and select Edit Source.
  3. In the Source Editor, scroll to the bottom of the file and, if not present, add the seo section as shown here.
    "seo": {
            "robotsTxt": "", 
            "robotsMeta": ""
        }

    where

    • The robotsTxt key defines the rules for the robots.txt file, which search engine crawlers use to determine what parts of the site they can and cannot access

    • The robotsMeta key defines meta directives for search engines, typically embedded in the meta tag of the HTML header

  4. Add desired values to the robotsTxt key.

    Here are some important guidelines for configuring crawler settings:

    • In the robotsTxt key, separate each value, like User-agent: * or Disallow: /api/*, with \n to indicate to the robots.txt file where a new line starts.

    • Separate multiple entries in the seo parameter with commas (,). For example, "robotsTxt": "User-agent: *\nDisallow: /api/*\nDisallow: /bundle/*\nSitemap: __domain__/sitemap.xml",

    • Do not add a comma after the last entry in the seo parameter.

    Tip:

    To add a comment in a .json file, use two forward slashes //.

    "seo": {
            //A comment in a .json file 
            "robotsTxt": "User-agent: *\nDisallow: /api/*\nDisallow: /bundle/*\nSitemap: __domain__/sitemap.xml"
        },

    With \n added as in this example, the robots.txt file will be interpreted and rendered like this:

    User-agent: *
    Disallow: /api/*
    Disallow: /bundle/*
    Sitemap: __domain__/sitemap.xml
    
  5. Add desired values to the robotsMeta key.

    Note the comma at the end of the robotsTxt entry and the absence of a comma after the last entry in the seo parameter: "robotsMeta": "noindex, follow"

    "seo": {
            //A comment in a .json file 
            "robotsTxt": "User-agent: *\nDisallow: /api/*\nDisallow: /bundle/*\nSitemap: __domain__/sitemap.xml", 
            "robotsMeta": "noindex, follow"
        },

    With robotsMeta set up like this, the meta tag of the HTML header will be interpreted like this:

    <meta name="robots" content="noindex, follow">
  6. Save your changes.
  7. Validate your .json file.

    This syntax is very specific and the file won't validate if, for example, a quotation mark is missing or you have an extra comma. One option for validating the file is https://jsonlint.com/.
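
    If you prefer an additional local check, a minimal sketch like the following also works, assuming Python is available on your machine. Remove any // comments first, because strict JSON parsers reject them:

    import json
    import sys

    # Minimal local syntax check for a config.json file passed as the first argument.
    # Note: strict JSON parsers reject // comments, so remove them before running this.
    try:
        with open(sys.argv[1], encoding="utf-8") as config_file:
            json.load(config_file)
        print("Valid JSON")
    except json.JSONDecodeError as error:
        print(f"Invalid JSON: {error}")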

You modified crawler settings for your production portal environment. They are now available in the robots.txt file and the meta HTML tag. Remember that malicious crawlers may ignore your new instructions. For sensitive data, consider using authentication.

Be sure to push the config.json file from the master branch to your staging and production branches. Publish the changes.
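
After publishing, you can spot-check the result in a web browser, or with a short script like the following illustrative Python sketch. The URLs are placeholders for your own portal domain, and any HTTP client works just as well:

    import urllib.request

    # Placeholder URLs: replace with your portal domain and a published page.
    ROBOTS_URL = "https://www.your-portal-domain.com/robots.txt"
    PAGE_URL = "https://www.your-portal-domain.com/"

    # Print the published robots.txt rules.
    with urllib.request.urlopen(ROBOTS_URL) as response:
        print(response.read().decode("utf-8"))

    # Print any robots meta tag found in the page's HTML.
    with urllib.request.urlopen(PAGE_URL) as response:
        html = response.read().decode("utf-8")
    for line in html.splitlines():
        if '<meta name="robots"' in line:
            print(line.strip())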