Heretto Portal Web Crawlers Configuration
Overview
The two primary sources of information crawlers use to learn what they can and can't do on a website are the robots.txt file and the meta tag of the HTML header.
- robots.txt file
  An important component of SEO (Search Engine Optimization), the robots.txt file helps control how search engines crawl and index a website. While it doesn't directly influence rankings, it plays a significant role in managing crawler behavior and ensuring efficient indexing.
When a crawler visits a website, it first looks for the robots.txt file in the root directory, for example, https://www.domain.com/robots.txt. The crawler reads the rules in this file and follows the instructions.
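For illustration, a minimal robots.txt might look like the following. The paths, bot name, and sitemap URL are placeholders, not Heretto defaults:

```text
# Rules for all crawlers
User-agent: *
Disallow: /internal/
Allow: /

# Block one specific (hypothetical) bot entirely
User-agent: BadBot
Disallow: /

Sitemap: https://www.domain.com/sitemap.xml
```

Each `User-agent` group applies to the named crawler; `Disallow` and `Allow` rules are matched against URL paths relative to the site root.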
Purpose of the robots.txt file:
- Control crawling: It instructs web crawlers which sections of a website they are allowed to crawl.
- Reduce server load: By restricting crawlers from accessing certain parts of a website, it helps reduce unnecessary server strain.
- Prevent indexing of sensitive content: It can prevent crawlers from indexing private or irrelevant pages (though it does not secure these pages from access).
Limitations of the robots.txt file:
- Not a security measure: It does not prevent unauthorized users or bots from accessing restricted content if they know the URL.
- Ignored by malicious crawlers: Some crawlers may ignore robots.txt instructions.
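Compliant crawlers apply robots.txt rules mechanically. You can reproduce their decision logic with Python's standard-library `urllib.robotparser`; the rules and URLs below are illustrative, not taken from a real portal:

```python
from urllib import robotparser

# Illustrative robots.txt rules, parsed in-memory (no network access)
rules = """\
User-agent: *
Disallow: /internal/
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A compliant crawler skips the disallowed path...
print(parser.can_fetch("*", "https://www.domain.com/internal/drafts"))  # False
# ...but may fetch everything else
print(parser.can_fetch("*", "https://www.domain.com/docs/start"))       # True
```

This is exactly the check well-behaved bots perform before requesting a URL; malicious scrapers simply skip it, which is why robots.txt is not a security boundary.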
- meta tag of the HTML header
  An HTML element used to control how search engine crawlers (robots) handle a webpage. It provides specific instructions about whether the page should be indexed, whether links on the page should be followed, and other crawling and indexing behaviors.
Purpose of the meta tag:
- Control indexing: Specifies whether a webpage should be included in a search engine's index. For example, this setting prevents the page from appearing in search results: <meta name="robots" content="noindex">.
- Control link following: Determines whether search engines should follow and evaluate links on the page. For example, this setting prevents passing link equity to linked pages: <meta name="robots" content="nofollow">.
- Restrict snippets and previews: Limits what content from the page is shown in search results, such as snippets or image previews. For example, this setting prevents snippets from being shown in search results: <meta name="robots" content="nosnippet">.
- Page-specific control: Enables fine-grained control of crawling and indexing for individual pages, especially when global rules in robots.txt are insufficient or inappropriate.
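Multiple directives can also be combined in a single tag. A minimal page head might look like this; the page title and the choice of directives are illustrative:

```html
<head>
  <title>Internal Release Notes</title>
  <!-- Keep this page out of the index and don't follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```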
Limitations of the meta tag:
- Crawlers must access the page: Crawlers must download and parse the page to see the tag. If access to the page is restricted (e.g., blocked by robots.txt), the meta tag will not be read.
- Malicious crawlers: Only search engines and bots that respect standard crawling protocols will honor the directives. Malicious bots or scrapers might ignore the tag entirely.
- Conflicts with robots.txt: If a page is disallowed in robots.txt, the crawler won't see the meta tag, rendering it ineffective.
You can configure the settings for the robots.txt file and the meta HTML tag for your production Heretto Portal environment through the main config.json file associated with your main portal sitemap.
If your production portal is not officially launched yet, consult the Heretto Implementation Team before making any changes to crawler settings. During the implementation phase, SEO is disabled for production portal environments.
Configure Web Crawler Settings
You can configure crawler behavior for your Heretto Portal to optimize how your help site interacts with search engines, ensure better control over your site's visibility and accessibility, or prevent indexing of sensitive or irrelevant content.
The default crawler settings are defined in the robots.txt file that you can view by opening https://www.your-portal-domain.com/robots.txt in a web browser for each portal environment.
You can override the default robots.txt settings by adding new settings in the main config.json file associated with your main portal sitemap.
Ensure you have the permissions to edit the main config.json file associated with your main portal sitemap.
You modified crawler settings for your production portal environment. They are now available in the robots.txt file and the meta HTML tag. Remember that malicious crawlers may ignore your new instructions. For sensitive data, consider using authentication.
Be sure to push the config.json file from the master branch to your staging and production branches. Publish the changes.