Block certain-namespace webpages for anonymous (unregistered) users with some information-security method
For each general-audience webpage (i.e. any main-namespace page such as an article page or Category: page), the MediaWiki content management system creates roughly 10, 100, 1,000 or even more auxiliary webpages (link pages, revision pages, revision-diff pages, etc.), and for me that's a serious SEO problem.
MediaWiki doesn't have any fast way, in core or in extensions, to lock all these "peripheral webpages" (for lack of a better term) to registered users only. So naturally any anonymous user, which includes Google's crawler, will crawl them anew each time, and this can easily exhaust the crawl budget allocated for the website.
Blocking these pages with a blunt robots.txt such as the following is nice, but robots.txt blocking is by nature only advisory; directives can go stale; directives won't necessarily affect all search engines; and the following rules aren't readable for users who don't know, or don't know enough, wildcard-pattern syntax.
User-agent: *
Sitemap: https://example.com/sitemap/sitemap.xml
Disallow: /index.php?
Disallow: /index.php/*:
Allow: /index.php/Category:
Allow: /index.php/קטגוריה:
As of the time of publishing this post, MediaWiki has no command to hide everything outside the main namespace from anonymous users (so that those pages would never even be discovered by search engines in the first place). For me that's a serious SEO problem: it lets thousands, if not tens or hundreds of thousands, of likely irrelevant webpages be discovered and, most probably, periodically crawled (whether or not they get indexed), and that just eats whatever crawl budget the site has.
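For context, the closest core mechanism seems to be MediaWiki's coarse read-restriction settings in LocalSettings.php. The setting names below ($wgGroupPermissions, $wgWhitelistRead, $wgWhitelistReadRegexp) are real core configuration variables, but the exact patterns are an untested sketch, not a verified solution for this wiki:

```php
<?php
# Sketch for LocalSettings.php. The whitelist values below are
# illustrative placeholders, not tested configuration.

# Deny page reads to anonymous users...
$wgGroupPermissions['*']['read'] = false;

# ...then whitelist the pages anonymous users (and crawlers) may still see.
$wgWhitelistRead = [ 'Main Page', 'Special:UserLogin' ];

# Core also supports regex whitelisting, which could approximate
# "main namespace and Category: pages only" (pattern is a guess):
$wgWhitelistReadRegexp = [ '/^(?:[^:]+|Category:.+|קטגוריה:.+)$/u' ];
```

The caveat is that this restriction works per page title, not per URL variant, so it may not by itself hide the peripheral action/diff/oldid URLs of a whitelisted title.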
Blocking these webpages at the server level via Apache directives and regex isn't good either, because I do want to serve them, just not to anonymous users (which includes crawlers).
But maybe some Web Application Firewall (WAF) could help.
I host my website on a shared server plan at Namecheap, with cPanel and Apache's ModSecurity WAF (or another WAF).
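In principle, a chained ModSecurity rule could deny the peripheral URLs to clients that present no MediaWiki session cookie. Here is a sketch in ModSecurity v2 rule syntax; the rule id, the query-string pattern, and the cookie fragment "UserID" are all assumptions (MediaWiki's actual cookie name depends on $wgCookiePrefix, and many shared hosts don't allow custom SecRules at all):

```apache
# Sketch of a chained ModSecurity v2 rule (assumptions: custom rules are
# permitted on this hosting plan, and the login cookie name contains
# "UserID", which depends on the wiki's $wgCookiePrefix).

# Match "peripheral" query strings (action=, diff=, oldid=)...
SecRule QUERY_STRING "@rx (?:^|&)(?:action|diff|oldid)=" \
    "id:1000001,phase:1,t:none,deny,status:403,chain"
    # ...and only deny when no MediaWiki session cookie is present.
    SecRule REQUEST_HEADERS:Cookie "!@contains UserID" "t:none"
```

If custom SecRules aren't permitted on the shared plan, the same check might instead be expressed with mod_rewrite in .htaccess (a RewriteCond on %{QUERY_STRING} plus one on %{HTTP_COOKIE}).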
Can this be used to solve my problem, and if so, how?