WikiIndex:Spam control policy: Difference between revisions

From WikiIndex
m (Hoof Hearted moved page WikiIndex:Spam Control Policy to WikiIndex:Spam control policy without leaving a redirect: Text replacement - "Spam Control Policy" to "Spam control policy")
 
(24 intermediate revisions by 10 users not shown)
For some thoughts about spambot hunting see [[WikiProject:Junking bots]] <small>...didn't know where to link it --[[Wolf Peuker|Wolf]] | <small>[[User talk:Peu|talk]]</small> 07:05, 13 October 2007 (EDT)</small>
{{TOC right}}
==Proposed Spam Control Policy==
For some thoughts about [[spambot]] hunting, see [[WikiProject:Junking bots]] <small>...didn't know where to link it --[[Wolf Peuker|Wolf]] | <small>[[User talk:Peu|talk]]</small> 07:05, 13 October 2007 (EDT)</small>


Comments & Corrections Welcome
==Proposed spam control policy==
Comments and corrections welcome.


There are three levels of spam control now in play and this policy will address how each of them should be used. The fourth level is not yet implemented.
There are three levels of {{tag|spam}} control now in play here on {{tag|WikiIndex}}, and this {{tag|policy}} will serve as {{tag|guidelines}} to address how each of them should be used.


==Level 1 - LocalSettings.php==
===Level 1 - LocalSettings.php===
There is a regex filter in the [[LocalSettings.php]] file that is under the control of the [[WikiIndex:Bureaucrats|site bureaucrats]].  This level blocks specific words, phrases, and html fragments – that are commonly used by link [[spam]]mers and [[vandal]]s.  It contains common curse and swear words, sex acts and symbols, body parts, most of the major drug and pharmacy names, commonly counterfeited brand names and products, and html fragments that are used to hide and / or mask link spam, vandalism, and graffiti.


There is a regex filter in the LocalSettings.php file that is under the control of the site admin. This level blocks specific words, phrases and html fragments that are commonly used by link spammers and vandals. It contains common curse words, sex acts and symbols, body parts, most of the major drug names and html fragments that are used to hide and/or mask link spam and graffiti.
These are the common denominators of 90% of spam and graffiti. The regex will match any of these items, and block the saving of an [[edit]] on a page that has any one of these items anywhere on it.
 
These are the common denominators of 90% of spam and graffiti. The regex will match any of these items and block the save of a page that has any one of these items anywhere on it.  


It is NOT necessary to block anything containing these words anywhere else.
It is NOT necessary to block anything containing these words anywhere else.


==Level 2 - WikiMedia Blacklist==
===Level 2 - Wikimedia Meta-Wiki spam blacklist===
 
[[Wikimedia Meta-Wiki]] maintains an active [[blacklist]] of link spam target sites and common phrases that are used in link spam. We are connected to this list, and every page save is checked against it for bad external links. The blacklist only checks URLs, not the complete content. The blacklist itself can be seen at '''[[Meta-Wiki:Spam blacklist|Wikimedia spam blacklist]]''', and additions to it can be suggested at the '''[[Meta-Wiki:Talk:Spam blacklist|blacklist talk page]]'''. You can expect this list to contain most of the current major link spam offenders' targets. This list is very actively maintained, and should raise our blocking level to about 99.5% of link spam. <small>Is this 99.5% figure still accurate?  [[User:Hoof Hearted|Sean, aka <small>Hoof Hearted</small>]] • <sub>[[:Category:Active administrators of this wiki|Admin]]</sub> • <small>[[User talk:Hoof Hearted|talk2HH]]</small> 02:45, 9 March 2013 (PST)</small>
WikiMedia maintains an active blacklist of link spam target sites and common phrases that are used in link spam. We are connected to this list and every page save checks this list for bad external links on the list. The blacklist only checks URLs not the complete content. The blacklist itself can be seen at '''[http://meta.wikimedia.org/wiki/Spam_blacklist WikiMedia Blacklist]''' and additions to it can be suggested at the '''[http://meta.wikimedia.org/wiki/Talk:Spam_blacklist blacklist talk page]'''. You can expect this list to contain most of the current major link spam offenders targets. This list is very actively maintained and should raise our blocking level to about 99.5% of link spam.


If you find a particular link is not being caught by Levels 1 or 2 you can:
If you find a particular link is not being caught by Levels 1 or 2 you can:
# If it is link spam or graffiti that contains a word that should be in the level 1 list, submit it to the site admin via email.
#If it is link spam or graffiti that contains a word that should be in the level 1 list, submit it to the [[:Category:Active administrators of this wiki|site administrators]] via a new message on their [[talk page]];
# If it is a link that you think should be banned at all wikis, submit it to '''[http://meta.wikimedia.org/wiki/Talk:Spam_blacklist WikiMedia]'''
#If it is a link that you think should be banned at all [[wiki]]s, submit it to '''[[Meta-Wiki:Talk:Spam blacklist|Wikimedia Meta-Wiki]]''';
# If it is a link that you think should be banned from just WikiIndex, go to level 3
#If it is a link that you think should be banned from just [[WikiIndex]], go to Level 3.
==Level 3 - Local Blacklist==


We maintain a local blacklist at '''[[My spam blacklist]]''' this is protected page that Sysops can use to block offending link spam not caught by Level 1 and Level 2. There should be very few entries here and NONE that contain the following:
===Level 3 - Local blacklist===
* Periods "." - Periods have a special meaning in the regex syntax and can cause the list to malfunction.
We maintain a local blacklist at '''[[My spam blacklist]]'''.  This is a protected page that [[:Category:Active administrators of this wiki|Sysops]] can use to block offending link spam not caught by Level 1 and Level 2. There should be very few entries here, and NONE that contain the following:
* Tlds "com, org, net" - These appear in all URLs so provide no value to the blocking mechanism.
*Periods (full stop) '.' — periods, aka the 'full stop' have a special meaning in the regex syntax, and can cause the list to malfunction;
* "http://www." - The regex only checks valid URLs so this is not necessary.
*TLDs (top level domains) 'com, org, net' — these appear in virtually all [[:Category:United States of America|United States]] URLs, and are also extensively used in many (though not all) [[:Category:Country|countries]] around the [[:Category:Earth|world]], so provide no value to the blocking mechanism;
*'http://www.' — the regex only checks valid URLs, so this is not necessary.


An Example:
;An example:
If you want to block linking to <tt><nowiki>http://www.MyBadWordSite.com</nowiki></tt> you should only enter '<tt>MyBadWordSite</tt>'.


If you want to block linking to http://www.mybadwordsite.com you should only enter "mybadwordsite"
If Level 1 or Level 2 already contains the 'bad word', the link is blocked already, so no entry is necessary; you would also be unable to save the list.


If Level 1 or Level 2 already contain the "badword" then the link would be blocked already and no entry would be necessary and you would not be able to save the list.
===Level 3.5 - CAPTCHA===
Now implemented.  By default, CAPTCHAs are triggered on the following events:
*[[Special:CreateAccount|New user registration]];
*[[Anonymity|Anonymous]], or [[IP editor]] edits that contain new external links;
*Brute-force password cracking.


==Level 3.5 - CAPTCHA==
===Level 4 - Login to edit===
We have <u>not</u> implemented [[:Category:LoginToEdit|login to edit]].  This means that we still allow [[Anonymity|anonymous]] editing by '[[IP editor]]s', due to our original community desire of being open to all for editing.  Levels 1, 2, 3, and 3.5 should (hopefully) provide an adequate level of spam protection.


Now implemented, by default, CAPTCHAs are triggered on the following events:
There are three fundamental options of 'login to edit' at this level, though not all are available or implemented here on WikiIndex, but include:
#Require a login to edit, with no further checks;
#Require a login to edit, and '''request''' a valid [[:Category:ConfirmEmail|e-mail confirmation]];
#Require a login to edit, and '''insist''' a valid e-mail confirmation is completed before editing is allowed.
For new [[user]]s who do wish to create a registered account here on WikiIndex, ''and'' then edit under their desired [[username]], we currently require option 3 above.


* New user registration
==Guidance for spam fighters==
* Anonymous edits that contain new external links
[[Ward Cunningham]] offered this advice for {{tag|SpamFighting|spam fighting}}, and keeping your sanity: "do the absolute minimum required to block each attack and the spammer will grow tired and leave" (paraphrasing).  This is so true, because some folks can drive themselves crazy trying to think of a way to defeat all attacks in advance of their actually happening!
* Brute-force password cracking


==Level 4 - Login to Edit==
==MediaWiki spam blacklist regex==
Not implemented yet.
According to the [https://phabricator.Wikimedia.org/diffusion/ESPB/browse/master/README readme] for the {{Mw|Extension:SpamBlacklist|MediaWiki spam blacklist extension}}, internally a single giant regular expression is formed using the lines from the blacklist file as follows:


There are two options at this level:
In simple terms:
# Require a login to edit and '''request''' an email confirmation.
*Everything from a "#" character to the end of the line is a comment (and therefore is only for information or advice, and has no effect);
# Require a login to edit and '''require''' an email confirmation before editing is allowed.
*Every non-blank line is a regex fragment which will '''only match inside URLs''' (ie, the active component of the blacklist).


==Guidance for Spam Fighters==
Internally, a regex is formed which looks like this:
[[Ward Cunningham]] gave me this advice for spam fighting and keeping your sanity, "do the absolute minimum required to block each attack and the spammer will grow tired and leave" (I'm paraphrasing) This is so true because you can drive yourself crazy trying to think of a way to defeat all attacks in advance of their actually happening!
<pre>!http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si</pre>
A few notes about this format.  It is not necessary to add 'www' to the start of hostnames; the regex is designed to match <u>any</u> sub-domain.  Do not add patterns to your file which may run off the end of the URL, e.g. anything containing '<tt>.*</tt>'.  Unlike in some similar systems, the line-end meta-character '<tt>$</tt>' will not assert the end of the hostname; it will assert the end of the page.


==Spam Blacklist Regex==
==What happens if spam slips through automated systems==
According to the readme for the MediaWiki spam blacklist extension, internally a single giant regular expression is formed using the lines from the blacklist file as follows:
Please delete any [[spam]] that slips through the automated systems, and add the <code>{{template|spammer}}</code> tag on the [[spammer]]'s [[user page]].  If a new page has been created containing purely spam, edit the page by deleting said spam, and highlight the page for deletion by adding the <code>{{template|delete}}</code> tag, ideally by using <code><nowiki>{{delete|spam}}</nowiki></code>.  If you are a [[:Category:Active administrators of this wiki|Sysop]], please block the spammer in accordance with the [[WikiIndex:Blocking and banning policy]].


In simple terms:
==External links==
* Everything from a "#" character to the end of the line is a comment
*{{Mw|Manual: Combating spam}} — at [[MediaWiki.org]]
* Every non-blank line is a regex fragment which will '''only match inside URLs'''


Internally, a regex is formed which looks like this:
[[Category:Guidelines]]
<pre>
[[Category:Spam| ]]
  !http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si
[[Category:Spammed| ]]
</pre>
[[Category:SpamFighting| ]]
A few notes about this format. It's not necessary to add www to the start of
hostnames, the regex is designed to match any subdomain. Don't add patterns
to your file which may run off the end of the URL, e.g. anything containing
".*". Unlike in some similar systems, the line-end metacharacter "$" will not
assert the end of the hostname, it'll assert the end of the page.

Latest revision as of 17:29, 11 January 2023

For some thoughts about spambot hunting, see WikiProject:Junking bots ...didn't know where to link it --Wolf | talk 07:05, 13 October 2007 (EDT)

Proposed spam control policy

Comments and corrections welcome.

There are three levels of spam control now in play here on WikiIndex, and this policy will serve as guidelines to address how each of them should be used.

Level 1 - LocalSettings.php

There is a regex filter in the LocalSettings.php file that is under the control of the site bureaucrats. This level blocks specific words, phrases, and HTML fragments that are commonly used by link spammers and vandals. It contains common curse and swear words, sex acts and symbols, body parts, most of the major drug and pharmacy names, commonly counterfeited brand names and products, and HTML fragments that are used to hide or mask link spam, vandalism, and graffiti.

These are the common denominators of 90% of spam and graffiti. The regex will match any of these items, and block the saving of an edit on a page that has any one of these items anywhere on it.

It is NOT necessary to block anything containing these words anywhere else.
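
As a rough illustration of how a Level 1 filter of this kind behaves, the following Python sketch joins a list of terms into one case-insensitive regex and rejects any text containing a match. The term list and the function name are hypothetical placeholders, not the actual contents of the WikiIndex filter, and the real filter lives in PHP configuration rather than Python.

```python
import re

# Hypothetical placeholder terms; the real Level 1 list is maintained
# by the site bureaucrats and is not reproduced here.
BLOCKED_TERMS = ["examplepharma", "fakewatches", "display:none"]

# One alternation of escaped literals, matched case-insensitively.
LEVEL1_FILTER = re.compile(
    "|".join(re.escape(term) for term in BLOCKED_TERMS),
    re.IGNORECASE,
)


def edit_is_blocked(page_text: str) -> bool:
    """Return True if any blocked term occurs anywhere in the page text."""
    return LEVEL1_FILTER.search(page_text) is not None
```

Because the whole page text is searched, a single matching term anywhere on the page is enough to stop the save, which is the behaviour described above.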

Level 2 - Wikimedia Meta-Wiki spam blacklist

Wikimedia Meta-Wiki maintains an active blacklist of link spam target sites and common phrases that are used in link spam. We are connected to this list, and every page save is checked against it for bad external links. The blacklist only checks URLs, not the complete content. The blacklist itself can be seen at Wikimedia spam blacklist, and additions to it can be suggested at the blacklist talk page. You can expect this list to contain most of the current major link spam offenders' targets. This list is very actively maintained, and should raise our blocking level to about 99.5% of link spam. Is this 99.5% figure still accurate? Sean, aka Hoof Hearted • Admin • talk2HH 02:45, 9 March 2013 (PST)

If you find a particular link is not being caught by Levels 1 or 2 you can:

  1. If it is link spam or graffiti that contains a word that should be in the level 1 list, submit it to the site administrators via a new message on their talk page;
  2. If it is a link that you think should be banned at all wikis, submit it to Wikimedia Meta-Wiki;
  3. If it is a link that you think should be banned from just WikiIndex, go to Level 3.

Level 3 - Local blacklist

We maintain a local blacklist at My spam blacklist. This is a protected page that Sysops can use to block offending link spam not caught by Level 1 and Level 2. There should be very few entries here, and NONE that contain the following:

  • Periods (full stop) '.' — periods, aka the 'full stop' have a special meaning in the regex syntax, and can cause the list to malfunction;
  • TLDs (top level domains) 'com, org, net' — these appear in virtually all United States URLs, and are also extensively used in many (though not all) countries around the world, so provide no value to the blocking mechanism;
  • 'http://www.' — the regex only checks valid URLs, so this is not necessary.
An example

If you want to block linking to http://www.MyBadWordSite.com you should only enter 'MyBadWordSite'.

If Level 1 or Level 2 already contains the 'bad word', the link is blocked already, so no entry is necessary; you would also be unable to save the list.
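
The three entry rules above can be checked mechanically before an entry is added. The Python sketch below is only an illustration under that assumption: the helper name `entry_problems` is hypothetical and is not part of MediaWiki or of the blacklist extension.

```python
from typing import List

# The TLDs named in the policy; entries should not include them.
DISALLOWED_TLDS = {"com", "org", "net"}


def entry_problems(entry: str) -> List[str]:
    """List the policy rules that a proposed local blacklist entry breaks."""
    lowered = entry.strip().lower()
    problems = []
    if "." in lowered:
        problems.append("contains a period")
    if any(label in DISALLOWED_TLDS for label in lowered.split(".")):
        problems.append("contains a TLD")
    if lowered.startswith(("http://", "https://", "www.")):
        problems.append("includes a URL prefix")
    return problems
```

With this check, `entry_problems("MyBadWordSite")` returns an empty list, while the full URL from the example breaks all three rules.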

Level 3.5 - CAPTCHA

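Now implemented. By default, CAPTCHAs are triggered on the following events:

  • New user registration;
  • Anonymous or IP editor edits that contain new external links;
  • Brute-force password cracking.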

Level 4 - Login to edit

We have not implemented login to edit. This means that we still allow anonymous editing by 'IP editors', due to our original community desire of being open to all for editing. Levels 1, 2, 3, and 3.5 should (hopefully) provide an adequate level of spam protection.

There are three fundamental 'login to edit' options at this level, though not all are available or implemented here on WikiIndex:

  1. Require a login to edit, with no further checks;
  2. Require a login to edit, and request a valid e-mail confirmation;
  3. Require a login to edit, and insist that a valid e-mail confirmation is completed before editing is allowed.

For new users who do wish to create a registered account here on WikiIndex, and then edit under their desired username, we currently require option 3 above.

Guidance for spam fighters

Ward Cunningham offered this advice for spam fighting, and keeping your sanity: "do the absolute minimum required to block each attack and the spammer will grow tired and leave" (paraphrasing). This is so true, because some folks can drive themselves crazy trying to think of a way to defeat all attacks in advance of their actually happening!

MediaWiki spam blacklist regex

According to the readme for the MediaWiki spam blacklist extension, internally a single giant regular expression is formed using the lines from the blacklist file as follows:

In simple terms:

  • Everything from a "#" character to the end of the line is a comment (and therefore is only for information or advice, and has no effect);
  • Every non-blank line is a regex fragment which will only match inside URLs (i.e., the active component of the blacklist).

Internally, a regex is formed which looks like this:

!http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si

A few notes about this format. It is not necessary to add 'www' to the start of hostnames; the regex is designed to match any sub-domain. Do not add patterns to your file which may run off the end of the URL, e.g. anything containing '.*'. Unlike in some similar systems, the line-end meta-character '$' will not assert the end of the hostname; it will assert the end of the page.
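
To make the construction concrete, here is a minimal Python sketch of the same idea: comments are stripped, non-blank lines become alternatives, and the result is wrapped in the `http://[a-z0-9\-.]*( ... )` pattern shown above. It is an approximation for illustration, not the extension's actual PHP code; the function name and the sample entries are hypothetical.

```python
import re
from typing import Optional

# Hypothetical sample list, following the comment and fragment rules above.
SAMPLE_BLACKLIST = """
# everything after a '#' is ignored
mybadwordsite      # no 'www', no TLD, no periods
another-spam-name
"""


def build_blacklist_regex(blacklist_text: str) -> Optional[re.Pattern]:
    """Join the non-comment lines into one combined, case-insensitive regex."""
    fragments = []
    for line in blacklist_text.splitlines():
        fragment = line.split("#", 1)[0].strip()  # drop comment and whitespace
        if fragment:
            fragments.append(fragment)
    if not fragments:
        return None  # an empty group would match every http:// URL
    # PHP's "!" delimiters are not needed in Python; the "i" modifier maps to
    # re.IGNORECASE, and the "S" modifier (an optimisation hint) is dropped.
    return re.compile(
        r"http://[a-z0-9\-.]*(" + "|".join(fragments) + ")",
        re.IGNORECASE,
    )


pattern = build_blacklist_regex(SAMPLE_BLACKLIST)
```

In this sketch the single entry `mybadwordsite` matches `http://www.MyBadWordSite.com` and any other sub-domain, but it does not match the bare word outside a URL, which is why entries need neither 'www' nor the 'http://' prefix.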

What happens if spam slips through automated systems

Please delete any spam that slips through the automated systems, and add the {{spammer}} tag on the spammer's user page. If a new page has been created containing purely spam, edit the page by deleting said spam, and highlight the page for deletion by adding the {{delete}} tag, ideally by using {{delete|spam}}. If you are a Sysop, please block the spammer in accordance with the WikiIndex:Blocking and banning policy.

External links

  • Manual: Combating spam (at MediaWiki.org)