WikiIndex:Spam control policy: Difference between revisions

Revision as of 17:25, 19 February 2007

Proposed Spam Control Policy

Comments & Corrections Welcome

There are three levels of spam control now in play and this policy will address how each of them should be used. The fourth level is not yet implemented.

Level 1 - LocalSettings.php

There is a regex filter in the LocalSettings.php file, which is under the control of the site admin. This level blocks specific words, phrases and HTML fragments that are commonly used by link spammers and vandals. It contains common curse words, sex acts and symbols, body parts, most of the major drug names, and HTML fragments that are used to hide or mask link spam and graffiti.

These are the common denominators of 90% of spam and graffiti. The regex blocks the save of any page that contains even one of these items anywhere on it.

Because of this, it is NOT necessary to block anything containing these words at any other level.
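
As a rough illustration of how a filter of this kind behaves, here is a minimal Python sketch. It is not the actual LocalSettings.php configuration (in MediaWiki a filter like this is typically set through `$wgSpamRegex`, though whether WikiIndex uses exactly that setting is an assumption). The terms in `BLOCKED_TERMS` and the function name `save_allowed` are hypothetical placeholders.

```python
import re

# Hypothetical placeholder terms; the real list is maintained by the
# site admin in LocalSettings.php and is not reproduced here.
BLOCKED_TERMS = ["badword", "display:none", "overflow:auto"]

# One alternation: the pattern matches if any single term appears
# anywhere in the text, regardless of case.
SPAM_REGEX = re.compile(
    "|".join(re.escape(term) for term in BLOCKED_TERMS),
    re.IGNORECASE,
)

def save_allowed(page_text: str) -> bool:
    """Return False if any blocked term occurs anywhere on the page."""
    return SPAM_REGEX.search(page_text) is None
```

A single hit anywhere on the page is enough to reject the save, which is why the same words do not need to be repeated in the other lists.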

Level 2 - WikiMedia Blacklist

WikiMedia maintains an active blacklist of link spam target sites and common phrases that are used in link spam. We are connected to this list, and every page save is checked against it for bad external links. The blacklist only checks URLs, not the complete page content. The blacklist itself can be seen at WikiMedia Blacklist, and additions to it can be suggested at the blacklist talk page. You can expect this list to contain most of the current major link spam offenders' targets. It is very actively maintained and should raise our blocking level to about 99.5% of link spam.

If you find a particular link is not being caught by Levels 1 or 2, you can:

  1. If it is link spam or graffiti that contains a word that should be in the Level 1 list, submit it to the site admin via email.
  2. If it is a link that you think should be banned at all wikis, submit it to WikiMedia.
  3. If it is a link that you think should be banned from just WikiIndex, add it at Level 3.

Level 3 - Local Blacklist

We maintain a local blacklist at My spam blacklist. This is a protected page that Sysops can use to block offending link spam not caught by Level 1 and Level 2. There should be very few entries here and NONE that contain the following:

  • Periods "." - Periods have a special meaning in the regex syntax and can cause the list to malfunction.
  • TLDs "com, org, net" - These appear in nearly all URLs, so they provide no value to the blocking mechanism.
  • "http://www." - The regex only checks valid URLs, so this is not necessary.

An Example:

If you want to block linking to http://www.mybadwordsite.com, you should enter only "mybadwordsite".

If Level 1 or Level 2 already contains "badword", the link is already blocked and no entry is necessary. You would not be able to save the list with that entry anyway.
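
The three entry rules above can be checked mechanically before an entry is added. The following Python sketch is only an illustration of those rules; the helper name `entry_problems` and its messages are hypothetical and are not part of any MediaWiki tooling.

```python
import re
from typing import List

def entry_problems(entry: str) -> List[str]:
    """List which of the three local-blacklist rules an entry breaks."""
    lowered = entry.strip().lower()
    problems = []
    # Rule 1: a period is a regex metacharacter and must not appear.
    if "." in lowered:
        problems.append("contains a period")
    # Rule 3: the scheme and "www." prefix are unnecessary.
    if lowered.startswith(("http://", "www.")):
        problems.append("starts with http:// or www.")
    # Rule 2: a bare TLD, or a trailing ".com", ".org" or ".net".
    if re.search(r"(^|\.)(com|org|net)$", lowered):
        problems.append("contains a TLD")
    return problems
```

For the example above, `entry_problems("mybadwordsite")` returns an empty list, while the full URL breaks all three rules.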

Level 4 - Login to Edit

Not implemented yet.

There are two options at this level:

  1. Require a login to edit and request an email confirmation.
  2. Require a login to edit and require an email confirmation before editing is allowed.

Guidance for Spam Fighters

Ward Cunningham gave me this advice for spam fighting and keeping your sanity: "do the absolute minimum required to block each attack and the spammer will grow tired and leave" (I'm paraphrasing). This is so true, because you can drive yourself crazy trying to think of a way to defeat all attacks before they actually happen!

Spam Blacklist Regex

According to the readme for the MediaWiki spam blacklist extension, a single giant regular expression is formed internally from the lines of the blacklist file.

In simple terms:

  • Everything from a "#" character to the end of the line is a comment
  • Every non-blank line is a regex fragment which will only match inside URLs

Internally, a regex is formed which looks like this:

   !http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si

A few notes about this format. It's not necessary to add www to the start of hostnames; the regex is designed to match any subdomain. Don't add patterns to your file which may run off the end of the URL, e.g. anything containing ".*". Unlike in some similar systems, the line-end metacharacter "$" will not assert the end of the hostname; it'll assert the end of the page.
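
To make the format concrete, here is a minimal Python sketch of the same idea: strip comments, skip blank lines, and join the remaining fragments into one alternation behind the `http://[a-z0-9\-.]*` prefix. It is an approximation for illustration, not the extension's own code. The function name `build_blacklist_regex` and the sample entries are hypothetical, and the PCRE `S` flag has no Python counterpart, so only the case-insensitive `i` flag is carried over.

```python
import re

def build_blacklist_regex(blacklist_text: str):
    """Combine blacklist lines into one URL-matching regex (or None)."""
    fragments = []
    for line in blacklist_text.splitlines():
        # Everything from "#" to the end of the line is a comment.
        fragment = line.split("#", 1)[0].strip()
        if fragment:  # skip blank and comment-only lines
            fragments.append(fragment)
    if not fragments:
        return None
    pattern = r"http://[a-z0-9\-.]*(" + "|".join(fragments) + ")"
    return re.compile(pattern, re.IGNORECASE)

# Hypothetical sample list, following the rules for local entries.
SAMPLE = """
# local blacklist
mybadwordsite    # no www, no TLD, no periods
another-spam-host
"""

regex = build_blacklist_regex(SAMPLE)

# Any subdomain and any case is matched, but only inside a URL.
hit_subdomain = regex.search("see http://www.MyBadWordSite.com/page")
miss_plain_text = regex.search("mybadwordsite mentioned without a link")

# "$" asserts the end of the whole text, not the end of the hostname.
dollar = build_blacklist_regex(r"spamhost\.com$")
miss_mid_page = dollar.search("http://spamhost.com is linked here")
hit_at_end = dollar.search("linked here: http://spamhost.com")
```

In this sketch `hit_subdomain` and `hit_at_end` are match objects, while `miss_plain_text` and `miss_mid_page` are `None`, which mirrors the notes about subdomains and the "$" metacharacter.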