INFOWORLD GRIPE LINE BY ED FOSTER

 
Borderline searches and seizures | 19 comments (19 topical)

Turning off href (#6)
by Ed Foster on Mon Jun 30, 2008 at 11:42:07 AM PDT

The Japanese spammer (and, yes, it is Japanese -- I can read hiragana and katakana, the Japanese phonetic alphabets) is particularly pernicious. He seems to have found a way around my captcha and can post virtually at will. With considerable effort we recently eliminated thousands of his link spam comments, only to see him post over a thousand more in a short period Friday night. I am getting discouraged.

At LasVegan's suggestion, I've now turned off href= as allowable HTML in the comments. I don't know if that's going to stop him or any of the other link spammers because I don't know if they'll notice that their links no longer work. But we'll give it a try. If anyone has other suggestions, I'd love to hear them.

Ed



We appreciate (#7)
by Anonymous User on Mon Jun 30, 2008 at 11:54:13 AM PDT

...your efforts and the work you put into the site Ed.

Href= (#8)
by LasVegan on Tue Jul 01, 2008 at 06:43:04 AM PDT

Treat the presence of an href= the same as a captcha failure--don't post it, offer it back for them to fix. His script might keep trying but nothing will actually post.

Re: Href= (#10)
by Ed Foster on Tue Jul 01, 2008 at 01:28:33 PM PDT

Well, it turns out that it's not going to be that easy to do. When I went to post last night's story, the hyperlinks didn't work there either, so I had to turn href back on. Do any of my volunteers know Perl really well? Jeff and I could definitely use some help in figuring out how to fix some of these things. -- Ed

I think you're going at it backwards (#11)
by LasVegan on Wed Jul 02, 2008 at 08:03:31 AM PDT

Don't try to filter href= from the output--that would cause exactly what you describe. I'm saying to reject any post containing href=. Unfortunately, Perl isn't a language I've learned, so I can't help you implement it.
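
To make the idea concrete without knowing the site's Perl code, here is a minimal sketch in Python. The function name check_comment and the wording of the message are made up for illustration; the only part taken from the discussion is "refuse the post if it contains href= and hand it back".

```python
import re
from typing import Tuple

# Matches "href=" in any letter case, allowing stray whitespace around "=".
HREF_PATTERN = re.compile(r"href\s*=", re.IGNORECASE)


def check_comment(body: str) -> Tuple[bool, str]:
    """Return (accepted, message); hypothetical helper, not the site's real code.

    A rejected comment is meant to be shown back to the poster for fixing,
    the same way a captcha failure would be, and never stored.
    """
    if HREF_PATTERN.search(body):
        return False, "Links are not allowed in comments. Please remove them and resubmit."
    return True, "ok"
```

The check runs on the submitted comment only, so story text posted by the site's own authors would not pass through it.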

Yeah (#13)
by Anonymous User on Fri Jul 04, 2008 at 11:49:32 AM PDT

It needs to be limited to comment posting. Better yet, only to anonymous user posts and, perhaps, sufficiently-recent registrations (based on post-count rather than duration, even).

Then established users can still post functional links. If any of those abuse the privilege, you can simply yank it, or suspend the account, or whatever.

Trickier would be to still let (some) anonymous users post links. Disallowing link posting by anons from particular IP ranges, perhaps -- any time a spammer starts abusing an IP range, it gets added to the blacklist. Innocent people from that range can still post comments, but not ones with working links (at least not anonymously -- they still can if they register and make a few posts).
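
A rough sketch of that policy, in Python rather than the site's Perl. Everything named here is an assumption for illustration: the threshold of five posts, the function may_post_links, and the blocked range (203.0.113.0/24 is a reserved documentation range, used as a placeholder).

```python
import ipaddress
from typing import Optional

# Assumed threshold: registered users need this many posts before links work.
MIN_POSTS_FOR_LINKS = 5

# Placeholder blacklist; ranges would be added whenever a spammer abuses one.
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]


def may_post_links(ip: str, post_count: Optional[int], links_revoked: bool = False) -> bool:
    """Decide whether a poster gets working links.

    post_count is None for anonymous posters. Registered users are judged by
    post count (and can have the privilege yanked); anonymous posters are
    judged only by whether their IP falls in a blacklisted range.
    """
    if post_count is not None:
        return post_count >= MIN_POSTS_FOR_LINKS and not links_revoked
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in BLOCKED_RANGES)
```

Note that a registered user with enough posts keeps link privileges even from a blacklisted range, which matches the "they still can if they register" escape hatch.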

Another angle of attack is to figure out how the captcha might be circumvented. The obvious answer, that the guy's posting manually, is not applicable if he made over a thousand posts in only a few minutes.

Another, but it might not sit well with some people, is to silently reject foreign-language posts. Requiring posts to have a minimum of five English words from some dictionary would work for a while, especially if this comment got edited or deleted once you read it and it was unclear why the posts in question were failing. Eventually someone would guess and start adding random English-language junk to spams to get them by. Requiring the whole post to be fairly close to English-prose character-frequency statistics might work, for longer posts anyway.
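
The five-word test could look something like this in Python. The tiny word set is a stand-in for a real dictionary, the name has_enough_english is invented, and counting distinct words rather than repeats is my own choice.

```python
import re

# Tiny stand-in for "some dictionary"; a real list would be far larger.
COMMON_WORDS = {
    "the", "and", "is", "to", "of", "a", "in", "that",
    "it", "for", "this", "with", "not", "you",
}


def has_enough_english(body: str, minimum: int = 5) -> bool:
    """True if the post contains at least `minimum` distinct dictionary words."""
    words = set(re.findall(r"[a-z]+", body.lower()))
    return len(words & COMMON_WORDS) >= minimum
```

As noted above, this only holds until spammers start padding their posts with English filler.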

Changing the captcha code a bit might break the lozer's script without affecting anyone else.

Most of the above could be applied solely to posts with links -- even the captcha itself could be required solely on posts with links, for that matter. All of the posts we seek to prevent have links, after all.

I assume that this:

<input type="hidden" name="patch2007b" value="9.6d934cffda03be.17.57AjV7.5.7ddf4e908c0e110aac.17.ExknXR" />

is intended to detect scripts that just fill/change all form fields? Making it non-hidden and adding a (human-readable only!) note elsewhere on the page saying to leave the contents of that field be might help trip up such scripts -- the smarter ones will look for type="hidden" and ignore those fields, but might edit everything else.

Changing the name of the big textarea to something other than "comment" might trip up some of the stupider scripts. (Smarter ones will assume any non-hidden textarea is a comment field, and flood all of them with copies of the spam message.)

Best bet, though, may be simply to reject anything in which more than a certain proportion of the text is part of links. Most of the spew I've seen here fails a "less than 50% of the visible text is blue and underlined" test, and none of the legitimate comments do.
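
Here is one way the "how much of the visible text is link text" test might be measured, sketched in Python with the standard html.parser module. The class and function names are invented, whitespace is ignored when counting, and the 50% cutoff is the figure from the paragraph above.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible non-whitespace characters, inside and outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <a> elements we are currently inside
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self.depth > 0:
            self.link_chars += n


def link_fraction(body: str) -> float:
    parser = LinkTextCounter()
    parser.feed(body)
    parser.close()  # flush any buffered trailing text
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0


def looks_like_link_spam(body: str, max_fraction: float = 0.5) -> bool:
    """Fail the post unless less than max_fraction of its text is link text."""
    return link_fraction(body) >= max_fraction
```

A comment that mentions one link in passing scores well under the cutoff, while a post that is nothing but anchors scores 1.0.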

P.S. Is "9.6d934cffda03be.17.57AjV7.5.7ddf4e908c0e110aac.17.ExknXR" an encoded representation of "Yellow", by any chance? Or a hash of same?

If so, the captcha may be bypassed by anyone who manages to reverse-engineer the coding. In the case it's a hash, the bypass would be to change that form field to the hash of a known string, and put that known string in the visible captcha field (patch2007a), requiring only knowledge of the hash algorithm.

Better would be for the 2007b field (or the prior, "form key" field) to be temporarily stored in a database table with the correct answer, at the time the form and captcha question get generated at the server side; a submission has the submitted answer compared with whatever is stored with the key field in that DB table. The code is then not a hash or any other discernible function of the correct answer, but instead a magic cookie.
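
A sketch of the magic-cookie approach in Python, with a plain dictionary standing in for the database table. The names issue_challenge and verify are hypothetical, and making each token single-use is an extra assumption on my part.

```python
import hmac
import secrets

# Stand-in for the DB table: token -> correct answer.
_pending = {}


def issue_challenge(correct_answer: str) -> str:
    """Called when the form is generated; returns the value for the hidden field."""
    token = secrets.token_urlsafe(16)  # random, not derived from the answer
    _pending[token] = correct_answer
    return token


def verify(token: str, submitted: str) -> bool:
    """Called on submission; each token can be used once (an added assumption)."""
    expected = _pending.pop(token, None)
    if expected is None:
        return False  # unknown, forged, or already-used token
    return hmac.compare_digest(
        expected.strip().lower().encode(), submitted.strip().lower().encode()
    )
```

Because the token is random, knowing the hash algorithm no longer helps: there is nothing in the form to reverse-engineer.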

The other apparent way to beat the captcha is to brute-force it -- parse out the six possible answers and pick one at random, then try to submit. Out of 6000 attempts, 1000 or so will succeed. This can be blocked in a few ways:

  • Images instead of text, at least raising the bar to require OCR to parse out the possible answers.
  • Non-multiple-choice-format questions, such as math problems (ENGLISH-LANGUAGE WORD PROBLEMS or they could be fed to "calc.exe"!); they'd have to completely replace the current multiple-choice ones.
  • Automatic temporary IP bans, say for an hour, every time there are e.g. three wrong answers in a row from the same IP. That slows the flood from brute-forcing to a trickle -- you'll get one or two successful posts, then nothing for a blissful 60 minutes as the bot gets itself temp-banned, then one or two successful posts, then nothing. For extra added fun, ignore the last two octets in this, so three failed postings (all due to WRONG CAPTCHA ANSWERS, mind you) from IPs that all start with 183.36 will result in all posts from IPs that start with 183.36 being rejected for an hour. (Don't accept a post with an answer of "--fill me--", but don't count it as a "strike" either; perhaps only count a "strike" if the answer submitted is identical to one of the five "wrong" answers. This will make it virtually incapable of punishing a legit human, but still guaranteed effective against scripts. Relax "three strikes in a row" to "three strikes in a one-hour period" on top of that. Bye-bye bots -- the few that trickle through can be mopped up by hand due to their low rate, unless someone goes after you with a large distributed botnet, and even there, what would have been a tsunami becomes a mere flood.)
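
The temp-ban bullet above could be sketched like this in Python. The class StrikeTracker and its method names are invented; the numbers (three strikes, one hour, first two octets) come from the bullet, and the sketch uses the relaxed "three strikes in a one-hour period" rule.

```python
import time
from collections import defaultdict

WINDOW = 3600      # strikes older than an hour are forgotten
BAN = 3600         # length of the temporary ban, in seconds
MAX_STRIKES = 3


class StrikeTracker:
    def __init__(self):
        self.strikes = defaultdict(list)  # IP prefix -> timestamps of strikes
        self.banned_until = {}            # IP prefix -> time the ban ends

    @staticmethod
    def prefix(ip):
        """Ignore the last two octets: 183.36.1.2 -> '183.36'."""
        return ".".join(ip.split(".")[:2])

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        return self.banned_until.get(self.prefix(ip), 0) > now

    def record_answer(self, ip, answer, correct, wrong_choices, now=None):
        """Return True if the post should be accepted."""
        now = time.time() if now is None else now
        if self.is_banned(ip, now):
            return False
        if answer == correct:
            return True
        # Only an answer matching one of the listed wrong choices is a strike;
        # "--fill me--" or anything else is rejected without counting.
        if answer in wrong_choices:
            key = self.prefix(ip)
            recent = [t for t in self.strikes[key] if now - t < WINDOW] + [now]
            self.strikes[key] = recent
            if len(recent) >= MAX_STRIKES:
                self.banned_until[key] = now + BAN
                self.strikes[key] = []
        return False
```

While a prefix is banned, even a correct answer from that prefix is refused, which is what turns a brute-force flood into a trickle.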


Word problems (#14)
by Anonymous User on Fri Jul 04, 2008 at 12:23:11 PM PDT

Math word problems are a good idea, but they shouldn't be too naive or they will still fall to brute-force attacks: a script can parse out the numbers and guess the operations (perhaps based on words or phrases, e.g. "together" suggests addition, "how many more" suggests subtraction, etc.).

Here's a fairly good example. "John has five apples and eight oranges. Amy has three apples and two oranges. How many cores will be left over once they eat them all?" The answer is obviously eight, since that's the total number of apples and only the apples will result in cores. No script is likely to be able to solve problems resembling this one -- there's nothing easily parsed out of that to suggest which numbers to use of those supplied. The downside is that a fairly clever script might figure out that there are only a few likely right answers and try one at random, and a percentage will get through.

The ending, and right answer, can be changed in a few ways that machines would have difficulty identifying:

  • How many fruit do they have? 18
  • How many fruit that won't leave cores? Ten
  • How many fruit does the one with the fewest have? 5
  • If they divide what they have equally between them, how many cores will each leave behind? Four
  • Ditto, but how many peels? Five
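
The story and its alternative endings above can be generated with randomized numbers, roughly like this in Python. The function names are made up, and restricting the counts to small even numbers (so the "divide equally" questions come out whole) is my own simplification.

```python
import random

# Counts are spelled out as words, as in the example story.
WORDS = {2: "two", 4: "four", 6: "six", 8: "eight"}


def endings(john_apples, john_oranges, amy_apples, amy_oranges):
    """All (question, answer) pairs for one set of fruit counts."""
    apples = john_apples + amy_apples
    oranges = john_oranges + amy_oranges
    return [
        ("How many cores will be left over once they eat them all?", apples),
        ("How many fruit do they have?", apples + oranges),
        ("How many fruit won't leave cores?", oranges),
        ("How many fruit does the one with the fewest have?",
         min(john_apples + john_oranges, amy_apples + amy_oranges)),
        ("If they divide what they have equally between them, "
         "how many cores will each leave behind?", apples // 2),
        ("If they divide what they have equally between them, "
         "how many peels will each leave behind?", oranges // 2),
    ]


def make_problem(rng):
    """Return (problem_text, answer) with random counts and a random ending."""
    ja, jo, aa, ao = (rng.choice(list(WORDS)) for _ in range(4))
    story = (f"John has {WORDS[ja]} apples and {WORDS[jo]} oranges. "
             f"Amy has {WORDS[aa]} apples and {WORDS[ao]} oranges. ")
    question, answer = rng.choice(endings(ja, jo, aa, ao))
    return story + question, answer
```

Calling endings(5, 8, 3, 2) reproduces the answers given above: 8, 18, 10, 5, 4 and 5.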

This amount of possible variety makes the number of machine-guessable possible right answers fairly large, at least as large as the number of multiple choice answers now. It can be made larger by throwing in irrelevant details, e.g. "Amy is seventeen years old". Indeed, best is to have a small number of "stories", which each contain as many as ten different numbers, and for each one several possible questions, which use different subsets (two or three, typically) of those numbers, and whose answers range fairly widely, preferably into the low triple-digits.

Throw in random numbers, plus images that may include needed numbers (e.g. a girl in a soccer jersey with "56" on the front, and elsewhere an image that looks like text with the number 23, used as a number in the text), and make these large and clear, and you further confuse bots. They will need to use OCR, and a single image might show many potentially-significant numbers.

Picture a baseball scoreboard and a question that might ask how many runs the visiting team scored, in total or in a particular inning or even in a different game entirely. Suppose the text says "Pictured is the final score from their first game against one another. In the rematch, Dumont's Dudley scored their only run, in the third inning. What was Dumont's score?" The answer, one, may not even appear anywhere literally, the scoreboard image is completely irrelevant, and we've also got the number three occurring in the text where it's completely irrelevant. There might be a total of two dozen numbers, 20 of them (inning scores and final scores) in the image, plus four in the text, none of which are (or are used to compute) the genuine right answer. But of course the question could be "What inning contained Dudley's run" (3) or "What did Dumont score in the second inning" (zero) or something about the scoreboard image instead.

OCR software, I might add, might well choke on that image, perhaps reading "0001032017" instead of ten separate numbers in a fairly plausible failure mode.

Guessing, even "educated" guessing after parsing text and images, could be made to produce a very low hit-rate, under 1%, at least in principle.

In fact, simply asking text-answer questions about images might work wonders. Have a couple dozen stock images and a few hundred questions about them with easy, unambiguous answers that OCR mostly won't work to produce, and watch the bots bash their heads against a brick wall, even if you allow for a certain amount of sloppiness in the answers.

Especially combined with the three-strikes rule suggested by the previous poster.

Just watch for someone to develop a script that exploits the finite repertoire of questions if you do this instead of math problems with random components. I'd just wait until an apparently-automated spamming spree succeeds, then completely replace all of the images and questions, wait until it happens again, and repeat as needed; it should be infrequent enough that, over all, very few spams make it through per day on average and very little work per day is actually required on average.

Only presenting and requiring an answer to the captcha if there is an "href=" in the comment will further reduce any impact the captcha has on normal users, while making it a bit more awkward for a would-be spammer to catalogue all of the captcha questions that can occur (if the set is even finite).


Missing the obvious? (#15)
by Anonymous User on Fri Jul 04, 2008 at 12:41:16 PM PDT

The database key "magic cookie" idea is good, but needs a little extra to avoid slowly bloating up to one day become a ludicrous waste of disk space. Entries need to be deleted when a post is submitted (consult, then remove the entry). Also, people who bring up the form and then for whatever reason don't submit anything will still cause it to bloat up, more slowly, with unused entries. Adding a third field, a date, and purging every entry in that table older than 24 hours (every day at 2am or whatever) will get rid of those, without disturbing people who happen to be posting at the time (as their post's entry should be much younger than 24 hours).
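
For what it's worth, here is how that table and its cleanup might look, sketched in Python with an in-memory SQLite database. The table name captcha, the column names, and the helper functions are all assumptions; the real site would use whatever database it already has.

```python
import sqlite3
import time

DAY = 24 * 60 * 60


def open_store():
    """Create the three-column table: key, correct answer, creation time."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE captcha ("
        "token TEXT PRIMARY KEY, answer TEXT NOT NULL, created REAL NOT NULL)"
    )
    return db


def remember(db, token, answer, now=None):
    """Called when the form and captcha question are generated."""
    created = time.time() if now is None else now
    db.execute("INSERT INTO captcha VALUES (?, ?, ?)", (token, answer, created))


def consume(db, token):
    """Consult, then remove the entry. Returns the stored answer or None."""
    row = db.execute(
        "SELECT answer FROM captcha WHERE token = ?", (token,)
    ).fetchone()
    db.execute("DELETE FROM captcha WHERE token = ?", (token,))
    return row[0] if row else None


def purge_stale(db, now=None, max_age=DAY):
    """The nightly job: drop entries older than max_age; returns how many."""
    cutoff = (time.time() if now is None else now) - max_age
    return db.execute("DELETE FROM captcha WHERE created < ?", (cutoff,)).rowcount
```

The nightly purge could be run from cron or from whatever scheduler the site already uses.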

But all of this discussion may be missing the obvious.

Nobody has any business posting thousands of posts in the space of only an hour. Why not just enforce a posting volume limit per first-two-octets or first-three-octets IP block of, say, 20 posts in an hour? It's very unlikely that legitimate posters will hit this limit (and it could be waived for long-standing registered users with a history of legit posts, and/or applied solely to posts containing links, so only 20 containing links plus however many that don't per hour). Any posts beyond that either fail entirely or have their links stripped out or something, and maybe the IP range is blocked from posting for additional time. (Only do the latter if measures are taken to keep it from ever hitting a legitimate human: say, it only kicks in if you post 20 with links in one hour and are not on a whitelist of registered users, and on the last three posts towards the limit the form comes up with successively more dire warnings that the limit is being approached and not to post any links for a while, or else.)
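
A sketch of that volume limit in Python. The class name LinkPostLimiter and its parameters are invented; the defaults (20 link posts per hour per first-two-octets block, waived for trusted users) follow the paragraph above. The warnings and the extra ban time are left out.

```python
import time
from collections import defaultdict, deque


class LinkPostLimiter:
    """Sliding-window cap on link-bearing posts per IP block."""

    def __init__(self, limit=20, window=3600, octets=2):
        self.limit = limit
        self.window = window
        self.octets = octets               # 2 or 3 leading octets define a block
        self.history = defaultdict(deque)  # block -> times of accepted link posts

    def allow(self, ip, has_links, trusted=False, now=None):
        """Return True if the post may go through with its links intact."""
        if trusted or not has_links:
            return True  # whitelisted users and link-free posts are never limited
        now = time.time() if now is None else now
        block = ".".join(ip.split(".")[: self.octets])
        times = self.history[block]
        while times and now - times[0] >= self.window:
            times.popleft()  # forget posts that fell out of the window
        if len(times) >= self.limit:
            return False
        times.append(now)
        return True
```

What happens on a False result (fail the post, or strip its links) is left to the caller.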

Of course, that won't faze someone who commands a botnet, as someone else pointed out. Using a captcha that's difficult to guess (large number of possible right answers) whenever a post contains links will then reduce the volume somewhat.


