Baxil [bakh-HEEL'], n. - WARNING: Google has broken Javascript spam munging
06:06 pm
WARNING: Google has broken Javascript spam munging
Cruising through my mailbox just now, I happened to glance at a piece of spam before deleting it, and did a double-take:
![[Headers for spam to the address kawaii@tomorrowlands.org]](http://www.tomorrowlands.org/images/kawaii_spam.png)
The reason I was startled is that the kawaii address (for feedback from my "Chibi Jesus" page) is one that I exempted from spam filtering about a year back. I chose an unused address at my domain, did not use it for any purpose or attach it to any outbound mail, and published it nowhere except for a single web page, where it was protected from address harvesting by the Javascript munging recommended by Project Honey Pot.
Here, as of a month ago when it wasn't being spammed, was the only reference (WARNING: HIDEOUS MIDI MUSIC) to that address on the Web:
<SCRIPT LANGUAGE="JavaScript">
// thx to http://www.blazonry.com/javascript/js_hiding.php
var rhs = "tomorrowlands";
var tld = "org";
var lhs = "kawaii";
function print_mail_to_link() {
  document.write("<a href=\"mailto");
  document.write(":" + lhs + "@" + rhs + "." + tld + "\">");
  document.write(lhs + "@" + rhs + "." + tld + "<\/a>");
}
</SCRIPT>
<b><script language="JavaScript" type="text/javascript">print_mail_to_link()</script></b>
<noscript> <i>(An e-mail link here has been hidden in Javascript. If you have Javascript turned off, please use the contact form linked at the bottom of the page.)</i> </noscript>
Should be pretty freakin' bulletproof, right? After all, as Project Honey Pot noted with no apparent sense of irony, "It should be noted that both of these techniques are likely to remain sound for some time to come. Harvesters that interpret the Javascript on every page they encounter would face a substantial risk of getting stuck in infinite loops or crashing due to malformed Javascript. ... This is likely beyond the current computing power of a legitimate company like Google."
The problem is that, if a legitimate company like Google does apply the computing power to it, the spammers don't have to expend the effort: they merely have to crawl the Google results.
And, alarmingly, this seems to be what has started to happen.
At first I thought that the address had been either guessed or else reposted somewhere, and I ran a Google search for kawaii@tomorrowlands.org in order to explore this. The only result to pop up was my own page, and the text summary of the page read:

The source code of the Google results page shows the address bare: "<em>kawaii@tomorrowlands.org</em>"
The source code of the Google-cached page is identical to mine (i.e. no raw address; the Javascript is preserved); the cache was taken May 12, 2009. It appears that the caching itself doesn't break the munging. There must be something about the excerpting process that does the trick.
At first I couldn't believe my eyes. Was this coincidence? I went through my recently deleted e-mail and rechecked all of the spam headers.
I have a similar whitelisted-and-munged address I use only for WikkaWiki announcements, that has only been posted on my own wiki, protected similarly. It has also started receiving spam, and investigation turned up the same results. At this point the evidence is pretty damning.
The first spam I still have for kawaii was on June 8; it's likely that Google's behavior change dates from before then, and the spammers are only now beginning to take advantage of this new potential. The spam started slowly and is now up to several messages per day -- word is probably spreading amongst the bad guys.
So.
Webmasters: Time to re-spamproof your site. A damn useful tool has just dropped out of the toolbox.
PLEASE NOTE: I have disabled the e-mail address referred to by this post. To contact me regarding this post, please write to [the first three letters of this journal name] [the dash symbol, '-'] [mail] [at-sign] tomorrowlands [dot.] org, or leave a comment below.
UPDATE: Two pieces of additional information I'd like to pull out from comments:
1. Even though the sample search I provided was for the compromised e-mail address, the spammer does NOT need to previously know your e-mail address in order to Google it. They just have to search for things shaped like e-mail addresses and skim the cream of the results. [*]
2. There is anecdotal evidence that pages which pull their decode function from a separate .js file have not been broken. (Yet.)
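For anyone who wants to audit their own pages or search snippets, here is a minimal sketch of what "things shaped like e-mail addresses" might mean in practice. The pattern and the name `findAddressLikeStrings` are my own assumptions for illustration; the pattern is deliberately loose and is not a full address validator.

```javascript
// Deliberately loose pattern for "things shaped like e-mail addresses".
// It is an illustrative assumption, not a complete address grammar.
const ADDRESS_SHAPE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

// Return every address-shaped substring found in a piece of text,
// such as the HTML of a search results page or of your own page.
function findAddressLikeStrings(text) {
  return text.match(ADDRESS_SHAPE) || [];
}
```

Run against the munged page source it should come back empty, since the address never appears in one piece; run against a snippet that shows the address bare, it will not.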
UPDATE 2: Welcome to /. readers! More discussion in the Slashdot thread.
Current Location: ~spiral
Current Music: Jim's Big Ego, "WTFMFWTFAYT?"
Tags: geekery, my brain now hurts, privacy, technology
From: elynne
Date: June 23rd, 2009 02:09 am (UTC)
... oh Google. How I want to love you, and how, sometimes, you fail so very hard.
This reminds me of that meta-Google search engine that you wrote about in the T-lands universe, though. ;)
And also reminds me of Chibi Jesus. XD Good times!
From: baxil
Date: June 23rd, 2009 02:20 am (UTC)
Yeah, Google gets closer and closer to DWIM every day. In a world with magitechnology, DWIM would be abused just as hard by spammers as Google is today. On the other hand, in a world with magitechnology, every time that spam arrived in a mage's mailbox, someone's server would catch on fire.
"...a firestorm engulfed the entire southern hemisphere today..."
Gah, spam - scourge of the universe!
That led to some fascinating Google searches. I hope that we can find out more about the matter. What else have you checked to see how Google's behavior has changed - have they made any announcements that seem relevant? Favorite techniques after searching: changing the code direction and the more effort-intensive but accessible mod_rewrite and PHP/JS trickery recommended by A List Apart. If I were a better programmer, I'd try to implement the latter in Python/Pylons, since I have a vague idea of how it could be done.
From: baxil
Date: June 23rd, 2009 09:46 pm (UTC)
I haven't seen anything from Google, and I'm not certain how to contact them about this sort of issue (it doesn't seem like the sort of thing to report to their security@ bounce), but now that the story has hit /. hopefully answers might be forthcoming. Incidentally, as a counterpoint to the links you provided above, I followed a comment in one of those pages to http://jasonpriem.com/2009/05/stop-obfuscating-email/ . I disagree with its main argument, but its point about the poor security of email obfuscation is well taken.
I sent off a message linking to this here post to my webmaster and sysad. And I have a kitty here asking me to go to bed, and so I shall...
From: amthrax
Date: June 23rd, 2009 05:29 am (UTC)
It's in Google's interest to be able to index simple document.write-generated HTML, since it's so common. They're definitely not doing full-blown Javascript execution. I munge email addresses on my own website by swapping every pair of letters and the Google snippet doesn't show the results of that. This also suggests that it's not simply a time-bounded execution, since my decoder takes almost no time to run.
Project Honey Pot's suggestion that a harvester would need a full-blown Javascript engine seems a little ridiculous. I bet you can get a lot of the way with a bounded-time non-Turing complete subset of Javascript.
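To make the pair-swapping idea concrete, here is a minimal sketch of that kind of munging. It is not the actual code from the site mentioned above; the function name `swapPairs` and the sample address are assumptions for illustration.

```javascript
// Swap each adjacent pair of characters: "abcd" becomes "badc".
// Applying the function twice restores the original string, so the same
// routine works as both encoder and decoder. An odd trailing character
// is left where it is.
function swapPairs(text) {
  let out = "";
  for (let i = 0; i < text.length; i += 2) {
    out += i + 1 < text.length ? text[i + 1] + text[i] : text[i];
  }
  return out;
}

// Hypothetical address, for illustration only.
const munged = swapPairs("user@example.org");
const restored = swapPairs(munged);
```

The page would ship only the munged string and call the decoder at display time, for example via document.write(swapPairs(munged)).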
From: baxil
Date: June 23rd, 2009 07:40 am (UTC)
Huh, that's actually even stranger: there's no substantial difference in our two routines except for yours calling a subfunction. I mean, ultimately they both boil down to taking a hardcoded string as input, tweaking the input in various ways and document.write'ing it. And yet, you're right, on your site Google hasn't seemed to pick it up. (Though your address is compromised in a hundred other ways ...) I suppose this could also be taken as evidence that Google doesn't (currently) interpret pages based on items included by reference, since you call a separate .js file for your functions. That theory would benefit from a more rigorous test.
From: nexxcat
Date: June 23rd, 2009 07:23 pm (UTC)
I have a page I help maintain, and I also reference an external .js file that does the munging, and it also appears immune from Google's address harvesting.
From: baxil
Date: June 23rd, 2009 08:16 pm (UTC)
This is worth investigating as a stopgap measure. Thank you both.
From: soph
Date: June 23rd, 2009 10:34 pm (UTC)
I came here from Slashdot. I can also confirm that on one of the sites I host, an email address which is decoded by an external script file is *not* picked up in Google's excerpts.
Very interesting stuff, thank you for this post.
In modern CMS frameworks, libraries and support utilities are included through a script tag to an external javascript, and page essential javascript is run inline (document.write, onload handlers to trigger the writes in the function, &c.)
They're probably just picking the low-hanging fruit that's easily rendered with a full javascript engine, but minimal time to process (ie: none of the special effects from external libraries, or heavy handed processing). Plenty of benefit for a large chunk of sites.
Emails are going to be like SSNs soon. Just don't hand them out. Make a form that submits an email through the server (its own issues with spam then, but the email address doesn't get out...) or use natural language and make people work it out with their head :P (though texting any one of those pay-1c-for-a-real-person-to-answer-a-question services is easy and cheap -- 'tis how spammers get past some nagging signup processes that can't be automated just yet)
From: baxil
Date: June 23rd, 2009 10:17 pm (UTC)
The problem with forms is pretty much the same as the problem with munging: The full interface, along with everything that is needed for mail delivery, is presented to the client, and best practice for up-front (not filtering-based) spam prevention is to rely on the difference between programmed response and pattern matching. Solutions like CAPTCHA or exploiting the difference between human and bot clients are basically the state of the art for robust contact forms.
If we had more contact forms, there would be more advancements in form-breaking, for the same reasons you cite. So I don't think there's anything magical about forms as a solution here -- but if they provide better results, I'm willing to give them a try.
From: bdonlan
Date: June 23rd, 2009 06:35 pm (UTC)
You might want to try retrieving the email address with an XHR next - I doubt google will permit such a request, and so they will be unable to decode it.
You could use robots.txt to block Google's access to your external javascript file or XHR request URL.
From: baxil
Date: June 23rd, 2009 10:22 pm (UTC)
Subject: Re: Robots.txt
That seems like the best immediate measure to take. I'll also try putting an inline filter-free address on a robots.txt'd page in an effort to see if spammers can crack it without Goog backing them up.
Subject: Make them work for it
You might want to factor some very large numbers in your munging code there...
From: baxil
Date: June 23rd, 2009 10:31 pm (UTC)
Subject: Re: Make them work for it
While that does scratch the itch for spam vindictiveness, that seems to me to be doomed to failure. I do want the address to display to human visitors, and I would suspect that any delay large enough to deter automated processes would also serve as a source of annoyance or "This must be broken" for my meatspace visitors.
From: (Anonymous)
Date: June 23rd, 2009 07:26 pm (UTC)
Subject: Hang On
The only way the address is being displayed is if you search for the address itself. And if you already know the address it doesn't matter if it's plain text. Any other summary shown doesn't contain the address. I suspect something managed to harvest it, and you're just going after the easy target. Show me a google search for something that isn't the e-mail address that shows the address in plain text.
From: funaho
Date: June 23rd, 2009 07:59 pm (UTC)
Subject: Re: Hang On
If they are indeed running the Javascript then I would just change the code to only execute if it's run in a full browser (which, I assume, they are NOT doing.) Checking navigator.userAgent should make this relatively easy.
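A rough sketch of what such a check might look like, assuming the crawler announces itself in its user-agent string. The token list and the names `looksLikeDeclaredBot` and `mailLinkFor` are made up for illustration, and the user agent is a voluntary field, so this only filters out clients that choose to identify themselves.

```javascript
// Returns true when the user-agent string declares itself as a crawler.
// The token list is an illustrative assumption, not an exhaustive registry.
function looksLikeDeclaredBot(userAgent) {
  return /bot|crawler|spider|slurp/i.test(userAgent || "");
}

// Build the mailto link only for clients that do not announce themselves
// as bots; declared bots get an empty string instead of the address.
function mailLinkFor(userAgent, lhs, rhs, tld) {
  if (looksLikeDeclaredBot(userAgent)) {
    return "";
  }
  const address = lhs + "@" + rhs + "." + tld;
  return '<a href="mailto:' + address + '">' + address + "</a>";
}
```

In a page this would be called as document.write(mailLinkFor(navigator.userAgent, lhs, rhs, tld)).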
From: baxil
Date: June 23rd, 2009 08:13 pm (UTC)
Subject: Re: Hang On
navigator.userAgent can be trivially faked.
From: funaho
Date: June 23rd, 2009 08:37 pm (UTC)
Subject: Re: Hang On
Yes, but why would Google completely fake a user agent? They've always been good about identifying themselves as a bot somewhere in the user agent string.
From: baxil
Date: June 23rd, 2009 10:46 pm (UTC)
Subject: Re: Hang On
Oh, good point.
I think what I was thinking at the time of my reply was that, if Google has started doing this, other companies (and less savoury individuals) are probably not far behind. A better long-term solution would be to focus on a method of munging still not broken, rather than try to exclude special cases based on a voluntary field like that.
From: baxil
Date: June 23rd, 2009 08:12 pm (UTC)
Subject: Re: Hang On
> The only way the address is being displayed is if you search for the address itself.

Not entirely true. While the address does not display if you try to search for it indirectly -- "chibi jesus" "please send here *" actually returns the noscript alternate text for the address, which blows my mind completely -- a spammer can still get the address trivially without knowing it to begin with. Google has powerful wildcarding. Voila:
"* tomorrowlands org" site:tomorrowlands.org chibi
All you need to do is search for things shaped like e-mail addresses, optionally on the same domain that you're searching for addresses on. I added the "chibi" in to push the CJ page to the top of the results list but I'm sure it's in the complete listing as well.
From: kistaro
Date: June 23rd, 2009 08:58 pm (UTC)
Google tries to block bots from access. The obvious end result of this is botnets built to query Google -- not to intentionally DDoS it, but for information harvesting, since Google's the best tool to do the heavy lifting in the background, apparently.
A brief experiment doesn't show Bing, Ask, Yahoo, or AltaVista doing this.
(Full disclosure, for those who don't know- hi, /. p33pz!- I'm a full-time employee of Microsoft. This comment doesn't reflect any position or information official to Microsoft, and no representation I make about a Microsoft product herein is official or checked for accuracy. I do not work on the Bing team, have no non-public information about Bing, and wouldn't disclose it if I did.)
Posted this in IM to Bax this morning, but maybe someone else will find it interesting:
<html>
<head>
<style type="text/css">
#mailme { list-style-type: none; float: left; }
#mailme li { display: inline; float: right; }
</style>
</head>
<body>
<ul id="mailme">
<li>m</li> <li>o</li> <li>c</li> <li>.</li> <li>n</li> <li>i</li> <li>a</li> <li>m</li> <li>o</li> <li>d</li> <li>@</li> <li>l</li> <li>i</li> <li>a</li> <li>m</li> <li>e</li>
</ul>
</body>
</html>
It's not easy to hilite, and it's a mess if you copy it to the clipboard, but it's an interesting non-JavaScript way to post an address that would require some computational feats for spambots to interpret correctly.
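Typing the list items out backwards by hand is error-prone, so here is a small sketch of a helper that generates them. The name `toReversedListItems` is hypothetical, and it assumes the address contains no characters that need HTML escaping.

```javascript
// Emit one <li> per character, in reverse order, so that the float: right
// styling on the list items puts them back in reading order on screen.
// Assumes the address contains no HTML-special characters such as < or &.
function toReversedListItems(address) {
  return Array.from(address)
    .reverse()
    .map((ch) => "<li>" + ch + "</li>")
    .join(" ");
}

// Hypothetical address, for illustration only.
const items = toReversedListItems("email@domain.com");
```

The result would be pasted between the ul tags with id "mailme" from the markup above.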
From: (Anonymous)
Date: June 24th, 2009 12:55 am (UTC)
This is much simpler and works just as well:
<div style="font-family: 'courier new', courier, monospace; line-height: 0px;">
z r y i h g a l c m<br>
c a f s @ m i . o
</div>
Example:
z r y i h g a l c m
c a f s @ m i . o
Nice, although I specifically avoided any methods which used a 0 line-height or font-size or similar, because it's too easy to filter that out. Google for example has been watching for that kind of stuff due to SEO word salad abuse.
Even better, try this replacing the '.' with a &+#+46; and the @ with a &+#+64; (remove the plus signs)
:) Bud
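For reference, a minimal sketch of that substitution done in code rather than by hand. The function name `entityEncodeAddress` and the sample address are assumptions; note that any HTML parser turns these references back into the original characters, so this only hides the address from tools that read the raw source without decoding it.

```javascript
// Replace "@" and "." with their decimal HTML character references
// (&#64; and &#46;), leaving every other character untouched.
function entityEncodeAddress(address) {
  return address.replace(/[@.]/g, (ch) => "&#" + ch.charCodeAt(0) + ";");
}

// Hypothetical address, for illustration only.
const encoded = entityEncodeAddress("user@example.org");
```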
Hi, I work at Google and help out in our Webmaster Help forums. In general, we work hard to find, index and make available content that we find on the web -- this includes innovations and experiments in recognizing information which is rendered via Flash or JavaScript, in PDF files and possibly in other documents that contain indexable content. If you wish to prevent the crawling and indexing of content, it would be best to use methods that explicitly disallow access to that content (for example through a robots.txt file or a robots meta tag on the page itself).
From: (Anonymous)
Date: June 24th, 2009 05:05 pm (UTC)
Subject: Byproduct
Seems like you've been relying on an unreliable way to protect your email address. It's certainly foreseeable that some company might run the javascript. Probably the biggest reason for doing this is not to intentionally expose email addresses, but instead that since many pages have now turned to AJAX for basic text display, Google must process the javascript in the AJAX to correctly index and allow searching on the results.