WARNING: Google has broken Javascript spam munging - Baxil [bakh-HEEL'], n.

June 22nd, 2009
06:06 pm


Cruising through my mailbox just now, I happened to glance at a piece of spam before deleting it, and did a double-take:

[Headers for spam to the address kawaii@tomorrowlands.org]

The reason I was startled is that the kawaii address (for feedback from my "Chibi Jesus" page) is one that I exempted from spam filtering about a year back. I chose an unused address at my domain, did not use it for any purpose or attach it to any outbound mail, and published it nowhere except for a single web page, where it was protected from address harvesting by the Javascript munging recommended by Project Honey Pot.

Here, as of a month ago when it wasn't being spammed, was the only reference (WARNING: HIDEOUS MIDI MUSIC) to that address on the Web:

// thx to http://www.blazonry.com/javascript/js_hiding.php
var rhs = "tomorrowlands";
var tld = "org";
var lhs = "kawaii";
function print_mail_to_link()
{
    // Emit the link in pieces so the full address never appears in the page source.
    document.write("<a href=\"mailto");
    document.write(":" + lhs + "@" + rhs + "." + tld + "\">");
    document.write(lhs + "@" + rhs + "." + tld + "<\/a>");
}

<b><script language="JavaScript" type="text/javascript">print_mail_to_link()</script></b>
<i>(An e-mail link here has been hidden in Javascript. If you have Javascript turned off, please use the contact form linked at the bottom of the page.)</i>

Should be pretty freakin' bulletproof, right? After all, as Project Honey Pot noted with no apparent sense of irony, "It should be noted that both of these techniques are likely to remain sound for some time to come. Harvesters that interpret the Javascript on every page they encounter would face a substantial risk of getting stuck in infinite loops or crashing due to malformed Javascript. ... This is likely beyond the current computing power of a legitimate company like Google."

The problem is that, if a legitimate company like Google does apply the computing power to it, the spammers don't have to expend the effort: they merely have to crawl the Google results.

And, alarmingly, this seems to be what has started to happen.

At first I thought that the address had been either guessed or else reposted somewhere, and I ran a Google search for kawaii@tomorrowlands.org in order to explore this. The only result to pop up was my own page, and the text summary of the page read:

Last major update Jun 17, 2002 ; last minor update Mar 22, 2007 . Scripting and design © 2000-2007 Tad 'Baxil' Ramspott. Chibi Jesus! kawaii@tomorrowlands.org.

The source code of the Google results page shows the address bare: "<em>kawaii@tomorrowlands.org</em>"

The source code of the Google-cached page is identical to mine (i.e. no raw address; the Javascript is preserved); the cache was taken May 12, 2009. It appears that the caching itself doesn't break the munging. There must be something about the excerpting process that does the trick.

At first I couldn't believe my eyes. Was this coincidence? I went through my recently deleted e-mail and rechecked all of the spam headers.

I have a similar whitelisted-and-munged address that I use only for WikkaWiki announcements and that has only been posted on my own wiki, protected similarly. It has also started receiving spam, and investigation turned up the same results. At this point the evidence is pretty damning.

The first spam I still have for kawaii was on June 8; it's likely that Google's behavior change dates from before then, and the spammers are only now beginning to take advantage of this new potential. The spam started slowly and is now up to several messages per day -- word is probably spreading amongst the bad guys.


Webmasters: Time to re-spamproof your site. A damn useful tool has just dropped out of the toolbox.

PLEASE NOTE: I have disabled the e-mail address referred to by this post. To contact me regarding this post, please write to [the first three letters of this journal name] [the dash symbol, '-'] [mail] [at-sign] tomorrowlands [dot.] org, or leave a comment below.

UPDATE: Two pieces of additional information I'd like to pull out from comments:

1. Even though the sample search I provided was for the compromised e-mail address, the spammer does NOT need to previously know your e-mail address in order to Google it. They just have to search for things shaped like e-mail addresses and skim the cream of the results. [*]

2. There is anecdotal evidence that pages which pull their decode function from a separate .js file have not been broken. (Yet.)
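
To illustrate point 1, here is a minimal sketch of how little work "searching for things shaped like e-mail addresses" takes once a result summary contains the bare address. The function name and the regular expression are illustrative assumptions, not any harvester's actual code, and the pattern is deliberately loose rather than a full address validator.

```javascript
// Hypothetical helper: pull anything shaped like an e-mail address out of text.
// The pattern is a loose approximation, not a full RFC 5322 parser.
function findEmailShapedStrings(text) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;
  return text.match(pattern) || [];
}

// Example input modeled on the search-result summary quoted in the post.
const snippet =
  "Scripting and design © 2000-2007 Tad 'Baxil' Ramspott. Chibi Jesus! kawaii@tomorrowlands.org.";
console.log(findEmailShapedStrings(snippet));
```

The trailing period of the sentence is not captured, because the pattern requires at least one character after each dot in the domain.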

UPDATE 2: Welcome to /. readers! More discussion in the Slashdot thread.

Current Location: ~spiral
Current Music: Jim's Big Ego, "WTFMFWTFAYT?"

(32 comments)

Date:June 23rd, 2009 02:09 am (UTC)
... oh Google. How I want to love you, and how, sometimes, you fail so very hard.

This reminds me of that meta-Google search engine that you wrote about in the T-lands universe, though. ;)

And also reminds me of Chibi Jesus. XD Good times!
Date:June 23rd, 2009 02:20 am (UTC)
Yeah, Google gets closer and closer to DWIM every day.

In a world with magitechnology, DWIM would be abused just as hard by spammers as Google is today. On the other hand, in a world with magitechnology, every time that spam arrived in a mage's mailbox, someone's server would catch on fire.
Date:June 23rd, 2009 02:15 am (UTC)
Gah, spam - scourge of the universe!
Date:June 23rd, 2009 02:40 am (UTC)
That led to some fascinating Google searches. I hope that we can find out more about the matter. What else have you checked to see how Google's behavior has changed - have they made any announcements that seem relevant?

Favorite techniques after searching: changing the code direction and the more effort-intensive but accessible mod_rewrite and PHP/JS trickery recommended by A List Apart. If I were a better programmer, I'd try to implement the latter in Python/Pylons, since I have a vague idea of how it could be done.
Date:June 23rd, 2009 09:46 pm (UTC)
I haven't seen anything from Google, and I'm not certain how to contact them about this sort of issue (it doesn't seem like the sort of thing to report to their security@ bounce), but now that the story has hit /. hopefully answers might be forthcoming.

Incidentally, as a counterpoint to the links you provided above, I followed a comment in one of those pages to http://jasonpriem.com/2009/05/stop-obfuscating-email/ . I disagree with its main argument, but its point about the poor security of email obfuscation is well taken.
Date:June 23rd, 2009 05:07 am (UTC)
I sent off a message linking to this here post to my webmaster and sysad. And I have a kitty here asking me to go to bed, and so I shall...
Date:June 23rd, 2009 05:29 am (UTC)
It's in Google's interest to be able to index simple document.write-generated HTML, since it's so common. They're definitely not doing full-blown Javascript execution. I munge email addresses on my own website by swapping every pair of letters and the Google snippet doesn't show the results of that. This also suggests that it's not simply a time-bounded execution, since my decoder takes almost no time to run.
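
A minimal sketch of the pair-swapping munge mentioned here, assuming the simplest reading: swap each adjacent pair of characters and leave a trailing odd character alone. The function name is hypothetical and this is not the commenter's actual decoder.

```javascript
// Hypothetical sketch: swap every adjacent pair of characters.
// Applying it twice restores the input, so one function encodes and decodes.
function swapPairs(text) {
  let out = "";
  for (let i = 0; i < text.length; i += 2) {
    // In an odd-length string the last character has no partner and is kept as-is.
    out += i + 1 < text.length ? text[i + 1] + text[i] : text[i];
  }
  return out;
}

const munged = swapPairs("user@example.org"); // placeholder address
console.log(munged);            // what would be stored in the page source
console.log(swapPairs(munged)); // what the decoder would write into the page
```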

Project Honey Pot's suggestion that a harvester would need a full-blown Javascript engine seems a little ridiculous. I bet you can get a lot of the way with a bounded-time non-Turing complete subset of Javascript.
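
To make the "bounded-time subset" idea concrete, here is a rough sketch of a resolver that only understands string-literal variable assignments and document.write() calls that concatenate literals and those variables. Everything in it (the function name, the regular expressions, the choice of subset) is an assumption for illustration; nothing here is known to be what Google or any harvester actually does.

```javascript
// Hypothetical sketch, not Google's or any harvester's actual code.
// It resolves document.write() calls whose arguments are only string literals
// and previously assigned string variables joined by "+". There are no loops
// or function calls to follow, so it always terminates.
function resolveSimpleWrites(source) {
  const STR = '"(?:[^"\\\\]|\\\\.)*"'; // regex source for a double-quoted literal
  // Simplified unescaping: \" becomes ", \/ becomes /; escapes like \n are not handled.
  const literalValue = (lit) => lit.slice(1, -1).replace(/\\(.)/g, "$1");

  // Pass 1: collect assignments of the form  var name = "literal";
  const vars = {};
  const assignment = new RegExp(
    "var\\s+([A-Za-z_$][\\w$]*)\\s*=\\s*(" + STR + ")\\s*;", "g");
  for (const m of source.matchAll(assignment)) {
    vars[m[1]] = literalValue(m[2]);
  }

  // Pass 2: evaluate each document.write(term + term + ...) in source order.
  const write = new RegExp("document\\.write\\(((?:" + STR + "|[^\"()])*)\\)", "g");
  const term = new RegExp("(" + STR + ")|([A-Za-z_$][\\w$]*)", "g");
  let output = "";
  for (const w of source.matchAll(write)) {
    const args = w[1];
    // Anything besides terms, "+", and whitespace is outside the subset: skip it.
    if (/[^+\s]/.test(args.replace(term, ""))) continue;
    let piece = "";
    let resolved = true;
    for (const t of args.matchAll(term)) {
      if (t[1] !== undefined) piece += literalValue(t[1]);
      else if (Object.prototype.hasOwnProperty.call(vars, t[2])) piece += vars[t[2]];
      else resolved = false; // unknown variable: give up on this call
    }
    if (resolved) output += piece;
  }
  return output;
}

// Input modeled on the script in the post, with a placeholder address.
const page = String.raw`
var rhs = "example";
var tld = "org";
var lhs = "user";
document.write("<a href=\"mailto");
document.write(":" + lhs + "@" + rhs + "." + tld + "\">");
document.write(lhs + "@" + rhs + "." + tld + "<\/a>");
`;
console.log(resolveSimpleWrites(page));
```

A decoder that calls a helper function, like the pair-swapping one, falls outside this subset and is simply skipped, which would be consistent with (though certainly not proof of) the behavior described above.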
Date:June 23rd, 2009 07:40 am (UTC)
Huh, that's actually even stranger: there's no substantial difference in our two routines except for yours calling a subfunction. I mean, ultimately they both boil down to taking a hardcoded string as input, tweaking the input in various ways and document.write'ing it. And yet, you're right, on your site Google hasn't seemed to pick it up. (Though your address is compromised in a hundred other ways ...)

I suppose this could also be taken as evidence that Google doesn't (currently) interpret pages based on items included by reference, since you call a separate .js file for your functions. That theory would benefit from a more rigorous test.
Date:June 23rd, 2009 06:35 pm (UTC)
You might want to try retrieving the email address with an XHR next - I doubt google will permit such a request, and so they will be unable to decode it.
Date:June 23rd, 2009 09:44 pm (UTC)


You could use robots.txt to block Google's access to your external javascript file or XHR request URL.
Date:June 23rd, 2009 07:18 pm (UTC)

Make them work for it

You might want to factor some very large numbers in your munging code there...
Date:June 23rd, 2009 10:31 pm (UTC)

Re: Make them work for it

While that does scratch the itch for spam vindictiveness, that seems to me to be doomed to failure. I do want the address to display to human visitors, and I would suspect that any delay large enough to deter automated processes would also serve as a source of annoyance or "This must be broken" for my meatspace visitors.
Date:June 23rd, 2009 07:26 pm (UTC)

Hang On

The only way the address is being displayed is if you search for the address itself. And if you already know the address it doesn't matter if it's plain text. Any other summary shown doesn't contain the address. I suspect something managed to harvest it, and you're just going after the easy target. Show me a google search for something that isn't the e-mail address that shows the address in plain text.
Date:June 23rd, 2009 07:59 pm (UTC)

Re: Hang On

If they are indeed running the Javascript then I would just change the code to only execute if it's run in a full browser (which, I assume, they are NOT doing.) Checking navigator.userAgent should make this relatively easy.
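
A rough sketch of that user-agent check, written as plain functions so the logic can be read on its own. The function names and the list of crawler tokens are assumptions for illustration, and the idea only helps if the indexer's script environment actually reports a crawler-like user agent; a user-agent string is trivially spoofed.

```javascript
// Hypothetical sketch: decide from a user-agent string whether to reveal the address.
// The crawler tokens below are illustrative assumptions, not a complete list,
// and a user-agent string can be spoofed, so this is a speed bump at best.
const CRAWLER_TOKENS = ["googlebot", "msnbot", "slurp", "spider", "crawler"];

function looksLikeCrawler(userAgent) {
  const ua = String(userAgent || "").toLowerCase();
  // Treat a missing or empty user agent as suspicious too.
  return ua === "" || CRAWLER_TOKENS.some((token) => ua.includes(token));
}

function mailLinkFor(userAgent, lhs, rhs, tld) {
  if (looksLikeCrawler(userAgent)) return "";
  const address = lhs + "@" + rhs + "." + tld;
  return '<a href="mailto:' + address + '">' + address + "</a>";
}

// Placeholder address and sample user-agent strings.
console.log(mailLinkFor("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", "user", "example", "org"));
console.log(mailLinkFor("Mozilla/5.0 (Windows; U; Windows NT 6.0) Firefox/3.0", "user", "example", "org"));
```

In a page, the call would look something like document.write(mailLinkFor(navigator.userAgent, lhs, rhs, tld)).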
Date:June 23rd, 2009 08:58 pm (UTC)
Google tries to block bots from access. The obvious end result of this is botnets querying Google- not to intentionally DDoS it, but for information harvesting, since Google's apparently the best tool to do the heavy lifting in the background.

A brief experiment doesn't show Bing, Ask, Yahoo, or AltaVista doing this.

(Full disclosure, for those who don't know- hi, /. p33pz!- I'm a full-time employee of Microsoft. This comment doesn't reflect any position or information official to Microsoft, and no representation I make about a Microsoft product herein is official or checked for accuracy. I do not work on the Bing team, have no non-public information about Bing, and wouldn't disclose it if I did.)
Date:June 23rd, 2009 09:40 pm (UTC)
Posted this in IM to Bax this morning, but maybe someone else will find it interesting:

        <style type="text/css">
            #mailme             { list-style-type: none; float: left; }
            #mailme li          { display: inline; float: right; }
        </style>
        <ul id="mailme">
            <!-- one <li> per character of the address, in reverse source order -->
        </ul>

It's not easy to hilite, and it's a mess if you copy it to the clipboard, but it's an interesting non-JavaScript way to post an address that would require some computational feats for spambots to interpret correctly.
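
For anyone who wants to try this, here is a small sketch that generates the list markup for the #mailme rules above. It assumes the trick works by putting each character in its own list item in reverse source order, so that float: right lays them out in reading order; the function names and the placeholder address are hypothetical.

```javascript
// Hypothetical generator for the float-right trick: each character goes in its
// own <li>, emitted in reverse order, so that "float: right" puts the characters
// back in reading order on screen. Assumes the #mailme CSS rules shown above.
function escapeHtml(ch) {
  return ch.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function reversedListMarkup(address) {
  const items = Array.from(address)
    .reverse()
    .map((ch) => "<li>" + escapeHtml(ch) + "</li>")
    .join("");
  return '<ul id="mailme">' + items + "</ul>";
}

console.log(reversedListMarkup("user@example.org")); // placeholder address
```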
Date:June 24th, 2009 12:55 am (UTC)
This is much simpler and works just as well:

<div style="font-family: 'courier new', courier, monospace; line-height: 0px;">
z r y i h g a l c m<br>
&nbsp;c a f s @ m i . o </div>


z r y i h g a l c m
 c a f s @ m i . o
Date:June 24th, 2009 09:38 am (UTC)
Hi, I work at Google and help out in our Webmaster Help forums. In general, we work hard to find, index and make available content that we find on the web -- this includes innovations and experiments in recognizing information which is rendered via Flash or JavaScript, in PDF files and possibly in other documents that contain indexable content. If you wish to prevent the crawling and indexing of content, it would be best to use methods that explicitly disallow access to that content (for example through a robots.txt file or a robots meta tag on the page itself).
Date:June 24th, 2009 05:05 pm (UTC)


Seems like you've been relying on an unreliable way to protect your email address. It's certainly foreseeable that some company might run the javascript. Probably the biggest reason for doing this is not to intentionally expose email addresses, but instead that since many pages have now turned to AJAX for basic text display, Google must process the javascript in the AJAX to correctly index and allow searching on the results.