Tweaking SpamAssassin Works

January 18th, 2005

A while back I mentioned that SpamAssassin wasn’t filtering my spam too well anymore and that I suspected it might be because the Bayesian filter had been neutered.  Apparently I was right.  Since I upped the score for BAYES_99 the amount of spam reaching my inbox has dropped dramatically.  I’m back down to one or two spam messages getting through per week instead of four or five a day.
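For anyone wondering why one rule makes such a difference: SpamAssassin adds up the scores of every rule a message hits and flags the message once the total reaches the required score (5.0 by default).  A rule's score is overridden with a line of the form "score BAYES_99 <value>" in local.cf or user_prefs.  The sketch below only illustrates that arithmetic.  The rule hits and score values are assumptions made up for the example, not real defaults or anyone's actual settings.

```python
# Illustrative sketch of SpamAssassin-style scoring: a message is flagged
# when the sum of the scores of the rules it hits reaches the threshold.
# All rule scores below are assumed values for the example.

REQUIRED_SCORE = 5.0  # SpamAssassin's default required_score


def is_spam(rules_hit, scores, required=REQUIRED_SCORE):
    """Return True when the summed rule scores reach the threshold."""
    total = sum(scores[rule] for rule in rules_hit)
    return total >= required


# Hypothetical message that hits the Bayes rule and one weak rule.
rules_hit = ["BAYES_99", "HTML_MESSAGE"]

low_bayes = {"BAYES_99": 1.9, "HTML_MESSAGE": 0.1}   # Bayes rule scored low
high_bayes = {"BAYES_99": 5.0, "HTML_MESSAGE": 0.1}  # Bayes rule raised

print(is_spam(rules_hit, low_bayes))   # False: total is under 5.0
print(is_spam(rules_hit, high_bayes))  # True: total reaches 5.0
```

With the low score, a message the Bayesian filter is almost certain about still needs several other rules to fire before it gets caught.  With the raised score, the Bayes hit is enough on its own.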

Very happy.

One Up For Technorati

January 16th, 2005

Technorati seem to have worked out much better ways for claiming your blog in their listings than Feedster - particularly the new auto-claim mechanism (of course you need to trust them a fair bit to use that).  With Feedster I have to paste this stupid link in a blog entry and annoy anyone subscribed to my feed or any of the planets that aggregate my feed with a pointless entry.  So Feedster, when are we going to get a less annoying method?

Playing With Technorati and PubSub

January 16th, 2005

I previously mentioned that I don’t get Technorati.  That post drew a bunch of good responses so I thought I should play with it more in light of those comments.  It seems that Technorati is best used for tracking news as it happens, so the release of the Mac Mini was the perfect test case.  To add to the fun I also ran the same search on PubSub to compare the results.

The search itself was something along the lines of "mac mini" OR "mini mac" OR "minimac" or "macmini" so it pretty comprehensively covered any reference to the Mac Mini.  Both engines seemed to understand the exact same query string which was nice considering search engines tend to have slightly varying formats for query logic.
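As a rough illustration of what that query asks for, here is a minimal sketch of an OR-of-phrases match.  It is not how either engine works internally (real engines tokenize and index rather than scan text).  The function name and the sample posts are hypothetical.

```python
# Hypothetical sketch of evaluating an OR-of-phrases query against a
# post's text using case-insensitive substring checks.

def matches_any(text, phrases):
    """Return True if any phrase appears in text, ignoring case."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


# The four phrases from the search described above.
QUERY = ["mac mini", "mini mac", "minimac", "macmini"]

print(matches_any("Just ordered a Mac Mini!", QUERY))    # True
print(matches_any("The MacMini looks tiny.", QUERY))     # True
print(matches_any("Thinking about a new iPod.", QUERY))  # False
```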

Overall, both services were somewhat useful - I got a ton of comments about the Mac Mini and a few of them were worth reading.  I’d say I got a reasonably good understanding of what bloggers thought about the Mac Mini.  Unfortunately, that understanding won’t translate at all to what matters for the success of the Mac Mini since the target audience for the Mac Mini typically don’t write blogs.  Still interesting to view the reactions though.

Improvement 1: Support Language Filtering

The big problem with both engines was that because my search was on a product name I got a lot of results in foreign languages.  There really needs to be a way to specify that I only want languages X, Y and Z.

Improvement 2: Fix Character Set Support

The second problem with both engines was that they made a total mess of any character set that wasn’t ASCII.  This problem is probably caused mostly by bloggers themselves tagging the character set incorrectly, and made worse by the fact that planet style aggregators are notorious for corrupting character sets.  So it’s not going to be easy to fix the problems but there needs to be a fix - it’s very annoying to receive a bunch of question marks as a search response.  I think this happened more with PubSub but it occurs with both engines.
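To make the failure concrete, here is a small sketch of two common ways non-ASCII text gets mangled.  I don't know which of these (if either) is what actually happens inside Technorati or PubSub, so treat it as an illustration of the general problem rather than a diagnosis.  The sample word is arbitrary.

```python
# Sketch of two common ways non-ASCII text gets mangled in feeds.

original = "café"
raw = original.encode("utf-8")  # the bytes as the blogging software wrote them

# 1. The bytes are UTF-8 but are decoded using the wrong charset label.
mislabeled = raw.decode("latin-1")
print(mislabeled)  # cafÃ©

# 2. The text is forced into ASCII, replacing anything that doesn't fit.
flattened = original.encode("ascii", errors="replace").decode("ascii")
print(flattened)  # caf?

# Decoding with the charset that was actually used round-trips cleanly.
print(raw.decode("utf-8") == original)  # True
```

The second case is where the question marks come from: once a character has been replaced the original is gone, so the only real fix is getting the charset right before that step.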

Improvement 3 (Technorati Only): Give Meaningful Summaries

There seem to be two key differences in the approaches taken by Technorati and PubSub - Technorati searches the entire page, PubSub only searches the RSS feed.  This means that Technorati can find matches that occur outside of the short snippet in the RSS feed but means that PubSub can provide the actual RSS feed entry as the result match (and know that it contains the search term) whereas Technorati try to create their own summary from the page.  Technorati’s algorithms for this are really woeful - I don’t recall ever getting a match from Technorati that I understood without opening it in a browser.  With PubSub on the other hand, most matches could be read right from my aggregator which makes life manageable when you have that many entries flying past.

Improvement 4: Provide Magic Pixies To Summarize Results

Basically, I was absolutely flooded with matches.  I’ve had to unsubscribe from the feeds so I could get some work done.  My aggregator updates feeds every 30 minutes or so and every time it updated for 2 or 3 days there would be another hundred updates to read about the Mac Mini.  Users could help mitigate this by subscribing to the search feeds in a second aggregator that didn’t grab their attention when new matches come through.  That still wouldn’t work well though because when you scanned it at the end of the day you’d have a few thousand results to go through.  Maybe Robert Scoble has time and interest to read that much but I certainly don’t.  I have no idea how this feature would work but I want it.

I suppose the ideal approach would be to parse all the matches using (as yet unheard of) natural language processing, work out the key points in the discussion and combine them.  Then you provide a report like:

I want one: 10000 entries

Not so cheap as you have to buy a monitor (side note: aka people who missed the point): 1000 entries

I hate Apple fanboys: 20000 entries

Each key point could come with a few choice examples and a link to go view the entire list of entries that mentioned that point.  I should get an update in my RSS feed for the report either whenever a new entry is discovered or when a new key point is discovered (user chooses which when they conduct the search).
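The counting half of that report is the easy part.  Here is a minimal sketch that stands in for the magic language processing with hand-picked trigger phrases, which is exactly the bit a real system couldn't get away with.  The key points, trigger phrases, function names and sample entries are all hypothetical.

```python
from collections import Counter

# Hypothetical key points and trigger phrases, chosen by hand.  A real
# system would have to discover these from the entries themselves.
KEY_POINTS = {
    "I want one": ["want one", "ordering one"],
    "Not so cheap as you have to buy a monitor": ["monitor"],
    "I hate Apple fanboys": ["fanboy"],
}


def key_points_for(entry):
    """Return the key points whose trigger phrases appear in the entry."""
    text = entry.lower()
    return [point for point, phrases in KEY_POINTS.items()
            if any(phrase in text for phrase in phrases)]


def build_report(entries):
    """Count how many entries mention each key point."""
    counts = Counter()
    for entry in entries:
        counts.update(key_points_for(entry))
    return counts


entries = [
    "I want one of these so badly",
    "Cheap? You still need a monitor and keyboard.",
    "Apple fanboys will buy anything. No monitor included either.",
]

# Print the report with the most-mentioned key points first.
for point, count in build_report(entries).most_common():
    print(f"{point}: {count} entries")
```

An entry can count towards more than one key point, which seems right for a report like this since plenty of posts make several points at once.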

Of course if I knew how to actually achieve this I’d write it and become an instant millionaire.  Maybe you just need to use collaborative categorization instead of a magical language processor to work out what the key points are.  Then again, the mainstream news has been happily providing these summaries anyway.

Conclusion

These systems would be very useful if it were my product that was being released or if the news item wasn’t the biggest event happening that absolutely everyone just had to comment on.  Both systems became unusable purely because they found too much stuff.  The best engine however was clearly PubSub because of its meaningful RSS feeds.  Technorati really need to fix their summaries as the lack of meaningful ones makes the entire system worthless.  Even with the low-volume searches I tried, Technorati drove me crazy with its useless matches (made worse by the fact that it returns a lot of false positives due to picking up links in sidebars etc despite the fact that the post had nothing to do with the search terms).

On the other hand, Technorati at least provides instant results instead of PubSub’s really weird approach of returning nothing up front and then getting back to you on your query an hour or so later (the length of time taken is exacerbated by the fact that aggregators only check for updates every so often).  Technorati then would be my choice of the systems if I wanted to know what people had already said about a topic but didn’t want to follow it over time (this would have been a more effective approach to the Mac Mini conversation), but in reality I’d probably just use Google.  PubSub is definitely better for watching conversations over time though.

Now do I have to submit this to Technorati’s feedback email address (and the equivalent that I assume PubSub have) that was mentioned in my last post or is their system good enough to find this and tell them about it for me…?

I Don’t Get Technorati

January 10th, 2005

I’ve been reading Scoble a lot lately and he’s so excitable about blogging and related technologies that I figured it was worth checking out some of them.  So I started playing with Technorati and so far I’m just not getting it.  It’s a far worse search engine than Google, it doesn’t pick up on changes to content particularly well, it requires people to go to a fair bit of effort to ensure that they do wind up in the index and most of its search results just point to stupid people ranting on their blogs (you know, like I’m doing now).

The worst part of it though is that it doesn’t seem to be subscribed to the RSS feeds, which would have made sense - it searches the contents of the HTML page.  This in turn means that it picks up on syndicated content as if it were actually being referred to and some comment made about it.  Then there’s the most cardinal sin - by the time you follow a match the content has disappeared off the page unless you’re following it all in real time.

It seems to me that I’m much better off just searching Google with the keywords and finding things that way rather than restricting my searches to the very small list of stuff Technorati knows about.

Update: Robert Scoble responded with some interesting comments, including a link to David Sifry’s blog (knowing him he’ll have found my post through Technorati).  Tim pointed out one of his recent success stories in the comments below and David Starkoff pointed me to http://tantek.com/log/2005/01.html on IRC.  David Sifry’s blog happens to have an entry that gives a good reason for Technorati searching the contents of the page:

Second, not all people who have RSS feeds have full-text feeds. Technorati actually indexes the full content of a post, not just the partial text that is often in the RSS feed.

Pretty obvious now that I think about it.  Hopefully they’ll find a way to ignore all the extraneous links and syndicated content around the sides and focus in on the useful stuff in the middle.  Of course it would be even better if everyone made an RSS feed available with full contents (even if they have a summary feed as well) but that seems like a whole new religious war.

I also have to play some more with Feedster and PubSub since they seem to be competing in the same space.  I find PubSub’s LinkRanks to be interesting but I wish there were RSS feeds for watching changes to specific URLs and changes within the top X rated sites etc (ideally X would be up to the user).

When The Going Gets Tough The Press Get All Whiney

January 9th, 2005

Robert Scoble points to an open letter from Jason McCabe Calacanis to Steve Jobs about Apple suing Think Secret.  It comes across to me as the epitome of childishness.  The key to it all comes down to this point:

If you want to sue someone sue your employees who send us the leaks, or your partners who tip us off. They are the ones who sign agreements with you not to talk—not us!

Which of course is precisely what Apple is trying to do.  Think Secret however won’t tell Apple who gave them the information and as such Apple is suing them in an attempt to find out those names and to prevent them publishing trade secrets (the press has no right to publish trade secrets).

If you want free press then the press has to act in good faith.  Knowingly publishing trade secrets and information that was illegally disclosed to them is not good faith.  Think Secret knew that their informant was breaching an NDA in revealing the information to them, they knew that the information was a trade secret and they knew that publishing that information was not in the interests of the greater good (the information about Apple releasing a sub-$500 PC is not beneficial to the greater good even if Apple actually releasing the sub-$500 PC is).  Even more importantly the information would have come out in time anyway direct from Apple, without any shred of doubt and without any breach of NDA.  There was no need for Think Secret to publish the report other than to generate revenue for Think Secret.

What worries me more though is that Jason is willing to call Think Secret "the press".  Think Secret is not a professional media organization, it is not responsible, it does not report news.  It reports rumors.  It doesn’t back up its stories with evidence and it is prepared to report rumors which can’t be confirmed.  Those are not the actions of a responsible news organization and they should not be encouraged.  It is not the behavior I expect from "the press" (though it is all too common).  Think Secret could do wonders for their reputation by simply naming their sources.  If they did, we would all be able to judge whether or not that person is in a position to know that information and those who are interested could track their previous record for information.  If they’ve provided false information in the past readers could disregard information that came from them.  However since Think Secret gets most of their information through illegal means (for the provider of the information not for Think Secret) they are compelled to refer to their sources as "a highly reliable source" or attribute quotes to "a source".

Do you want to live in a world where we don’t have a rabid press? Sure, the world does not revolve around gadgets, but the principles of a free press should be obvious to a rebel like you. Maybe you’re not a rebel any more, and maybe you listen to lawyers more then you listen to your heart.

I don’t want to have a rabid press.  Reporters foaming at the mouth do nothing to benefit society.  Responsible, accurate, verified reporting is useful to the public.  The media seems to have forgotten that in recent years.  These days the media isn’t about providing news to the public, it’s about providing sensationalistic headlines and beating up stories to grab readers’ attention and get a "scoop".  It’s sickening and every reporter should be ashamed of the state their industry is in.  I mean you know you’re in trouble when blogs are being considered a news source that rivals the mainline media - blogs are all about personal opinion, rants and unjustified rumors.  The mainline media should stand out head and shoulders above such a ruckus but they don’t.

Think of all the good the press has done for the world in righting big wrongs, and fighting for the every (hu)man.

I seem to remember the press vilifying teachers who took appropriate disciplinary action against misbehaving kids.  I seem to remember the press incorrectly branding an innocent man a pedophile.  I seem to remember the press doing a whole bunch of miracle weight loss cure stories.  I don’t recall the press ever doing anything in a "fight for the every (hu)man" but I do recall them doing anything they can to grab attention and improve their bottom line.

You need to learn to play nice in the sandbox or we’re going to go home, and I can tell you it’s no fun playing alone.

There’s a very simple, age old response to that: bye!  Face it, the media isn’t going to stop reporting on Apple - doing so would reduce their readership, cause them to miss out on attention grabbing headlines and in turn reduce their profit.  So if you want to go home you do just that - another news source will happily accept all your readers.  I’d say you should expect Apple to call you on that bluff too.  What are you going to do then?  Maybe you could try telling the teacher.