Hunch

7

Large-scale distributed processing on the web

December 16 by Rasmus Andersson, tagged javascript, html5, web, dropular and essay, filed under software

Silly drawing illustrating the awesomeness of people and the internets The title probably gives you goose bumps. No? It doesn’t? Maybe it should.

Imagine you have a lot of work to do, a lot of image processing work, like rescaling and cropping large amounts of pictures. Now think about the web as we know it, with web sites where people hang around for a few seconds now and then. Imagine each visitor would be given a task to complete while reading your web site. Like for instance download, rescale and crop a picture from somewhere on the web. It’s possible my friend.

Applied

For the upcoming new version of Dropular we are going to make good use of this technology.

The basic concept of Dropular can be explained with this use-case:

  1. A user called kate drops a picture she just found somewhere on the web (sending its URL to the Dropular service).

  2. kates dropped image appears on the Dropular front page, or in the global stream of pictures as well as appears in other places throughout dropular.net.

  3. Another user — let’s call him john — visits Dropular and sees the picture dropped by kate.

At step 3 we display a smaller version of the original image along with some metadata like a title, link to the original source, and so on. The smaller version of the image will be created by our imaginary user johns web browser. It only takes a split second and john will probably not notice anything.

Methods of processing

When it comes to image processing on the web at large, there are basically two (or three) types of methods one can employ:

  • Host-based processing
  • Client-based processing
  • A combination of A and B

The host-based processing method has the upside of being performed in a controlled environment, thus we can assure a certain level of quality and there are few — if any — trust issues. On the other hand, processing imagery can be a very resource-intensive task requiring loads of hardware and/or time + in most cases bandwidth (sending and receiving the source and output images).

Client-based processing methods are employed by most desktop applications, but until today no web applications, basically because the technology is not yet mature enough or even available.

Moreover desktop applications in general does only perform processing on trusted data, data available in your local computer, and only uses the output of the processing itself. My Photoshop program does not email you my cropped version of crazy-cat.png — if you want to crop that picture you do it yourself.

What we are trying to do is to marry the two methods, effectively performing processing only when needed and sharing the results among visitors.

Flow diagram

The problem with trust

So you’ve figured: the real problem with this shared distributed method is trust. What happens if a rogue user submits a bad picture? How can we trust the submitted outcome?

Ways to “work around” the trust problem:

  1. Only logged in users can submit.
  2. Race for Nth submission.
  3. Compare many with similarity threshold.

Method 1

Method 1 requires the submitting client to be verified (e.g. by means of username and password). The downside being a less powerful “grid” of clients performing passive processing.

A product with the majority of users being logged in, or where the logged in users are probable to activate task requests of most images, would probably benefit most from this solution.

Method 1 is probably the solution we will employ for Dropular.

Method 2

Many clients are given the same task and the Nth submission is picked. “Nth” might refer to a fixed, pre-defined number like “first response received” or “4th response received”, or it might be a random arbitrary number which changes between task contexts.

Using this method it would require a great effort from a rogue users perspective, drastically lowering the probability of success (of messing up things). However, it comes with the cost of increased latency (N number of submissions must be sent in before we can start utilising the results, i.e. a pre-processed image). It also requires more complex and foremost stateful backend (host) software.

Method 3

This method works similar to Method 2 in that we need to keep some kind of state in the host software. We request multiple submissions and compare the “outcomes” using some sort of similarity algorithm1 and identify the biggest cluster of commonality, pick one of the “outcomes” in that cluster and forget all other “outcomes”.

Here, we require a even more complex software running at the host. The upside being that rogue submissions will have a hard time making it (assuming only one submission per internet origin is allowed to participate in each session, and that the comparison algorithm is sufficient).

In practice

As of last week, my Hunch stuff — aka “box of interesting stuff” — uses this technique of distributed image processing. It currently works by using a canvas element to perform the actual processing with, then sending the resulting image data using a temporary hidden form. This currently works in Safari, Chrome and Firefox (possibly also Opera, but untested) — for sad people with Internet Explorer (or other browsers), no processing or submission will be attempted.

In the future

What more than image processing will we be able to distribute in the future? Already today we could hand out simple number-crunching tasks to clients in the same way, but what’s more alluring is the potential of distributing otherwise very expensive — or sometimes impossible — working sets. Data mining vast quantities of resources on the internet, anyone?


  1. In the case of image processing, each outcome might have totally different data (bits) since most image compression algorithms (e.g. JPEG and PNG) introduce some level of randomness, thus we can not use basic data comparison like checksums. 

1

Hunch Aggregator

August 14, 2008 by Rasmus, tagged aggregation, essay, hunch, php, python, software and web, filed under software

I’ve never written about this piece of software before, but have gotten a few questions about it recently, so I thought I’d shed some light on things.

The Hunch Aggregator is a pretty simple thing – it keeps a central state of many addressable online services. Once upon a time, I created and managed most services myself, like hosting images, blogging, chatting, etc. Suddenly better services than those which I have written popped up – Flickr, Jaiku/Twitter, Facebook, Wordpress, Google Reader and so on. So like any pragmatic tech-savvy user would do, I distributed the tasks. Outsourced the pain in the ass of keeping things up to date and working. Now, a new problem arouse: I am one person, but to the outside, the myriad of services did express several different users. Different “persons” or identities. All I want is my friends to be able to hear me, not spend all their time surfing around this myriad of specialized websites.

So I came up with the simple idea of presenting my stuff as a singular stream of events, occurring over time. The first versions of the aggregation software was clumsy and hard to extend. It was unable to synchronize (only add new things) and the presentation was not very sexy. A few years later I blew off the dust from the idea and began from scratch. The result was what is now running on hunch.se.

Continue reading...

Music is Identity

June 9, 2008 by Rasmus, tagged essay, music, social and sociology

Silhouette of a personMusic has always been a way for people to communicate and create an identity. From the traditional and world music of tribes and alike to the modern multitude of genres and subgenres, many closely associated with social groups.

In a modern perspective, I think we sometimes forget about this, almost instinctive, way people want to identify themselves using music (among other things). One company and service which do promote using music as part of an identifier is Last.fm. Basically, you give Last.fm information on when and what you are listening to, they then makes this information available in a online community (or more precisely; the whole world wide web) in a way that makes your profile as an individual clearer and better defined. Last.fm also helps finding other people in the world sharing the same taste in music, and probably also share more stuff on other levels.

May it be connecting people by means of music, interests, taste of fine arts, ethic groups, business branches, schools or sports the effect is the same: Increased definition of the individual. People tend to being drawn to social groups, especially in early ages (teenages).

Music becomes a more important method of definition as the modern society blurs the borders of traditional groups like ethicality and alike. People travel all over the world and business do the same thing, spreading across nations and continent boundaries.

I would like to see more focus towards social identification of the individual, in the music-biased services of today and tomorrow. Easy access to technologies like web services/APIs and feeds can further enable expansion of the individuals’ social presence.