Hunch

Large-scale distributed processing on the web

December 16 by Rasmus Andersson, tagged javascript, html5, web, dropular and essay, filed under software

Silly drawing illustrating the awesomeness of people and the internets

The title probably gives you goose bumps. No? It doesn’t? Maybe it should.

Imagine you have a lot of work to do, a lot of image processing work, like rescaling and cropping large numbers of pictures. Now think about the web as we know it, with web sites where people hang around for a few seconds now and then. Imagine each visitor were given a task to complete while reading your web site: for instance, download, rescale and crop a picture from somewhere on the web. It’s possible, my friend.

Applied

For the upcoming new version of Dropular we are going to make good use of this technology.

The basic concept of Dropular can be explained with this use-case:

  1. A user called kate drops a picture she just found somewhere on the web (sending its URL to the Dropular service).

  2. kate’s dropped image appears on the Dropular front page or in the global stream of pictures, as well as in other places throughout dropular.net.

  3. Another user — let’s call him john — visits Dropular and sees the picture dropped by kate.

At step 3 we display a smaller version of the original image along with some metadata like a title, link to the original source, and so on. The smaller version of the image will be created by the web browser of our imaginary user john. It only takes a split second and john will probably not notice anything.

Methods of processing

When it comes to image processing on the web at large, there are basically two (or three) types of methods one can employ:

  • Host-based processing
  • Client-based processing
  • A combination of the two

The host-based processing method has the upside of being performed in a controlled environment, so we can assure a certain level of quality and there are few, if any, trust issues. On the other hand, processing imagery can be a very resource-intensive task, requiring loads of hardware and/or time and, in most cases, bandwidth (sending and receiving the source and output images).

Client-based processing methods are employed by most desktop applications but, until now, by no web applications, basically because the technology has not been mature enough or even available.

Moreover, desktop applications in general only process trusted data (data available on your local computer) and only use the output of the processing themselves. My Photoshop program does not email you my cropped version of crazy-cat.png; if you want to crop that picture you do it yourself.

What we are trying to do is to marry the two methods, effectively performing processing only when needed and sharing the results among visitors.

Flow diagram

The problem with trust

So you’ve figured: the real problem with this shared distributed method is trust. What happens if a rogue user submits a bad picture? How can we trust the submitted outcome?

Ways to “work around” the trust problem:

  1. Only logged in users can submit.
  2. Race for Nth submission.
  3. Compare many with similarity threshold.

Method 1

Method 1 requires the submitting client to be verified (e.g. by means of username and password). The downside is a less powerful “grid” of clients performing passive processing, since only logged-in visitors contribute.

A product where the majority of users are logged in, or where the logged-in users are likely to trigger the task requests for most images, would probably benefit most from this solution.

Method 1 is probably the solution we will employ for Dropular.

Method 2

Many clients are given the same task and the Nth submission is picked. “Nth” might refer to a fixed, pre-defined number like “first response received” or “4th response received”, or it might be a random arbitrary number which changes between task contexts.

Using this method, a rogue user would need to put in a great deal of effort, which drastically lowers the probability of success (of messing things up). However, it comes at the cost of increased latency (N submissions must be received before we can start using the result, i.e. a pre-processed image). It also requires more complex and, above all, stateful backend (host) software.
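To make the bookkeeping concrete, here is a minimal sketch of the host-side state Method 2 needs. The class and method names are hypothetical and not taken from Dropular's backend; N is drawn at random per task unless given explicitly, and storage is a plain in-memory map.

```javascript
// Sketch of "race for Nth submission". All names are hypothetical.
class NthSubmissionRace {
  // maxN: upper bound for the randomly chosen N.
  // random: injectable source of randomness (defaults to Math.random).
  constructor(maxN = 5, random = Math.random) {
    this.maxN = maxN;
    this.random = random;
    this.tasks = new Map(); // taskId -> { n, count, done, result }
  }

  // Register a task. Unless n is given, it is picked at random in [1, maxN]
  // so that a rogue client cannot predict which submission will be used.
  open(taskId, n = 1 + Math.floor(this.random() * this.maxN)) {
    this.tasks.set(taskId, { n, count: 0, done: false, result: null });
  }

  // Record one submission. Returns the accepted result once the Nth
  // submission has arrived, and null while the race is still open.
  submit(taskId, payload) {
    const task = this.tasks.get(taskId);
    if (!task) throw new Error(`unknown task: ${taskId}`);
    if (task.done) return task.result; // later submissions are ignored
    task.count += 1;
    if (task.count === task.n) {
      task.done = true;
      task.result = payload;
    }
    return task.result;
  }
}
```

Until `submit` returns something other than null, the host would keep handing the same task out to new visitors, which is where the extra latency comes from.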

Method 3

This method works similarly to Method 2 in that we need to keep some kind of state in the host software. We request multiple submissions, compare the “outcomes” using some sort of similarity algorithm [1], identify the biggest cluster of commonality, pick one of the “outcomes” in that cluster and forget all the others.

Here, we require even more complex software running at the host. The upside is that rogue submissions will have a hard time making it through (assuming only one submission per internet origin is allowed to participate in each session, and that the comparison algorithm is good enough).
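As a rough illustration of Method 3, the sketch below treats every outcome as a flat array of pixel values (0 to 255) and uses mean absolute difference as the similarity measure. Both choices, the function names and the default threshold are assumptions made for the example; a real host would likely decode the submitted images first and might use a better metric. “Biggest cluster” is simplified here to “the outcome with the most neighbours within the threshold”.

```javascript
// Mean absolute difference between two equally sized pixel arrays.
function distance(a, b) {
  if (a.length !== b.length) return Infinity; // different dimensions never match
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return a.length === 0 ? 0 : sum / a.length;
}

// Returns the index of the outcome that the most submissions agree with
// (itself included), or -1 if fewer than minAgree submissions agree.
function pickByConsensus(outcomes, threshold = 8, minAgree = 2) {
  let best = -1;
  let bestCount = 0;
  outcomes.forEach((candidate, i) => {
    const count = outcomes.filter(
      (other) => distance(candidate, other) <= threshold
    ).length;
    if (count > bestCount) {
      best = i;
      bestCount = count;
    }
  });
  return bestCount >= minAgree ? best : -1;
}
```

A single rogue submission ends up alone in its own cluster and is never picked, as long as honest submissions outnumber colluding ones.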

In practice

As of last week, my Hunch stuff (aka “box of interesting stuff”) uses this technique of distributed image processing. It currently works by using a canvas element to perform the actual processing, then sending the resulting image data using a temporary hidden form. This currently works in Safari, Chrome and Firefox (possibly also Opera, but untested). For sad people with Internet Explorer (or other browsers), no processing or submission will be attempted.
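The client side of this could look roughly like the sketch below. It is not the code running on this site: the function names, the `image` field name and the hidden iframe target are assumptions. Only `fitWithin` is plain arithmetic; the other two functions need a browser. Note that `toDataURL` throws a security error if the source image comes from another origin without CORS headers.

```javascript
// Thumbnail dimensions that fit inside maxSide x maxSide, keeping the
// aspect ratio and never upscaling.
function fitWithin(width, height, maxSide) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// Browser only: draw a loaded <img> onto a canvas and return a PNG data URL.
function makeThumbnail(img, maxSide) {
  const size = fitWithin(img.naturalWidth, img.naturalHeight, maxSide);
  const canvas = document.createElement("canvas");
  canvas.width = size.width;
  canvas.height = size.height;
  canvas.getContext("2d").drawImage(img, 0, 0, size.width, size.height);
  return canvas.toDataURL("image/png");
}

// Browser only: post the data URL through a hidden form. Pointing the form
// at a hidden iframe (iframeName) keeps the visible page from navigating.
function submitViaHiddenForm(action, iframeName, dataUrl) {
  const form = document.createElement("form");
  form.method = "POST";
  form.action = action;
  form.target = iframeName;
  form.style.display = "none";
  const field = document.createElement("input");
  field.type = "hidden";
  field.name = "image"; // placeholder field name
  field.value = dataUrl;
  form.appendChild(field);
  document.body.appendChild(form);
  form.submit();
}
```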

In the future

What more than image processing will we be able to distribute in the future? Already today we could hand out simple number-crunching tasks to clients in the same way, but what’s more alluring is the potential of distributing otherwise very expensive — or sometimes impossible — working sets. Data mining vast quantities of resources on the internet, anyone?


  1. In the case of image processing, each outcome might have totally different data (bits) even when the images look the same, since scaling and image encoding (e.g. JPEG and PNG) vary between browsers and platforms; thus we cannot use basic data comparison like checksums.

Easy data visualization with WebKit

November 26 by Rasmus Andersson, tagged cocui, spotify, visualization and real-time, filed under software

At Spotify, we recently put up two large TV screens on the walls of our Stockholm office (most R&D is done there). The idea is to visualize & communicate that “stuff is happening” without actually revealing any critical data (since a lot of external people are visiting the office).

Today Andreas Öman, Emil Hesslow and I, all Spotifiers, kicked off a cozy little Hack Night at the office, trying to create something simple yet impressive to have running on one of the TV screens.

We ended up writing a real-time search query visualization in just a few hours. It looks like this and is smoothly animated:

Screen shot

Try a demo version here… (Tested in Safari, iPhone, Firefox and Chrome).

How did we manage to build a real-time scalable system and high-performance viz in such an awfully short time?!

Hack night

Well, for starters we used WebKit through Cocui, which instantly gave us full screen high-performance hardware-accelerated drawing (yes, it’s a long sentence with cool words but those things shouldn’t be taken for granted).

But… where does the data come from? From the internets? Not really, but it sure travels in internets-style. We use a dumb pub/sub message queue. At one end a client (the WebKit/Cocui app in HTML/JavaScript) is listening (subscribing). At the other end one of our search servers is pushing messages into the queue in batches.

[batch of search queries during last minute]
                   ||
                   ||
                   \/
             [message queue]
              ||        ||
              ||        ||
              \/        \/
           [client]  [client]

The client simply enqueues these search queries when they are delivered while at the same time dequeueing search queries. We do it this way (batches and a queue) because we simply have too high a rate of searches. It would be almost impossible to read anything if we actually sent every single message. To give the feeling of real time we use a random delay when dequeueing queries.
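The enqueue/dequeue loop can be sketched as follows. This is not the code from the hack night: the class name, the delay bounds and the injectable `schedule` and `random` options (there so the loop can be driven without real timers) are all assumptions.

```javascript
// Sketch of a client-side buffer that receives queries in batches and
// releases them one at a time after a random delay.
class QueryTicker {
  // show: callback invoked for each dequeued query.
  constructor(show, options = {}) {
    this.show = show;
    this.minDelay = options.minDelay ?? 100; // milliseconds
    this.maxDelay = options.maxDelay ?? 1500; // milliseconds
    this.random = options.random ?? Math.random;
    // Wrapped in an arrow function so setTimeout is never called as a method.
    this.schedule = options.schedule ?? ((fn, ms) => setTimeout(fn, ms));
    this.queue = [];
    this.running = false;
  }

  // Called whenever a batch arrives from the message queue.
  enqueueBatch(queries) {
    this.queue.push(...queries);
    if (!this.running) {
      this.running = true;
      this.scheduleNext();
    }
  }

  // Arm the next dequeue after a random delay between minDelay and maxDelay.
  scheduleNext() {
    const delay =
      this.minDelay + this.random() * (this.maxDelay - this.minDelay);
    this.schedule(() => this.tick(), delay);
  }

  // Dequeue one query and display it; stop when the queue runs dry.
  tick() {
    if (this.queue.length === 0) {
      this.running = false;
      return;
    }
    this.show(this.queue.shift());
    this.scheduleNext();
  }
}
```

In the browser, `show` would be the function that builds the HTML for a query and prepends it to the page.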

When a query is dequeued it’s formatted into a chunk of HTML (a div and an a tag) and then prepended to the body using jQuery with animated effects.

Don’t forget to try the demo version (does not work in Internet Explorer), which is simply the client fed with static data instead of a real-time stream. The real-time data stream is only available within our office and thus not available for public use.

Update: Successfully tested with iPhone, Firefox and Chrome — thank you readers!

Here’s the full Cocui demo application: sptv1-demo-cocui-app.zip (for Mac OS X 10.5 and newer).

Take a screenshot, paste the URL

November 21 by Rasmus Andersson, tagged scrup, osx, application and open source, filed under software

Scrup icon

I’m a big fan of integrated, non-intrusive, productivity-enhancing applications. One category which is especially useful for me is the automatic publishing of screenshots, making conversations about looks and state so much easier.

Mr Bulgur: The label of the “More” button looks totally skewed.
Jean-Claude: Which button?! You mean the home one? Looks good for me on Windows.
Mr Bulgur: See, it looks like the v-centering algo is broken on OS X: http://hunch.se/s/8y/9sd0h2fcow8gs.png
Jean-Claude: Ah! Yes, I’ll fix it in a blink of an eye.

I once purchased a license for Grab Up but the team bailed on us when the software broke. Moved on to TinyGrab but it’s too slow and often not working.

Since this functionality is rather trivial, I looked around to see if someone had written an open source version which I could simply adjust to my needs. None found. So I wrote one myself: Scrup.

Scrup is a simple little OS X application, or system plug-in, which sits in your menu bar:

Scrup in the menu bar

When you take a screenshot, Scrup sends it to a web server of your choice. The web server then does something with the image (saves it, doh!) and returns a URL to the new image. That URL is then placed in your pasteboard, ready to be pasted somewhere. Scrup also keeps a list of the most recent scrups in its menu, for easy access at a later date.

Continue reading...

Improving the Spotify installation experience

October 24 by Rasmus Andersson, tagged spotify, mac and ux, filed under software

At the time of writing this, we distribute Spotify for Mac OS X as a regular DMG (disk image). The user experience is not really what I would call smooth:

  • Download the DMG file.
  • Open the DMG (implicitly mounting the disk image. Safari does this for you, BTW).
  • Move Spotify to the Applications folder.

Now, for an inexperienced user, double-clicking the app icon inside the DMG feels like a natural action. It’s there, I’ll just open it then. Later, she restarts her computer, the DMG gets unmounted and “Hey, where’s Spotify?”.

Our solution is to use an internet-enabled disk image which automatically unpacks Spotify upon download. We then use some magic in the app to check if it was launched from another place than the Applications folder.

Continue reading...

Manage posts in Gitblog web admin UI

October 12 by Rasmus Andersson, tagged gitblog, ui and ux, filed under software

The latest version of Gitblog got a new posts manager, which has been inspired by the inbox of Google Mail.

As a post can be in several states (and versions) at the same time, I had to use multiple dimensions of visual cues: row colors, labels (“Draft”, the “scheduled” clock) and hierarchical rows for the case when there is a work version alongside a cached version.

  • Yellow marks a modified, uncommitted (but tracked) working copy with an older cached version live.
  • Green marks a scheduled, tracked post (which will appear live once its future publish date is reached).
  • Red marks a post which has been removed in the working stage, but is still tracked (previous version is still live).
  • Grey marks an uncommitted (untracked) post, a post which does not have a record in the repository.

To get this feature, simply perform a git pull in your gitblog directory:

cd path/to/my/blog/gitblog
git pull

Gitblog says it in Markdown

October 5 by Rasmus Andersson, tagged gitblog and markdown, filed under software

Yesterday Markdown support was introduced in a new version of Gitblog. Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format.

Simply give your post a file extension of .md, .mdown or .markdown and you’re done.

For example, have a look at this post from yesterday and compare it to its Markdown source: content/posts/2009/10-10k-comet-connections.md.

The Markdown Extra flavour is used, adding structures like tables and header id tags. There are also a few Gitblog-specific things, most notably the support for code blocks (syntax highlighting). The language can be explicitly specified using a shebang:

#!language
actual code...
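To illustrate the convention, the sketch below splits such a block into a language name and the remaining code. It is not Gitblog's own parser, whose exact rules are not shown here; the function name and the accepted characters in a language name are assumptions.

```javascript
// Split a code block into { language, code }. The language is null when the
// first line is not a "#!language" shebang, e.g. a real "#!/bin/sh" line.
function splitShebang(block) {
  const match = /^#![ \t]*([\w+-]+)[ \t]*\r?\n/.exec(block);
  if (!match) return { language: null, code: block };
  return {
    language: match[1].toLowerCase(),
    code: block.slice(match[0].length),
  };
}
```

With this rule a block starting with `#!python` is highlighted as Python and loses its first line, while a script that starts with an interpreter path is left untouched.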

To get this feature, simply perform a git pull in your gitblog directory:

cd path/to/my/blog/gitblog
git pull

10 000 comet connections

October 5 by Rasmus Andersson, tagged nginx, comet, http, push and pubsub, filed under software

Q: How well does the Nginx HTTP push module perform with 10 000 concurrent clients? (Ye olde C10k problem).

A: Very well. About 7 kB per client and practically zero CPU load.

This article describes how I performed the test, using three different hosts — my local computer, a Debian Linux server and a Mac OS X host simulating 10 000 clients.

Continue reading...

Comet/HTTP push with nginx

October 2 by Rasmus Andersson, tagged nginx, comet, http, push and pubsub, filed under software

One of the most cumbersome problems of implementing some kind of HTTP push a.k.a. Comet functionality is that the client (website) needs to be served from the same host and on the same port as the actual push (long-polling or multipart response) mechanism. Now, as we need to maintain a high number of concurrent client connections we cannot use traditional server-side applications like PHP or Ruby on Rails. Using PHP, for instance, would require one PHP process per client connection; main memory would quickly become saturated and we’d most likely hit some scary limit on the number of fds and processes the kernel handles without being a sad little kernel.

For a little project I’m involved in I needed to have the best of both worlds, and at a low price…

Continue reading...

How I wrote DroPub in two days

September 20 by Rasmus Andersson, tagged cocoa, programming and dropub, filed under software

Yesterday I wrote DroPub — a simple but powerful little OS X application which transparently handles file transfers “from the desktop”.

Even though it has a lot of features, has been tested, updates itself and so on, I only spent about two days on the whole project. For me, this is the essence of Cocoa.

DroPub is heavily based on NSOperations and uses a hierarchy model for structuring operations. NSOperation hierarchies are a powerful means of writing most types of “service” applications. The code can easily be followed by a Cocoa programmer, and the operating system frameworks and libraries can give really good performance.

Continue reading...

DroPub 1.0

September 20 by Rasmus Andersson, tagged osx, application and cocoa, filed under software

DroPub

I just released the first official version of DroPub. It’s a mute little OS X application which makes drop boxes and sending stuff to remote servers as simple as it gets: just put a file in a regular folder and you’re done.

Try it out — simply download, double click and create a folder.

DroPub uses SCP (secure copy over SSH), so you need to add your SSH key to the remote server in order for things to work. A future version will introduce storing of passwords in the Keychain.

It normally lives in the menu bar, but can live in the dock also.

Preferences

A virtually unlimited number of folders can be watched and configured.

Further reading: How I wrote DroPub in two days.