Skip to content

The new XML

JSON is the new XML.

JSON, the “JavaScript Object Notation”, is a very simple notation for representing hierarchical data. It achieves what XML was supposed to do, and in a very simple way.

While the XML spec is hundreds (thousands?) of pages, the JSON spec fits comfortably on one page. Although the name suggest an association to JavaScript, JSON is in fact language-agnostic. There are JSON libraries for almost any language out there.

Sample JSON fragment:

{
"name":"John",
"tel":["+16051231234", "+48885003300"],
"address":{"street":"Dyamond 8", "city":"San Francisco"}
}

JSON has elementary values (strings, numbers; e.g. “foo”, 10), arrays marked by square brackets (e.g. [2, 4, 8]), and maps (also named ‘objects’) which contain key:value pairs between curly brackets (e.g. {“name”:”John”, “age”:”10″}).

Because of its simple and uniform structure, JSON is very simple to parse (this is a major advantage over XML). JSON is more compact then XML. JSON is more ‘pure’, more elegant than XML. And because of its ability of representing arbitrary complex hierarchical structures, JSON is just as powerful as XML.

While XML had all the hype, JSON is quietly getting the job done.

Follow-up: quite funny, only one day after I wrote this article, there was a similarly-themed post on CEO’s blog: JSON for data exchange (vs. XML).

Warszaw Marathon

Today I run, for the first time, the marathon. The marathon distance is 42.195 km. I finished it in 4:28 (i.e. a little under 4 hours and a half). Although I drank a lot of isotonic drinks and water (about 2 liters) during the course, at the end of the course I was weighting with 2.5kg less then at the start (perhaps lost water). Now my legs hurt and it’s a bit difficult to walk, but during the course I had no particular problems. Now I look forward to doing it again (with a better time?).

Today I run, for the first time, the marathon. The marathon distance is 26 miles 385 yards. I finished it in 4:28 (i.e. a little under 4 hours and a half). Although I drank a lot of isotonic drinks and water (about half a gallon) during the course, at the end of the course I was weighting with 5.5pounds less then at the start (perhaps lost water). Now my legs hurt and it’s a bit difficult to walk, but during the course I had no particular problems. Now I look forward to doing it again (with a better time?).

Simplicity is hard to get

I’ve just seen the new web application Jottit, and I liked the way they get simplicity right.

Because simplicity and elegance is not only in the way it looks, but also: in the size of the web pages, in the URL naming scheme, and most importantly in the interaction design.

About the web page size, do you know many sites with under-1KB size pages? To compare, what about sites with over-100KB size pages? Why would size matter — because more people are browsing the web on mobile phones; but also because simplicity and elegance mean minimalism, i.e. not to put on a page stuff that’s not needed. It doesn’t matter that the slack is not visible because it’s hidden inside the HTML source, having an unnecessarily large page is still a violation.

The URL naming scheme is of paramount importance, yet it’s surprising how much developers aren’t aware of this. I created a test page on Jottit, and it has the URL http://javia.jottit.com/. You can’t get the URL simpler than this. The URL must be short, semantic, and without arbitrary, meaningless parts in it. If you have to put some data key somewhere (in the URL or in a Cookie) make it short! Look at Jottit, when they put a page-code in the URL, it has 4 characters: http://jottit.com/xy4v/ .

Let’s compare: I was recently looking for where to host an open source project. Sourceforge.net is the venerable site, and I have respect for the good they did for many open-source projects, but the site IMO sucks. It is as complicated as it can get! I think they worked hard to put as much stuff as possible on as many pages as possible. And one of the smaller details: they also own the domain name sf.net, which only redirects to sourceforge.net. I can’t begin to understand why, having such an incredibly valuable domain name available, the 2-letter SF.net, they don’t use it. If my project was named let’s say “foo”, I’d love a domain name like “foo.sf.net”.

Google Code Project Hosting (I hope I got the official name right) is a breeze compared to Sourceforge, it is much simpler. Much better. I’d recommend GCPH (Google Code Project Hosting) to anybody over Sourceforge, if only for all the hassle you save yourself when creating a new project. Yet GCPH si far from perfect: they don’t get the URL naming scheme! Let’s consider a project named “menstral”, it will have this URL: http://code.google.com/p/menstral/, and the subversion repository URL http://menstral.googlecode.com/svn/. The problems with these URL names is: 2 different domain names (code.google.com and googlecode.com) inside a single project, and the redundant and arbitrary “p” in the project’s main URL. The URL for the project should have been http://menstral.googlecode.com.

The conclusion: bravo to Jottit for setting a new standarnd in simplicity and giving us an example. And I have a feature request for them: get HTTPS to work on all pages.

How to fix multi-tap

The problem

Let’s consider the classical ITU-T 12-key phone keypad:

  1 2 abc 3 def
4 ghi 5 jkl 6 mno
7pqrs
8 tuv 9 wxyz
  *
  0   #

There are two main ways to enter text using it (for example when writing an SMS): multi-tap, and predictive text input.

Predictive text has its own set of advantages and disadvantages, but I won’t discuss it here.

In multi-tap you push a button multiple times to select a letter from a given position — for example, to type ‘c’ you push the button containing the letters ‘abc’ three times. Multi-tap is popular and easy to understand, yet it’s very inefficient. The UTI-T letter layout (i.e. buttons with ‘abc’, ‘def’, and so no) is a lousy one for text input, and the explanation is historical: in the beginning, the phones had letters next to the numbers in order to allow to dial phone numbers that were presented as words (as in: Call 1-800-FLOWERS or 1-800-GOT-JUNK). So initially, the task was not to enter text, but to dial phone numbers presented as text. And it turns out that the letter layout that was chosen for that task is not at all efficient for text input.

The fix

First thing, let’s make use of the “1″ key — put letters on it to un-crowd the 4-letter keys (e.g. “7″ has “pqrs”) . We still have the three bottom keys (*, 0, #) to put space, symbols and digits switches.

Letter Frequency

We also use the fact that in a given language, the different letters are not equally important — some letters are used much more frequently than others. For example, in English, the top-9 letters (ETAOINSHR) account for 70% of letters used, while the bottom-9 letters (YPBVKJXQZ) account for less than 8%. The obvious thing to do is to place the most-frequently-used letters on the first position on the keys, so that they are easier to type.

One may observe that the letter frequency is language-specific. While this is obviously true, it also happens that there is a large amount of similarity in letter frequency among Latin-alphabet languages from as diverse families as Germanic (e.g. German, Dutch, English), Latin (e.g. French, Spanish, Romanian) and Slavic (e.g. Polish, Czech). Surprisingly, there is a great deal of overlap of the most frequent letters in all Latin-alphabet languages. This means that using letter frequency we can come up with a layout which is good not only for English, but for many languages (a similar situation exists with the Dvorak keyboard layout).

For example, doing an informal frequency-weighting between English, French, German, Spanish, and secondarily Polish and Romanian, I came up with these top-9 letters (in order): EATINSORD. What’s to note is that H, which is rather frequent in English but rare in all the other languages, has been dropped in favour of D, which is frequent in many languages, and not too bad in English either.

So, based on frequency, we categorize the 26 letters of the Latin alphabet in 3 groups: the top-9 which will be placed on the first position, the middle-9 which are placed on the second position, and the bottom-8 which are on the third position on the keys. These groups are:

Most frequent e a t i n s o r d
Middle h l u c m w g f p
Least frequent b y v k j x q z

Bigram Frequency

With multi-tap there is a problem when two successive letters (in the text) happen to be on the same key. In this situation usually the user has to makes a small pause before advancing to the next letter.

For example, to type “ba”, you push twice the “ABC” key, make a pause, and push once more the same key.

This pause is annoying and slows down typing, so it’s good to reduce the number of cases when it’s needed. This is accomplished by putting together on the same key letters that appear seldom in succession. For this we use the bigram frequency, which is the frequency (in a language) of two-letter pairs.

An additional optimization is to eliminate the ‘rotation’ when multi-tapping the same key. Once the last letter on a key is selected, and the user pushes once more the same key, the typing automatically moves to the next character instead of ‘wrapping around’ to the first letter on the key in a circular fashion. The benefit is that after the last key on a button there’s no need for the same-key pause.

Let’s consider a key with 3 letters, let’s name them A,B,C. They are on the key in this order, A-B-C. In this situation, the frequency of the same-key pause for this key is given by: f(wait)=f(AA)+f(AB)+f(AC)+f(BB)+f(BC), where f(AB) denotes the frequency of the bigram AB.

On each key we choose one letter from each of the three frequency categories (most-frequent, middle, least-frequent), as to minimize the same-key wait based of linguistic bigram frequency.

Physical placement

Once we have the groups of letters for each button, we attempt to optimize the physical placement to increase the ergonomy while typing. There are many factors involved here (e.g. how does the user type, one hand or two hands? thumb or index? etc). I attempted to minimize the amount of finger movement between successive letters (because it takes a longer time to move the finger a longer distance) — again by using bigram frequency. Also, I tried to make the layout memorable (the first letters form pronounceable words on rows), and to place the most frequent letters (EAT) on the easiest to reach keys.

The Layout

A tentative layout resulting from these considerations is below. While it may still benefit from some improvements, it’s already much better than the classical UTI-T layout.

Afq Tmk Iuy
Sl Ewj Rhx
Npb
Ogz Dcv
  *
space   #

I call this layout ATI SER NOD, from the first-letter (the most frequent one) read by rows.

Java class literal on CLDC

In Java, a class literal is a construct of the form String.class, which returns the Class object that corresponds to the named class. You can use this to test the exact class of an object, like this:

boolean isInteger(Object obj) {
  return obj.getClass() == Integer.class;
}

(A similar goal can be achieved by using instanceof, even though the semantics and the performance are not exactly the same.)

If you try to use the class literal on CLDC 1.0, it won’t work. You get at compile-time the rather funny error message

class file for java.lang.NoClassDefFoundError not found.

The explanation is that, up to and including Java 1.4, the class literal is emulated this way:

static Class class$java$lang$Integer;

static Class class$(String className) {
  try {
    return Class.forName(className);
  } catch (ClassNotFoundException e) {
    throw new NoClassDefFoundError();
  }
}

boolean isInteger(Object obj) {
  return obj.getClass() ==
    (class$java$lang$Integer != null ?
      class$java$lang$Integer :
      (class$java$lang$Integer = class$("java.lang.Integer")));
}

So the innocent-looking Integer.class is translated into a bunch of code, which can throw a new unchecked exception, NoClassDefFoundError. The problem is that CLDC 1.0 doesn’t provide this exception class, so you get a compilation error.

The solution? Use instanceof if possible. Another solution is to use a static final member initialized by calling getClass() on a dummy instance.

static final Class INTEGER_CLASS = new Integer(0).getClass();

OpenMoko mobile phone available for $300

The OpenMoko mobile phone, with a completely open software stack, is available starting today for $300. http://openmoko.com/

Trivial Templates

A template is used to generate a text string based on some data parameters. A simple template for a letter:

Dear ${name},
please pay us ${amount} dollars.

You use a template like this:

letterTemplate = Template(templateString)
letter1 = letterTemplate(name='John', amount='100')
letter2 = letterTemplate(name='Tom', amount='200')

The template can be more powerful than simple text substitution, by using loops, conditionals, etc.

Hello ${name},
you ordered these ${len(items)} items:

{{for (i, item) in enumerate(items):}}
item ${i+1}: ${item}
{{end}}

Calling the template with some parameters

print template(name='Joshua', items=['bottled water', 'milk', 'icecream'])

generates this output:

Hello Joshua,
you ordered these 3 items:

item 1: bottled water
item 2: milk
item 3: icecream

How difficult is it to implement such a template engine? 60 lines of python code:

import cStringIO as StringIO
import re

_reExpr  = re.compile(r'\${(.*?)}')
_reCode  = re.compile(r'(?:\n *)?{{(.*?)}}')

class Adder():
    def __init__(self):
        self.indent = ''
        self.o = StringIO.StringIO()

    def incIndent(self):
        self.indent = self.indent + '  '

    def decIndent(self):
        self.indent = self.indent[:-2]

    def write(self, txt):
        self.o.write(self.indent + txt + '\n')

    def getvalue(self):
        value = self.o.getvalue()
        self.o.close()
        return value

def pairify(seq):
    it = iter(seq)
    while True:
        yield (it.next(), it.next())

class SimpleTemplate():
    def __init__(self, string, origin = None):
        string = _reExpr.sub(r'{{_o(str(\1))}}', string)
        parts = _reCode.split('{{}}' + string)[1:]
        adder = Adder()
        for (code, txt) in pairify(parts):
            code = code.strip()
            if code:
                if code == 'end':
                    adder.decIndent()
                else:
                    firstWord = code.split(None, 1)[0]
                    if firstWord[-1] == ':': firstWord = firstWord[:-1]
                    if firstWord in ['else', 'elif', 'except', 'finally']:
                        adder.decIndent()
                    adder.write(code)
                    if code[-1] == ':': adder.incIndent()
            if txt:
                txt = txt.replace('\\\\', '\\\\\\\\').replace('"', '\\\\"').replace('\\n', '\\\\n')
                adder.write('_o("' + txt + '")')
        source = adder.getvalue()
        self.code = compile(source, origin if origin else source, 'exec')

    def __call__(self, **varMap):
        output = StringIO.StringIO()
        varMap.update(_o = output.write)
        exec self.code in varMap
        content = output.getvalue()
        output.close()
        return content

How to switch to UTC

UTC (Universal Time Coordinated) is basically the same thing as GMT (Greenwich Mean Time). It is the zero-offset time zone.

As the world gets global, it gets confusing to use all the different timezones. It would be good to have a uniform system to name the time — and UTC is the best candidate.

UTC is already used by airplane pilots when they communicate time — because the planes travel often between timezones, and you don’t want some pilot or control tower to get confused in the time-zone names and time translation.

The local time system that is in use now has the advantage that the time (in the local time zone) corresponds to some degree to the sun position on the sky. In general you expect that 8:00am is in the morning (sunrise), and at 1:00am you expect that it’s dark.

The drawbacks of the local time system become appearent when people travel or communicate between timezones: a traveller has to adjust his watch when travelling, and continously make mental transformations to compute the time in the other time zone. Whatsmore, were there a single (unique) timezone, we could drop the timezone concept altogether, and only talk about time (or Earth time, when need comes).

Also the communication between people is now global — instant messenger, VOIP calls, skype etc. More and more people talk between timezones. It’s easy to see the babel scenario when co-workers use different time-zones each to name the time. For example:

A: What about having the meeting at 9:00 PDT?
B: Sorry, I have to leave at 18:00 CEST.
C: 18:00 EEDT works fine for me.
A: Ok, what about 8:00 PDT then?

Here is the plan for the global switch to UTC:

  1. Eliminate Daylight Saving Time (Summer Time) in all time-zones (have a single time throughout the year in each timezone).
  2. Merge close-by zones: Eastern Europe (EET, e.g. Romania) and Central Europe (CET, e.g. Germany) would adopt U.K. time (UTC); The United States would merge into a unique time zone (e.g. PST or EST). This should reduce the global number of time-zones to about half a dozen.
  3. Start communicating the time both in the local timezone and in UTC (in order for the population to get used to UTC).
  4. Abandon the remaining timezones, everybody adopts UTC only.

Of course, an ambitious endeavor, to unify the timezones. To put it in perspective, is it more difficult than the human flight to Mars? Or is it easier than the power-plug unification?

On the plus side, I think it would be enough for some major countries to adopt UTC (China, India, US, European Union, Russia, Japan) and the others will follow.

Working for the man, part 1

Among the first things I’ve been imprinted with at Google, is that Google is a very secretive company. I’m not allowed to tell almost anything about [the inside of] Google. I don’t even know how I could ever post anything mentioning the word Google on my public blog, but given that I’ve read public blogs of other googlers, I imagine that just blogging may not be a reason for immediate firing. Just for caution, I’ll try to restrict myself to saying only positive things about Google (and this does not imply that there are other-than-positive things to be said about Google) — self-censorship is an interesting attitude.

First, I should stress that there are plenty of great things about Google — the problem is that many of these are already well known (not to mention the food).

My first contact with the inside of Google was through the interviewing process (when I was recruited). I can tell that I was positively impressed by the interviewing process: most of the questions/tasks were relevant and challenging, and the people conducting the interviews seemed compentent and smart. I also had the impression that they were looking for objective hints in order to evaluate my skills (as oposed to subjective judgement). As a simple outline of the interviewing process, I started by sending my Resume (CV) by email to Google. After this followed two phone-screens, and afterwards, four in-person interviews.

What can I say, I’d wish to become as good an interviewer as some of the persons who interviewed me. But there is one small thing that keeps me from being completly extatic about the recruiting process at Google: about a year and a half ago, I sent for the first time my resume to Google. It was not exactly the same resume, and the layout was different, but it was essentially describing the same person (me).

Back then, I got back the standard refusal email, we have no oppenings matching your skills at this time or something like this, but I think that the real message of the email was: based on your resume, we decided it’s not worth proceeding to the phone-screen phase. Meaning that the recruiter was so confident that I’m a bad candidate, that it didn’t warrant a phone-screen test.

I don’t really believe that I improved that much between then and now. I rather think that I experienced a glimpse in the recruitment process: either Google hired a bad candidate now, or the recruiter hurried to drop my resume back then — I hope it’s the last. Of course, Google (like some other companies) is more concerned about avoiding a bad hire than about losing a good candidate — or in other words, a false-negative is preferred to a false-positive. From the employer perspective, it’s preferrable to err on the side of caution; what I’m surprised is that my resume was judged so bad that I didn’t even make it to the phone-screen.

PS: if you’re interested in a software development position at Google, you may be better-off sending your resume (CV) to a Google employee you know for them to refer you, rather than doing a spontaneous application. And second, Google is hiring in (extended) Europe and Russia, so don’t be put-off if you don’t live in the US.

Sidebar: the US measure system

US Metric
1 mile 1.6 km
1 foot (feet, ‘) 0.3 m, ~1/3 m
1 inch (”) 2.5 cm
1 gallon 2.5 liter

Temperature

Celsius Fahrenheit
-10 14
0 32
10 50
20 68
30 86
40 104
Fahrenheit Celsius
25 -4
50 10
75 24
100 38