Examines the challenges in controlling the text cursor in a web application. The requirements of Google Tasks are discussed to motivate the problem.

Caret Navigation in Web Applications

by Michael Bolin, April 24, 2012

A little over two years ago, I left Google. In my farewell blog post, I noted:

"One of the many things Google has taught me is that building simple things is often extremely complicated and [Google Tasks] was no exception. (I think I've spent at least one man-month trying to figure out the best way for the cursor to move up and down between tasks, but that's a topic for another post.)"

Somehow, my blog post made it onto Reddit where some disgruntled group of people modded down my post because they were hoping to hear more about this whole cursoring thing.

Fast-forward to today where a little company named Asana has decided to take a stab at task management software. Now, I still use Google Tasks heavily, and even though the Tasks community has loudly and clearly expressed its desire to share task lists, Google has failed to come through. Finding myself in need of a shared task list, I decided to give Asana a try.

Upon firing it up, one of the things that Asana tells you is that it aspires to be a lightweight text editor, of sorts. This is precisely the approach that we took with Google Tasks, so I was intrigued to see how Asana chose to deal with all of the design issues that my team encountered several years earlier.

I was disappointed to discover that the Asana team has taken a number of shortcuts: hierarchy is limited, tasks are not linkified, and most importantly, task text does not wrap. From experience, I can tell you that each of these shortcuts makes things considerably easier to implement, but with three years of development and $10.2M in funding, I would expect a little more.

But my goal here is not to bash Asana (on the contrary, I really like their sharing and tag features, email integration, and support for headings), but to shed some light on why dealing with wrapped text and cursoring is so complicated. Hopefully this will help others build high-fidelity, web-based user interfaces that require text editing.

In a normal text editor, when you use the up or down arrow keys to navigate between lines, the editor generally does its best to preserve the x-coordinate of the cursor as the y-coordinate changes. For example, consider navigating the cursor downwards through the following chunk of text, starting from the middle of the word "little":

Here is a nice lit|tle passage. (Character offset: 18)
It contains three| sentences. (Character offset: 17)
None of which is| all that interesting. (Character offset: 16)

If you do so in a native text editor, the cursor will follow the path marked by the red | characters. Note that each | appears at a different character offset in each of the three lines. Feel free to verify this yourself in the following textarea:

Here is a nice little passage. It contains three sentences. None of which is all that interesting.

This is significant because it means that you cannot reuse the cursor's character offset on the previous line as its character offset on the next line. (This would not be an issue with a monospace font, and there were many days that I got out of bed wondering whether we could get away with releasing Google Tasks with one, though, as you will see, indenting and wrapping keep a monospace font from being a drop-in solution to this problem.)
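To make the point concrete, here is a toy model (not code from Google Tasks) that treats a proportional font as a table of per-character advance widths. The widths in the hypothetical `CHAR_WIDTHS` map are invented for illustration. `caretX` computes the x-coordinate of a caret placed before a given offset, and `offsetForX` finds the offset on another line whose caret is closest to a given x.

```javascript
// Hypothetical advance widths (in pixels) for a few characters of a
// proportional font; any character not listed is assumed to be 7px wide.
const CHAR_WIDTHS = new Map([
  ["i", 3], ["l", 3], ["t", 4], [" ", 4], ["m", 11], ["w", 10],
]);
const DEFAULT_WIDTH = 7;

function charWidth(ch) {
  return CHAR_WIDTHS.has(ch) ? CHAR_WIDTHS.get(ch) : DEFAULT_WIDTH;
}

// x-coordinate of a caret placed before the character at `offset`.
function caretX(line, offset) {
  let x = 0;
  for (let i = 0; i < offset; i++) {
    x += charWidth(line[i]);
  }
  return x;
}

// Character offset in `line` whose caret x is closest to `targetX`.
// Ties go to the earlier offset.
function offsetForX(line, targetX) {
  let best = 0;
  let bestDistance = Math.abs(targetX);  // the caret at offset 0 has x = 0
  let x = 0;
  for (let offset = 1; offset <= line.length; offset++) {
    x += charWidth(line[offset - 1]);
    const distance = Math.abs(x - targetX);
    if (distance < bestDistance) {
      best = offset;
      bestDistance = distance;
    }
  }
  return best;
}
```

With these widths, `caretX("mmmm", 2)` is 22px, yet the caret closest to 22px on a line of ten `i` characters is at offset 7, not 2: preserving x means recomputing the offset line by line.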

In order to emulate native cursoring, you need to "do the math" and calculate where the cursor is and where it should go. This decision process roughly breaks down as follows:

Which element had focus when the user pressed a key?
Where was the cursor when the user pressed that key?
If the cursor should be moved to a new task, where should it be placed?

In Google Tasks, each task is displayed in a contentEditable div (except on Firefox 2, which does not support contentEditable elements; that case is discussed later). By comparison, Asana uses one input element per task, which means that task text cannot wrap. This is my primary issue with Asana, as I often write long task descriptions and want to see all of them on one screen. Asana forces me to move more information into the "Notes" section, which is frustrating because it is only possible to see one note at a time.

Can I see all of my notes at once in Google Tasks?

Although it has not been widely publicized, there is a fullscreen view for Google Tasks: https://mail.google.com/tasks/desktop. For each task in your list, you can see the first line of notes associated with each task without digging into the details pane. (Though if you want to get to the details pane, it is easily accessible via either clicking the arrow on the right of the task, or hitting shift+enter when the task has keyboard focus.)

If you are a heavy Tasks user, then you should definitely bookmark the fullscreen view. I have created a Chrome extension that provides a link to Tasks from the bar that is at the top of most Google web properties. Unfortunately, sometimes when Google updates its web properties, my extension breaks, but I do my best to keep it up to date.

Fortunately for Asana engineers, displaying each task in an input element makes cursor management considerably easier to implement. Because task text is only one line, when the user hits the down arrow, it always means the user is navigating to the next task (likewise for the up arrow). By comparison, when a user hits the down arrow in Google Tasks, the user may either be navigating within a multi-line task, or navigating to the next task.

This seemingly small difference in product makes a world of difference in engineering.

Which element had focus when the user pressed a key?

Because key events bubble, it is possible to add a single key event listener at the root element that contains all of the contentEditable task elements. When the listener is triggered, the target can be inspected to determine which element had focus when the key event was fired, and the keyCode can be inspected to determine whether the user pressed the up or down arrow.
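A minimal sketch of that single, delegated listener follows. It assumes task elements carry the CSS class `task` (as in the HTML examples in this article); `findTaskElement`, `createArrowKeyHandler`, and the `onArrow` callback are hypothetical names. It checks `keyCode` 38 and 40, as code of this era did; modern code would more likely compare `event.key` against `"ArrowUp"` and `"ArrowDown"`.

```javascript
// Key codes for the arrow keys, as reported by KeyboardEvent.keyCode.
const KEY_UP = 38;
const KEY_DOWN = 40;

// Walks up from `node` to the nearest task element (or the node itself),
// stopping at `root`. Assumes tasks are marked with the class "task".
function findTaskElement(node, root) {
  for (let el = node; el && el !== root; el = el.parentNode) {
    if (el.classList && el.classList.contains("task")) {
      return el;
    }
  }
  return null;
}

// Builds one keydown handler for the element containing every task.
// `onArrow(event, taskEl, direction)` is called for up/down presses only.
function createArrowKeyHandler(root, onArrow) {
  return function (event) {
    let direction = null;
    if (event.keyCode === KEY_UP) {
      direction = "up";
    } else if (event.keyCode === KEY_DOWN) {
      direction = "down";
    }
    if (!direction) {
      return;  // not an arrow key this handler cares about
    }
    const taskEl = findTaskElement(event.target, root);
    if (taskEl) {
      onArrow(event, taskEl, direction);
    }
  };
}
```

In a page, this would be registered once, for example with `root.addEventListener("keydown", createArrowKeyHandler(root, handleArrow))`, and tasks could then come and go without touching any listeners.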

Why not add one key event listener per task?

Adding one key event listener per task div is undesirable because memory use would grow with the length of the task list. Listeners are frequently constructed as anonymous functions, which means they often hold references to objects in their enclosing scope that cannot be garbage collected until the listener is removed.

Further, using a single listener makes bookkeeping simpler. As the user modifies the list, the divs used to display task text can be added and removed from the DOM without having to keep the set of key event handlers in sync.

You can learn more about the general form of this pattern, known as event delegation, on Kushal's blog.

Assuming up or down was pressed, the next step is to determine whether to allow the key event to proceed as it normally would (moving the cursor within the task), or whether it should be suppressed and the cursor should be moved into an adjacent task. The latter occurs when either the cursor is in the top line of the div and up is pressed, or when the cursor is in the bottom line of the div and down is pressed. Therefore, the location of the cursor must be determined in order to make this decision.
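The decision itself is small once the cursor's line is known. The following is a sketch, not Google Tasks code: `locateCaret(taskEl)` is a hypothetical helper returning `{ isFirstLine, isLastLine }` (computed as described in the next section), and `moveCaretToAdjacentTask(taskEl, direction)` is a hypothetical helper that performs the move.

```javascript
// Lets the browser handle up/down within a task, or suppresses the key
// event and moves the caret into the adjacent task. Returns true if the
// key event was intercepted.
function handleArrowKey(event, taskEl, direction, locateCaret,
                        moveCaretToAdjacentTask) {
  const caret = locateCaret(taskEl);
  const leavingTask =
      (direction === "up" && caret.isFirstLine) ||
      (direction === "down" && caret.isLastLine);
  if (!leavingTask) {
    return false;  // native behavior: move the caret within the task
  }
  event.preventDefault();  // suppress the native caret movement
  moveCaretToAdjacentTask(taskEl, direction);
  return true;
}
```

Note that a single-line task is both its own first and last line, so either arrow key leaves it.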

Where was the cursor when the user pressed that key?

Unfortunately, there is no cross-browser API to determine the (x, y) location of the cursor directly, so we must use other heuristics. Specifically, we use the DOM range API to determine the cursor position, which often, but not always, maps to a unique (x, y) position on the screen.

For those of you who are unfamiliar with the DOM range API, it is basically a way of representing a contiguous region of a DOM between a start point and an end point. (I became intimately familiar with this API when working on Chickenfoot.) When you select HTML on a web page, the selected HTML is represented as a DOM range. When a single point in the DOM is selected, as is the case for a cursor, the range is referred to as "collapsed."
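With today's standard Selection and Range APIs, reading the cursor takes only a few lines. (IE versions before 9 lacked `window.getSelection()` and exposed a different TextRange API instead.) This sketch takes the selection as a parameter so it can be exercised without a browser; `getCaretPosition` is a hypothetical name.

```javascript
// Returns the caret as { node, offset } if the selection is a single
// collapsed range, or null otherwise (no selection, or a non-empty one,
// in which case up/down can be left to the browser).
function getCaretPosition(selection) {
  if (!selection || selection.rangeCount !== 1) {
    return null;
  }
  const range = selection.getRangeAt(0);
  if (!range.collapsed) {
    return null;
  }
  // For a collapsed range, start and end are the same boundary point.
  return { node: range.startContainer, offset: range.startOffset };
}
```

In a browser this would be called as `getCaretPosition(window.getSelection())` from inside the key handler.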

Therefore, when a key is pressed, we can ask the browser for its selected range at that instant, double-check that it is collapsed (i.e., represents the cursor position), and then map that range to an (x, y) coordinate on the screen. Originally, I did this by recreating the HTML of the task in an offscreen element with a span inserted at the range position so I could use standard DOM APIs to calculate the position of the span, which I would then use to calculate the position of the cursor relative to the div containing the task. For example, suppose a task was rendered in the list using the following HTML:

<div class="task" contenteditable
    >Take out the trash and bundle the recycling.</div>

and that the selection was a range whose start and end containers were both the text node within the div, with start and end offsets of 23 (just before the "b" in "bundle"). The next step would be to add the following HTML to the DOM (the whitespace between nodes is added for clarity, but would have to be removed in practice to match the original task HTML exactly):

<div id="buffer" style="position: absolute; left: -10000px; top: -10000px">
  <div class="task" contenteditable>
    Take out the trash and <span id="cursor"></span>bundle the recycling.
  </div>
</div>
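Generating the inner HTML of the offscreen copy amounts to splitting the task text at the offset, escaping both halves (the text will be assigned to `innerHTML`), and inserting an empty marker span with no surrounding whitespace. A sketch, assuming the task's content is a single text node; `escapeHtml` and `buildMarkedTaskHtml` are hypothetical helpers.

```javascript
// Escapes plain text so it can be safely assigned to innerHTML.
function escapeHtml(text) {
  return text
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;")
      .replace(/"/g, "&quot;");
}

// Inner HTML for the offscreen copy of a task: the original text with an
// empty marker span at `offset`. No whitespace is added, so the copy
// should wrap exactly like the original.
function buildMarkedTaskHtml(text, offset, markerId) {
  return escapeHtml(text.slice(0, offset)) +
      '<span id="' + markerId + '"></span>' +
      escapeHtml(text.slice(offset));
}
```

For the example above, `buildMarkedTaskHtml("Take out the trash and bundle the recycling.", 23, "cursor")` yields the contents of the inner `.task` div, which would then be assigned to the copy's `innerHTML`.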

Now the (x, y) offset relative to the upper-left-hand corner of the task can be calculated as follows using some standard utilities from the Closure Library:

var cursorEl = goog.dom.getElement('cursor');
var cursorOffset = goog.style.getPageOffset(cursorEl);

var bufferEl = goog.dom.getElement('buffer');
var bufferBounds = goog.style.getBounds(bufferEl);

var xOffset = cursorOffset.x - bufferBounds.left;
var yOffset = cursorOffset.y - bufferBounds.top;
var taskHeight = bufferBounds.height;

The yOffset can be used to determine whether the cursor is in the top or bottom line of the div. Specifically, if yOffset is less than or equal to the top padding of the .task CSS, then the cursor is in the first line of rendered text. Conversely, if yOffset is greater than or equal to taskHeight minus both the line height of the task text and the bottom padding of the .task CSS, then the cursor is in the last line of rendered text. (Note that due to browser differences and/or subpixel rendering, you may have to add a pixel of tolerance when doing these calculations.)
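These two tests fit in a small pure function. This is a sketch under assumptions: `style` is a hypothetical object holding the task's top padding, bottom padding, and line height in pixels, and the one-pixel `tolerance` default reflects the note above. The last-line test is inclusive because, ideally, a marker on the last line has a yOffset exactly equal to taskHeight minus the line height and bottom padding.

```javascript
// Classifies the caret's vertical position within a task. All values are
// in pixels; `style` = { paddingTop, paddingBottom, lineHeight }.
function classifyCaretLine(yOffset, taskHeight, style, tolerance = 1) {
  const lastLineTop = taskHeight - style.lineHeight - style.paddingBottom;
  return {
    isFirstLine: yOffset <= style.paddingTop + tolerance,
    isLastLine: yOffset >= lastLineTop - tolerance,
  };
}
```

For a three-line task with 16px lines and 2px padding (taskHeight 52), line tops are at 2, 18, and 34, so only the first is classified as the first line and only the last as the last line; a one-line task is both.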

Why not insert a special string and look for that?

When I originally tried to implement this technique, my first instinct was to insert a tiny character wrapped in a span to serve as the element to measure because I assumed it would need to take up a non-zero amount of space for the DOM calculations to work. I used a pipe (|) character to minimize the change in width of the task content, so I originally had something like:

Take out the trash and <span id="cursor">|</span>bundle the recycling.

It turned out that inserting this extra character could change where the text wrapped, which would screw up the calculations. Fortunately, my teammate suggested I try dropping the character, which turned out to work just fine.

Bear in mind that #buffer must display text exactly as the original element does in order to reflect line breaks accurately. The HTML snippet above neglects this in two important ways.

First, the width of #buffer should be specified as an inline style to match that of the div being cloned. Without a fixed width, the absolutely positioned #buffer will display as one long string of text, in which case the cursor calculation logic would always determine the cursor to be in both the first and last line of task text.

Second, although #buffer has the same CSS class as the original div, that does not mean that it has inherited all of the same styles as the div. For example, if the original HTML were:

<style>
body {
  font-size: 14px;
}
.task {
  font-size: 120%;
}
</style>

<div style="font-size: 10px">
  <div class="task">
    Take out the trash and bundle the recycling.
  </div>
</div>

In this scenario, the resulting font size of the .task element would be 12px (120% of 10px), but if that element were cloned into a #buffer appended to the body, its font size would be 16.8px (120% of the body's 14px). This difference in font sizes would cause the text to draw (and possibly wrap) differently in #buffer, which would throw off the cursor calculations. Therefore, it is important to include any styles that cascade into .task and affect rendering as inline styles of #buffer.
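One way to do this is to copy the relevant computed styles of the original task's *parent* onto #buffer, so that relative rules on .task (such as `font-size: 120%`) resolve against the same base as in the page. Copying the task's own computed font size instead would apply the 120% twice. This is a sketch: the property list is a hypothetical, incomplete starting point, and `width` is assumed to be the width in pixels of the box the original task is laid out in.

```javascript
// Hypothetical, non-exhaustive list of inherited properties that affect
// how text wraps. Extend it to match the styles your page uses.
const WRAP_AFFECTING_PROPERTIES = [
  "font-family", "font-size", "font-style", "font-weight",
  "letter-spacing", "word-spacing", "line-height", "white-space",
];

// Builds the inline style for #buffer from `parentStyle`, a computed style
// object (anything with getPropertyValue), plus a fixed pixel width.
function buildBufferStyle(parentStyle, width) {
  const declarations = [
    "position: absolute",
    "left: -10000px",
    "top: -10000px",
    "width: " + width + "px",
  ];
  for (const name of WRAP_AFFECTING_PROPERTIES) {
    const value = parentStyle.getPropertyValue(name);
    if (value) {
      declarations.push(name + ": " + value);
    }
  }
  return declarations.join("; ");
}
```

In a browser this might be applied as `bufferEl.style.cssText = buildBufferStyle(window.getComputedStyle(taskEl.parentNode), width)`. Setting `cssText` avoids having to quote font-family values inside an HTML attribute.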

Why not use the original task element as the buffer?

It is true that if the original .task element were used in place of the buffer, then there would be no need to set the width or cascading styles on the buffer element. However, redrawing the content of a contentEditable element while it has focus may disrupt focus considerably on some browsers. Further, drawing into an absolutely positioned element rather than one in the middle of the document should, in theory, reduce the amount of document reflowing that the browser has to do after modifying the DOM.

In practice, setting the width of the #buffer and determining the styles that cascade into it is not that difficult. If you look at the UI for Google Tasks in Gmail, Calendar, or the standalone view, you will see that it actually wraps itself in an iframe so that it can be embedded anywhere without having to worry about unexpected styles from parent elements cascading into it, which would throw off the cursoring calculations. I have discussed other advantages of this approach on my blog, as well as on p. 392 of my book in the section on "Externs versus exports."

The final gotcha in this approach, which is what makes this technique imprecise, is that a collapsed DOM range may not map to a unique (x, y) position on the screen. This happens when the cursor is at the beginning or end of a line. For example, if the area where the task was rendered were narrow enough that the text wrapped before the "t" in "the recycling," then the cursor could be in either of the following positions, both of which map to the same DOM range:

Take out the trash and bundle |
|the recycling.

This is significant because if the cursor were at the end of the first line, then hitting the up arrow should navigate into the previous task while hitting the down arrow should move the cursor to the end of the second line of the task. Conversely, if the cursor were at the start of the second line, then hitting the up arrow should navigate to the beginning of the first line of the task while hitting the down arrow should move the cursor into the next task.

Feel free to use the arrow keys in the following textarea to see how native cursoring behaves. At least in the current versions of Chrome and Safari, using the left and right arrows to move the cursor seems to "ignore" the space after the "e" in "bundle," as right-arrowing from there brings you to the start of the next line. However, it is possible to use the mouse to place the cursor at the end of the space after that "e" on the first line, though strangely hitting the right arrow from that position moves the cursor into the second character of "the" instead of to the start of the line.

Take out the trash and bundle the recycling.

The question remains: how do you implement a solution that accounts for this edge case? One option is to treat it as a hysteretic system, recording all previous keyboard and mouse input so that the current cursor position can always be calculated, though that would be rather complicated.

A simpler, but imperfect, solution is to resolve the ambiguity by assuming that the cursor is at the beginning of the lower line rather than at the end of the upper line:

Take out the trash and bundle |
|the recycling.

If the DOM range corresponding to the selection is at character offset 30 within the text node, then assume the cursor is at the blue position (the start of the lower line) rather than the red position (the end of the upper line).

In practice, this is more likely the case because the user is probably arrowing down through the left side of the task list, or the user has right-arrowed to get to the start of the next line. The only case where this heuristic is incorrect is when the user uses the mouse to place the cursor on the right side of the space at the end of the line.

(You can verify this failure case yourself in Google Tasks by typing the above task text into Tasks in Gmail, using the mouse to place your cursor where the red mark is, and then hitting the down arrow. Assuming it is not the last task in the list, the cursor will go to the start of the next task rather than to the end of the current task because Google Tasks assumed you were moving the cursor from the blue position.)

Although this appears to be a practical way to resolve the ambiguity, this introduces the challenge of determining whether you are in an ambiguous case! For example, consider the following HTML (whitespace between HTML tags is included for clarity):

<div id="buffer" style="position: absolute; width: 189px">
  <div contenteditable class="task">
    Take out the trash and bundle <span id="cursor"></span>the recycling.
  </div>
</div>

This is the HTML that represents the selection at character offset 30 within the task. Using the cursor calculations introduced earlier yields an xOffset of 177 and a yOffset of 0. This (x, y) position corresponds to the red cursor position on the upper line rather than the blue cursor position on the lower line. Unfortunately, this is the opposite of the behavior we want, so we need to amend our technique for calculating the cursor position.

The solution is to take a second measurement. For example, when cursoring down, we perform the same cursor calculation at the next character offset. In this case, performing the cursor calculation for the DOM range that corresponds to character offset 31 yields an xOffset of 4 and a yOffset of 16. Because the yOffset in the second measurement is greater than the yOffset of the first measurement, the current cursor position must be on the boundary of a line break. We have now identified that we are in the ambiguous case of the red versus blue cursor position; as explained above, we choose to assume the cursor is at the beginning of the lower line, and move the cursor into the next task accordingly.
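The two-measurement check can be expressed as a function that returns the y-coordinate to use for the caret's line. In this sketch, `measure(offset)` is a hypothetical helper that places the marker span at `offset` in the offscreen buffer and returns its `{ x, y }` relative to the buffer, as described earlier; `resolveCaretLineY` is likewise a hypothetical name.

```javascript
// Returns the y-coordinate of the line the caret should be treated as
// being on. If the marker at `offset` reports the upper line but the
// marker one character later is already lower, the caret sits on a soft
// line break, and the heuristic above assumes the start of the lower line.
function resolveCaretLineY(measure, offset, textLength) {
  const here = measure(offset);
  if (offset < textLength) {
    const next = measure(offset + 1);
    if (next.y > here.y) {
      return next.y;  // ambiguous case: prefer the lower line
    }
  }
  return here.y;
}
```

Using the measurements from the example (177/0 at offset 30, 4/16 at offset 31, plus made-up values for neighboring offsets), offset 30 resolves to the lower line while offset 29 stays on the upper one. The resulting y can then be fed into the first-line and last-line tests.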

If the cursor should be moved to a new task, where should it be placed?

As explained earlier, the yOffset and taskHeight from the cursor calculations can be used to determine whether the cursor should be moved to the previous or next task. The remaining question is where to place the cursor in the destination task.

From the calculations, we have the xOffset of where the cursor was, so the x in the (x, y) of the new cursor position should be as close to xOffset as possible while still maintaining the correct row position. When moving the cursor downward, the desired row is always the first row of the new task where the new yOffset should be 0. However, when moving the cursor upward, the taskHeight of the destination task must be calculated in order to determine the yOffset for cursor positions in the last row of that task.
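A sketch of choosing the destination row's y-coordinate, assuming a hypothetical `style` object with the .task padding and line height in pixels. With zero top padding, the first row is at 0, as in the text.

```javascript
// y-coordinate (relative to the destination task) of the row the caret
// should land in: the first row when moving down, the last row when moving
// up, which requires the destination task's rendered height.
function targetRowY(direction, destTaskHeight, style) {
  if (direction === "down") {
    return style.paddingTop;
  }
  return destTaskHeight - style.paddingBottom - style.lineHeight;
}
```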

Once the desired destination xOffset and yOffset values have been calculated, we employ a binary search to find the closest cursor position in the destination task. Specifically, each character offset in the destination task text is a candidate location for the new cursor position. We can create a search space by populating our existing #buffer element with span elements at each character offset. For example, if the destination task text were "homework," then the HTML would be as follows (again, whitespace between HTML tags is included for clarity):

<div id="buffer" style="position: absolute; width: 189px">
  <div contenteditable class="task">
    <span id="buffer-0"></span>h
    <span id="buffer-1"></span>o
    <span id="buffer-2"></span>m
    <span id="buffer-3"></span>e
    <span id="buffer-4"></span>w
    <span id="buffer-5"></span>o
    <span id="buffer-6"></span>r
    <span id="buffer-7"></span>k
    <span id="buffer-8"></span>
  </div>
</div>

To get the (x, y) position of character offset n, we can use goog.dom.getElement('buffer-' + n) to get the corresponding placeholder element and then use our existing cursor calculation logic to determine the corresponding (x, y) position. At each step of the binary search, the candidate (x, y) position is compared to the destination xOffset and yOffset values:

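// Position of the placeholder for character offset n relative to the
// upper-left corner of #buffer, i.e., in the same coordinate space as
// xOffset and yOffset. (goog.style.getPageOffset returns page coordinates,
// and #buffer is not at the origin of the page.)
var getCandidateOffset = function(n) {
  var bufferEl = goog.dom.getElement('buffer');
  var cursorEl = goog.dom.getElement('buffer-' + n);
  return goog.math.Coordinate.difference(
      goog.style.getPageOffset(cursorEl), goog.style.getPageOffset(bufferEl));
};

// Evaluator for goog.array.binarySelect: returns a negative number if the
// target position comes before candidate n, zero on a match, and a positive
// number if it comes after.
var evaluate = function(n) {
  var candidate = getCandidateOffset(n);
  var yDelta = yOffset - candidate.y;

  // If the candidate point is not in the right row,
  // then this cannot be a match.
  if (yDelta !== 0) {
    return yDelta;
  }

  return xOffset - candidate.x;
};

// Number of characters in the destination task text, assuming the
// destination task element (taskEl) has a single child node, which is a
// text node.
var taskLength = taskEl.firstChild.nodeValue.length;

// Create an array of 0..taskLength where each element is its index.
var candidates = new Array(taskLength + 1);
for (var i = 0, len = candidates.length; i < len; i++) {
  candidates[i] = i;
}

// Non-negative if an exact match was found; otherwise
// -(insertionPoint) - 1.
var characterOffset = goog.array.binarySelect(candidates, evaluate);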

The resulting characterOffset can be mapped to a collapsed DOM range that can be used as the selection to set the new cursor position. Note that when characterOffset is negative, indicating that no exact match was found, additional logic is required to find the best match.

When characterOffset is negative, the corresponding insertionPoint is -1 * (characterOffset + 1). When cursoring downward, either insertionPoint or insertionPoint - 1 should be the desired characterOffset. For example, if the binary search "overshoots" and returns an insertionPoint that corresponds to the first character in the second row, then the cursor should actually be at the last character in the first row, which should be insertionPoint - 1:

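var pickBestPoint = function(insertionPoint) {
  if (insertionPoint === 0) {
    return 0;
  }

  // binarySelect reports candidates.length as the insertion point when the
  // target lies after every candidate. There is no placeholder with that
  // index, so fall back to the last character offset.
  if (insertionPoint >= candidates.length) {
    return candidates.length - 1;
  }

  // Measure relative to #buffer, as in the cursor calculations above.
  var bufferOffset = goog.style.getPageOffset(goog.dom.getElement('buffer'));

  var cursorEl = goog.dom.getElement('buffer-' + insertionPoint);
  var cursorOffset = goog.math.Coordinate.difference(
      goog.style.getPageOffset(cursorEl), bufferOffset);
  if (yOffset !== cursorOffset.y) {
    // insertionPoint corresponds to the wrong row: choose the previous index.
    return insertionPoint - 1;
  }

  var previousCursorEl = goog.dom.getElement('buffer-' + (insertionPoint - 1));
  var previousCursorOffset = goog.math.Coordinate.difference(
      goog.style.getPageOffset(previousCursorEl), bufferOffset);

  var xDelta1 = xOffset - previousCursorOffset.x;
  var xDelta2 = cursorOffset.x - xOffset;

  return xDelta1 < xDelta2 ? insertionPoint - 1 : insertionPoint;
};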
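The same search can be written without the DOM, which makes the logic easier to test. In this sketch, `positions[i]` is assumed to hold the precomputed `{ x, y }` of a caret at character offset `i` relative to the destination task (in the page, these would come from measuring each placeholder span). The local `binarySelect` mirrors the documented contract of `goog.array.binarySelect` but is a reimplementation, not the Closure code, and only the downward case described above is handled.

```javascript
// Returns the index of the leftmost element for which compare(i) is 0, or
// -(insertionPoint) - 1 if there is none, where the insertion point is the
// first index for which compare(i) is negative (or length if none is).
function binarySelect(length, compare) {
  let left = 0;
  let right = length;
  let found = false;
  while (left < right) {
    const middle = left + ((right - left) >> 1);
    const result = compare(middle);
    if (result > 0) {
      left = middle + 1;
    } else {
      right = middle;
      found = result === 0;
    }
  }
  return found ? left : -left - 1;
}

// Offset whose caret is closest to (targetX, targetY), where targetY is
// the row the caret should land in.
function closestOffsetInRow(positions, targetX, targetY) {
  const compare = (i) => {
    const yDelta = targetY - positions[i].y;
    return yDelta !== 0 ? yDelta : targetX - positions[i].x;
  };
  const result = binarySelect(positions.length, compare);
  if (result >= 0) {
    return result;  // exact match
  }
  const insertionPoint = -(result + 1);
  if (insertionPoint === 0) return 0;
  if (insertionPoint === positions.length) return positions.length - 1;
  const here = positions[insertionPoint];
  if (here.y !== targetY) {
    return insertionPoint - 1;  // overshot into the next row
  }
  const previous = positions[insertionPoint - 1];
  return targetX - previous.x < here.x - targetX ? insertionPoint - 1 : insertionPoint;
}
```

For two hypothetical rows with carets at x = 0, 10, 20, 30 (y = 0) and x = 0, 10, 20 (y = 16), a target of (12, 0) lands on offset 1, and a target of (100, 0) lands on offset 3, the end of the first row, rather than overshooting into the second.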

There are also edge cases when cursoring upwards and characterOffset is negative, but how best to handle those is left as an exercise for the reader.

One final note is that when determining the xOffset of both the old and new cursor position, bear in mind that if the adjacent tasks are not at the same depth in the hierarchy, then some adjustment will have to be made.
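If each level of the hierarchy indents a task by a fixed amount, the adjustment is a translation between the two tasks' coordinate spaces. This is a sketch under an assumption: `INDENT_PX` is a hypothetical per-level indent, not a value taken from Google Tasks.

```javascript
// Hypothetical: each level of nesting indents a task by this many pixels.
const INDENT_PX = 20;

// Converts an xOffset measured relative to a task at `fromDepth` into the
// xOffset relative to a task at `toDepth`, so that the caret keeps the same
// on-screen x-coordinate. A negative result means the caret should go to
// the start of the row.
function translateXOffset(xOffset, fromDepth, toDepth, indentPx = INDENT_PX) {
  return xOffset + (fromDepth - toDepth) * indentPx;
}
```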

Additional challenges

The previous section explained the basics required to emulate native cursor movement in a web application. This section discusses a number of additional product requirements imposed by Google Tasks that made the problem even more difficult.

Supporting user agents that do not support contentEditable elements

When I started working on Google Tasks, Firefox 3.0 was still in beta and IE7 had just been released, so supporting Firefox 2.0 and IE6 was a firm requirement for the product. Like most mid-2000s frontend engineers, I did the bulk of my development in Firefox and would go back and add hacks for IE later to provide cross-browser support. Unfortunately, Firefox 2.0 did not support contentEditable elements, so I was unaware of them when I started on Tasks, and therefore my original implementation did not use them at all.

Do all modern user agents support contentEditable?

The contentEditable attribute was a feature pioneered by Internet Explorer 5.5, which was not supported by Mozilla until Firefox 3.0. I first learned about contentEditable when talking to two Googlers who worked on Closure Library's rich text editor widget, as they were intimately familiar with the difficulties of creating a text editor in the browser. A loose account of this conversation is provided by Nick Santos in the foreword of my book, Closure: The Definitive Guide.

Today it may seem as though worrying about user agents that do not support contentEditable is a problem of the past, but support on mobile is fairly recent. Specifically, contentEditable support was only introduced in iOS 5.0 and Android 3.0, while Opera Mini/Mobile do not yet support it at all.

Originally, each task was displayed as an ordinary div. A task could receive keyboard focus in one of two ways: the user could click on a task with the mouse, or the user could hit the up or down arrow to navigate to an adjacent task. (Note that there are other edge cases that we will not even discuss, such as hitting backspace at the beginning of a task, which joins it with the previous task, putting the cursor at the boundary of the join.) In either case, as a result of the user action, a singleton textarea would be "shuttled" across the screen so that it was displayed over the task's div.

The content of the textarea was updated to match the content of the task exactly, and once the textarea was in place, the cursor was moved to the point where the user would expect, using logic similar to that described in the previous section. It was imperative that the textarea line up exactly, or else the task text would appear to jitter as the user cursored up and down through the task list. (It took a considerable amount of experimentation with CSS to eliminate this jitter.)

You might wonder: why not just have one textarea per task rather than waste so much energy shuttling a single textarea around? Initially, one of my weaker arguments was that this approach would make it easier to select task text because you can drag to select text across div elements, but you cannot do so across input or textarea elements. (Go ahead, try selecting a bunch of tasks in Asana, copying them to the clipboard, and pasting them in a text editor.) It turns out that once we supported drag-and-drop in Google Tasks, most attempts at selecting task text were misinterpreted as drag-and-drops, anyway.

However, the more compelling reason turned out to be support for rich text in tasks. Today, the only task formatting that Google Tasks supports besides plaintext is hyperlinks, but because tasks are displayed in divs rather than textareas, all sorts of additional markup is possible. Again, compare this to Asana where if a task contains a URL, you cannot click on it. Users clamored for linkification of tasks until we added it.

Unfortunately, one of the drawbacks to the "shuttle" approach was that the textarea had to be continually resized as the user entered text in order to make sure that all of the task text was visible while the user was typing. This turned out to be error-prone because occasionally there would be cases where text wrapped differently in the div used to measure the text than it did in the textarea used to display it, such that the textarea was too small. This caused all sorts of visual errors, so moving to a contentEditable element that resized itself natively as the user typed eliminated this problem altogether.

Ideally, once the contentEditable solution was introduced, we would have deleted all of the code for the "shuttle" solution, but Firefox 2.0 died a slow death, so we had to keep it around for quite some time. Maintaining both code paths created a significant additional testing burden, so we were eager to eliminate the textarea path. Although Google Apps dropped support for Firefox 2.0 on March 1, 2010, spoofing the user-agent as Firefox 2.0 on the fullscreen view for Google Tasks today indicates that the textarea code path is still alive and well.

Wrapping task text

We believed that to be an effective task list, you needed to be able to see it alongside your email. Obviously, Gmail already has a lot going on, so there was not much real estate to allot for Tasks. Google Tasks received a small, collapsible space called a "mole" because opening and closing chat windows in Gmail is akin to "whack-a-mole."

Originally, the Tasks mole was slightly wider than the chat moles, but no taller. (Though recently, it appears that someone failed to update the CSS for the Tasks mole in one of the recent Gmail redesigns, so now the heights do not match, and the Tasks mole is narrower than the chat moles.) Because the Tasks mole was less than 250 pixels wide, we knew that we had to allow task text to wrap, and that task text would wrap often.

How bad would it be if task text did not wrap?

Consider the Asana UI where each task is displayed in a single-line input. If that interface were embedded in the same space as Tasks, you would only be able to see about four words per task (or fewer for a sub-task, which Asana does not allow) before the rest of the task text would be clipped.

Because Asana is currently used only as a fullscreen webapp, the clipping may be tolerable to most users. However, this design makes it difficult to reuse the existing UI as an embeddable widget that can be viewed alongside other applications where you want to be able to see your task list, such as mail and calendar. For example, Tasks is limited to 162 pixels of horizontal real estate in Google Calendar—it may be cramped, but it's readable!

For the most part, the use of contentEditable elements to display tasks addressed the wrapping issue. However, some browsers would not automatically wrap long strings of characters (such as URLs, which appeared frequently in tasks), in which case Tasks had to provide the browser some hints on how such text should be wrapped. These "hints" were strategically placed word-break tokens in long sequences of non-whitespace characters. The exact type of word-break token varied by browser; format.js in the Closure Library chooses the appropriate HTML for word-breaking as follows:

/**
 * Constant for the WBR replacement used by insertWordBreaks.  Safari requires
 * <wbr></wbr>, Opera needs the &shy; entity, though this will give a visible
 * hyphen at breaks.  IE8 uses a zero width space.
 * Other browsers just use <wbr>.
 * @type {string}
 */
goog.format.WORD_BREAK_HTML =
    goog.userAgent.WEBKIT ?
        '<wbr></wbr>' : goog.userAgent.OPERA ?
            '&shy;' : goog.format.IS_IE8_OR_ABOVE_ ?
                '&#8203;' : '<wbr>';
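To illustrate where such tokens end up, here is a naive word-break inserter. It is a sketch, not the implementation of Closure's `goog.format.insertWordBreaks`: it takes plain text, escapes it character by character, and inserts a caller-supplied break token (for example, the value of `goog.format.WORD_BREAK_HTML`) before every character that would extend a run of non-whitespace past `maxRun`.

```javascript
// Escapes a single character for inclusion in HTML.
function escapeChar(ch) {
  switch (ch) {
    case "&": return "&amp;";
    case "<": return "&lt;";
    case ">": return "&gt;";
    case '"': return "&quot;";
    default: return ch;
  }
}

// Converts plain text to HTML, inserting `wordBreakHtml` so that no run of
// non-whitespace characters is longer than `maxRun` without a break.
// Characters are counted before escaping, so entities are never split.
function insertWordBreaks(text, maxRun, wordBreakHtml) {
  let html = "";
  let run = 0;
  for (const ch of text) {
    if (/\s/.test(ch)) {
      run = 0;
    } else {
      if (run === maxRun) {
        html += wordBreakHtml;  // break before this character
        run = 0;
      }
      run++;
    }
    html += escapeChar(ch);
  }
  return html;
}
```

For example, `insertWordBreaks("abcdefg hi", 3, "<wbr>")` produces `abc<wbr>def<wbr>g hi`.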

Note that once goog.format.WORD_BREAK_HTML was introduced into the task text, the content of a contentEditable task div was no longer guaranteed to be a single text node, but may now be a series of alternating text nodes and wbr elements. Therefore, any of the aforementioned cursoring logic that assumed a task was always rendered as a single text node (of which there is definitely some) must be updated to account for this relaxed restriction.
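For example, a caret reported as (container node, offset) must now be translated into a character offset in the task's plain text. A sketch, assuming the task's content is a flat list of text nodes and `<wbr>` elements. It does not cover the `&shy;` and zero-width-space variants, where the break is a character inside the text node that would also need to be skipped, nor text nested inside other elements such as links. `toCharacterOffset` is a hypothetical name.

```javascript
const TEXT_NODE = 3;  // value of Node.TEXT_NODE

// Converts a DOM caret position inside `taskEl` into a character offset in
// the task's text. <wbr> elements contribute no characters. Returns -1 if
// `container` is neither `taskEl` nor one of its direct children.
function toCharacterOffset(taskEl, container, offset) {
  const children = taskEl.childNodes;
  let chars = 0;
  if (container === taskEl) {
    // Here `offset` counts child nodes, not characters.
    for (let i = 0; i < offset; i++) {
      if (children[i].nodeType === TEXT_NODE) {
        chars += children[i].nodeValue.length;
      }
    }
    return chars;
  }
  for (let i = 0; i < children.length; i++) {
    const child = children[i];
    if (child === container) {
      return chars + offset;
    }
    if (child.nodeType === TEXT_NODE) {
      chars += child.nodeValue.length;
    }
  }
  return -1;
}
```

For the `browserfeature.<wbr></wbr>js` example below, a caret after the "j" is offset 1 in the second text node, which maps to character offset 16 in the task text.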

Unfortunately, introducing these word breaks interferes with cursoring on some browsers. For example, a task containing the text browserfeature.js in Google Tasks will be rendered as the following HTML due to its word-break logic:

browserfeature.<wbr></wbr>js

If you right-arrow from the beginning of the task on Firefox 11, after you arrow past the . character, hitting the right arrow again fails to move the cursor, and hitting the right arrow once more takes the cursor to the start of the following task instead of to the right of the j. It appears as though the cursor gets "stuck" in the wbr somehow, interfering with the native cursoring behavior. Ideally, some sort of browser-specific workaround would be introduced to fix this bug.

Displaying URLs as hyperlinks

Initially, Tasks did not allow any sort of formatting in a task: it only allowed plain text. There were many requests for rich formatting, but we resisted. We wanted Tasks to feel lightweight, and we feared that introducing too many options would detract from that.

What about wiki text?

One compromise would have been to allow GChat/wiki-style formatting, in which surrounding text with underscores italicizes it, surrounding it with asterisks makes it bold, and so on. To get the benefits of unformatted text editing but formatted text viewing, it would make sense for the task to display itself as wiki text when it was being edited, but as rendered HTML when it did not have keyboard focus. However, this would have introduced jitter as the user cursored through tasks, because the number of displayed characters would change as a task gained or lost keyboard focus.
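The jitter is easy to quantify. The hypothetical stripWikiMarkers below is a deliberately crude stand-in for a wiki renderer. It only handles *bold* and _italic_ markers, and it would mangle identifiers such as snake_case names. It returns the text a user would see once the task lost focus, and focusJitter (also hypothetical) measures how many columns disappear.

```javascript
// Crude, hypothetical wiki renderer: returns the visible text after
// *bold* and _italic_ markers are rendered away.
function stripWikiMarkers(source) {
  // A marker pair must hug non-whitespace text, e.g. *buy* but not "* buy *".
  return source.replace(/([*_])(\S(?:.*?\S)?)\1/g, '$2');
}

// Number of characters that vanish when the task loses focus, i.e. how far
// text after the last marker shifts left.
function focusJitter(source) {
  return source.length - stripWikiMarkers(source).length;
}
```

For a task written as *buy* _fresh_ milk, the edited form has 18 characters and the rendered form has 14. Everything after the first marker therefore shifts as focus moves on or off the task, which is exactly what makes cursoring through a list of such tasks feel unstable.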

Nevertheless, we made one concession on this front: hyperlinks. We frequently found ourselves pasting URLs into Tasks, and it was infuriating to be unable to click on them (as is the case in Asana). At first glance, the logic to implement this feature seems trivial:

When displaying a task, linkify the task text.
When editing a task, use goog.dom.getTextContent() or an equivalent utility to normalize the content of the task div, escape it, and set it as the innerHTML of the contentEditable div for editing.
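The editing half of that recipe can be sketched without Closure. The getPlainText function below is a hypothetical stand-in for goog.dom.getTextContent(), not that function's implementation. It concatenates the text nodes under the task div, which collapses any link and wbr markup, and drops the &shy; and zero-width-space break characters. escapeHtml then makes the result safe to assign to innerHTML. All names here are illustrative.

```javascript
const TEXT_NODE = 3; // Node.TEXT_NODE
const BREAK_CHARS = /[\u00AD\u200B]/g; // &shy; and &#8203; word breaks

// Hypothetical stand-in for goog.dom.getTextContent(): collect the text under
// a node, ignoring element markup such as <a> and <wbr>.
function getPlainText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue.replace(BREAK_CHARS, '');
  }
  let text = '';
  for (const child of Array.from(node.childNodes)) {
    text += getPlainText(child);
  }
  return text;
}

function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// On focus: swap the rendered (linkified, word-broken) task for its escaped
// plain text so the user edits exactly the characters they typed.
function enterEditMode(taskDiv) {
  taskDiv.innerHTML = escapeHtml(getPlainText(taskDiv));
}
```

Setting innerHTML discards the current selection, so a real implementation would also need to restore the caret afterwards. The offset bookkeeping from the cursoring logic is what makes that possible.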

However, the combination of supporting both hyperlinks and word breaks introduces a new challenge. A naïve word-break inserter might insert a wbr every 20 characters, while a naïve linkifier might search for substrings that start with http and wrap them in a tags. The problem is that both of those functions take plain text as input and produce HTML as output, so it is not appropriate to feed the output of one into the other. For example, linkifying the text http://cnn.com/ might yield:

<a target="_blank" href="http://cnn.com/">http://cnn.com/</a>

If this string were passed to our naïve word-break inserter, it would become:

<a target="_blank" h<wbr>ref="http://cnn.com/<wbr>">http://cnn.com/</a<wbr>>

The result is far from valid HTML.

One solution is to linkify the text first, and then apply the word-break inserter to each text node produced by the linkification step. This enables code reuse without sacrificing correctness.
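Here is a minimal sketch of that ordering. It assumes a crude URL pattern (http:// or https:// followed by non-whitespace) and uses a fixed <wbr> token in place of the browser-specific goog.format.WORD_BREAK_HTML. Linkification first splits the plain text into URL and non-URL segments. The word-break inserter then runs on each segment's text, escaping characters as it goes. This works on string segments rather than DOM text nodes, but the principle is the same. The function names are hypothetical, and this is not the Tasks or Closure implementation.

```javascript
// Stand-in for goog.format.WORD_BREAK_HTML (browser-specific in Closure).
const WORD_BREAK_HTML = '<wbr>';

function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Plain text in, HTML out: escapes each character and inserts a word break
// after every n consecutive non-whitespace characters.
function insertWordBreaks(text, n) {
  let html = '';
  let run = 0;
  for (const ch of text) {
    if (/\s/.test(ch)) {
      run = 0;
    } else {
      if (run === n) {
        html += WORD_BREAK_HTML;
        run = 0;
      }
      run++;
    }
    html += escapeHtml(ch);
  }
  return html;
}

// Plain text in, HTML out: linkify first, then break the text of each
// segment, so the inserter never sees tags or attribute values.
function linkifyWithWordBreaks(text, n) {
  // With a capturing group, split() puts the URLs at the odd indices.
  return text
    .split(/(https?:\/\/\S+)/)
    .map((segment, i) =>
      i % 2 === 1
        ? `<a target="_blank" href="${escapeHtml(segment)}">` +
          insertWordBreaks(segment, n) + '</a>'
        : insertWordBreaks(segment, n))
    .join('');
}
```

Because the break inserter only ever sees plain text, the tags and attributes that the linkifier produces can never be split. That was exactly the failure when the linkifier's HTML was fed to the inserter.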

If you look at the HTML produced by Google Tasks, you can see that linkified URLs get special treatment with respect to word breaks. For example, if you have the URL http://code.google.com/p/closure-library/source/detail?r=736 as part of a task, Google Tasks will render it as follows (again, whitespace between HTML tags is included for clarity):

<a href="http://code.google.com/p/closure-library/source/detail?r=736">
  http:/
  <wbr></wbr>
  /code.
  <wbr></wbr>
  google.
  <wbr></wbr>
  com/
  <wbr></wbr>
  p/closure-library/
  <wbr></wbr>
  source/
  <wbr></wbr>
  detail?r=
  <wbr></wbr>
  736
</a>

Note how the wbr elements are not inserted at arbitrary intervals in the URL text. Instead, they are placed at more "natural" boundaries so that if the URL text wraps, it is easier to read. This is no accident! I added this logic specifically so that URLs would always be easy to use in Tasks, regardless of the width of the UI.
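The exact rules are not spelled out here, but one simple heuristic happens to reproduce every break point in the example above. It allows a break after /, ., or = once the current chunk is at least three characters long. That is why the second slash of http:// starts a new chunk, and why p/ is not split from closure-library/. The sketch below implements that guess with the hypothetical names urlChunks and urlWithWordBreaks. Treat it as a reconstruction, not the code Tasks actually used.

```javascript
// Hypothetical reconstruction: split a URL into chunks that end at "natural"
// boundaries, so that word breaks can be placed between them.
const BREAK_AFTER = '/.=';

function urlChunks(url, minChunkLength = 3) {
  const chunks = [];
  let current = '';
  for (const ch of url) {
    current += ch;
    // Break after a boundary character, but avoid tiny chunks like "/" or "p/".
    if (BREAK_AFTER.includes(ch) && current.length >= minChunkLength) {
      chunks.push(current);
      current = '';
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Escape each chunk (URLs may contain &) and join with the break token.
function urlWithWordBreaks(url, wordBreakHtml = '<wbr></wbr>') {
  return urlChunks(url).map(escapeHtml).join(wordBreakHtml);
}
```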

Testing caret behavior

Implementing the caret-positioning logic was fairly complicated: there were many edge cases it had to tolerate and a lot of browser-specific code under the hood. Once you finally solved an edge case on one of the browsers, you were more than happy to write a test for it because you would want to be alerted to any sort of regression. The problem was that there was no way to write tests.

Not only was there no way to write tests, but no one believed me when I told them that there was no way to write tests, and everyone kept asking me why I wasn't using Selenium. At the time, only Selenium 1.0 was stable and WebDriver was far from ready.

Selenium 1.0 works by emulating user input events in JavaScript. That means if you want to test your autocomplete widget, you instantiate a KeyEvent object in code and invoke a method on the input element to dispatch the event object. This will send the event through the event system, following the same bubbling and capturing paths that a native event would.

What are the browser APIs to simulate input events in JavaScript?

Take a look at the events code in the Selenium project to see how event objects can be created in JavaScript. This code contains a lot of browser-specific logic, but look for calls to createEvent(), createEventObject(), initMouseEvent(), and initKeyEvent() to see how you can simulate browser events from unprivileged JavaScript.

The problem is that programmatically dispatching a keydown event for the letter a on an input field exercises the key handler, but it does not type the letter a in the field and move the cursor forward one character (the input value is not changed at all). This is a significant problem when you are trying to test a mix of native and custom cursoring, as I was in Google Tasks. This limitation is what made it impossible for me to write tests using existing frameworks.
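The limitation is easy to reproduce with the KeyboardEvent constructor, the modern replacement for the createEvent()/initKeyEvent() pattern. The helper name simulateKeyDown is hypothetical. Dispatching the event runs every registered keydown handler, but the event is synthetic (its isTrusted property is false), so the browser does not perform the default action of inserting a character or moving the caret.

```javascript
// Hypothetical helper: dispatch a synthetic keydown on an element.
// Returns false if a handler called preventDefault(), true otherwise.
function simulateKeyDown(target, key) {
  const event = new KeyboardEvent('keydown', {
    key,
    bubbles: true,    // propagate through ancestors like a native keydown
    cancelable: true, // let handlers call preventDefault()
  });
  return target.dispatchEvent(event);
}

// In a browser, with an initially empty <input>:
//   const input = document.querySelector('input');
//   input.addEventListener('keydown', (e) => console.log('handler saw', e.key));
//   simulateKeyDown(input, 'a');  // logs "handler saw a"
//   input.value;                  // still '': no character was typed and
//                                 // the caret did not move
```

This is exactly the gap between exercising a key handler and exercising the browser's own text editing, and it is the gap that mattered for testing a mix of native and custom cursoring.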

I sought out a solution that would trigger the same logic as native events, and I suspected that I would have to write some low-level code to do so. As I was not interested in learning C or the Windows API to solve this problem, I spent many hours trying to find an easy-to-use API for injecting native input events across platforms. I have periodically searched for such an API, and the only solution I have ever found is java.awt.Robot.

It turned out that Robot had precisely the API that I needed, though some browsers did not respond to it perfectly. For example, to test out Robot, I created a Java applet that would fire a click event on a div with a JavaScript click handler attached to it. If I remember correctly, in Internet Explorer, the click handler did not fire until I physically moved the mouse after Robot had sent the click event. (Perhaps the click was stuck in some sort of input event processing queue internally on Windows?) Although Robot appeared to provide the right API, it was clear that I was not going to be able to use it on all browsers.

Fortunately, Robot seemed to work just fine on Firefox, which, again, was my development browser of choice anyway. Because I had become intimately familiar with LiveConnect (a technology that allows JavaScript to talk to Java applets) as part of my Master's thesis, Chickenfoot, I was excited at the prospect of writing my functional tests in JavaScript. That way, I could access both the in-memory data structures in my webapp and Robot from JavaScript.

Because Chickenfoot ran JavaScript in a privileged environment, it could also call Firefox APIs as well as execute calls in the shell (such as writing to files). For this reason, I ended up writing all of the functional tests for Tasks in Chickenfoot. (I discuss this system in more detail in my 2009 essay, "Functional Testing for Web Applications.") The only drawback was that getting Chickenfoot to run on Google's continuous integration system would have been quite complicated, so everyone on my team was responsible for running the tests manually before checking in a change.

This created a situation analogous to xkcd 303: "Compiling", in that while the tests were running, you could not do any work because using the mouse or keyboard would interfere with the tests. Your only option was to sit back and watch the show put on by Robot. At the time, I did not feel like learning about Xvfb so that I could run the functional tests in the background (or maybe I just enjoyed the five-minute break I got while running tests), but eventually the Tasks team got a Noogler who wrote a Python script to do exactly that.

Fortunately, WebDriver has come a long way since 2007, and today it would be my preferred way of writing these types of functional tests. The WebDriver team has taken on the dirty work of writing native code for all modern user agents so that user input can be emulated accurately during testing. This solves the primary limitation of the system I cobbled together using Chickenfoot, which was that it only worked on Firefox. Nevertheless, at the time, having functional tests for Firefox was better than not having any functional tests at all!

The big picture

Although I am extremely excited about the community's recent progress in providing (and documenting!) better tools for building web applications, I think that it is important to remember that using the latest and greatest tools is not sufficient to guarantee the best product. In this essay, I have discussed many of the details that went into the interface for Google Tasks. I assure you that there were just as many interesting problems in the data layer that also required creative solutions, but no one would care about any of them if there weren't a powerful and performant interface that surfaced that data to the user.

Today, when I look at Hacker News, articles about technologies like Backbone.js and Meteor are at the top of the list. Even Asana seems to have distracted itself with Luna, its in-house web framework, rather than focusing on the UI for its flagship product. I worry that we will end up with a collection of webapps that have beautiful architectures (that hopefully even work offline!) yet are borderline unusable because their interfaces are a thoughtless collection of jQuery UI widgets.

But to be fair, I have to remind myself that I am biased. I was fortunate enough to work on two consumer web products at Google with millions of users: Calendar and Tasks. Both presented the challenge of building a high-fidelity interface in a web browser when web browsers weren't designed to support such interfaces. I enjoyed those challenges, and I found the work incredibly satisfying.

Yet few web applications require that level of detail: many enterprise applications are nothing more than a polite sequence of web forms. Admittedly, those same applications solve real problems and make users more productive. There is an incredible amount of value there, which is why we continue to see new tools and frameworks that help build those types of solutions.

In sum, to move the web forward, we need both better frameworks and better interfaces. Better frameworks are well under way, but developers need to be more proactive in talking about their UI challenges and communicating these challenges to browser vendors. To build robust products, we need good APIs for building custom user interfaces as well as for testing them. Again, not every application requires a fancy frontend, but look out for such opportunities and exploit them! I suspect that pushing for a more creative UI will increase both your job satisfaction and your user base.