Design Overview

Terms and Abbreviations

Bidi Bidirectional
LTR Left to Right
RTL Right to Left
LRM Left-to-Right Mark
RLM Right-to-Left Mark
LRE Left-to-Right Embedding
RLE Right-to-Left Embedding
PDF Pop Directional Formatting

General Definitions, Terminology and Conventions

Every instance of bidi text has a base text direction. Bidi text in Arabic or Hebrew has an RTL base direction, even if it includes numbers or Latin phrases that are written from left to right. Bidi text in English or Greek has an LTR base direction, even if it includes Arabic or Hebrew phrases that are written from right to left.

Structured expressions also have a base text direction, which is often determined by the type of structured expression, but may also be affected by the content of the expression (whether it contains Arabic or Hebrew words).

This document addresses two groups of problematic cases:

  1. Expressions with simple internal structure: this category covers cases in which strings are concatenated in simple ways using known separators. Examples: variable names, "name = value" specifications, file paths, etc.
     
  2. Expressions with complex internal structure: this category covers structured text such as regular expressions, XPath expressions and Java code. It differs from the previous category because its expressions have a syntax of their own, which cannot be described as a concatenation of string segments joined by separators.

We will see that the same algorithms can handle both groups, with some adaptations in the details.

In the examples appearing in this document, uppercase Latin letters represent Arabic or Hebrew text, and lowercase Latin letters represent English text.

"@" represents an LRM, "&" represents an RLM.

Notations like LRE+LRM represent the character LRE immediately followed by the character LRM.

Bidirectional Control Characters

When bidi text is displayed incorrectly, the problem can often be fixed by adding bidi control characters at appropriate locations in the text. There are 7 bidi control characters: LRM, RLM, LRE, RLE, LRO, RLO and PDF. Since this design has no use for LRO and RLO (Left-to-Right Override and Right-to-Left Override, respectively), the following paragraphs describe the effect of the other 5 characters only.
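For reference, the seven control characters named above correspond to fixed Unicode code points. The short Python sketch below is illustrative only and not part of the design: the names BIDI_CONTROLS and describe_controls are assumptions, and the standard unicodedata module is used just to look up each character's official name.

```python
import unicodedata

# Code points of the seven bidi control characters listed above.
BIDI_CONTROLS = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "LRE": "\u202a",
    "RLE": "\u202b",
    "PDF": "\u202c",
    "LRO": "\u202d",  # not used by this design
    "RLO": "\u202e",  # not used by this design
}


def describe_controls():
    """Return one 'ABBR U+XXXX OFFICIAL NAME' line per control character."""
    return [
        f"{abbr} U+{ord(ch):04X} {unicodedata.name(ch)}"
        for abbr, ch in BIDI_CONTROLS.items()
    ]


if __name__ == "__main__":
    for line in describe_controls():
        print(line)
```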

Note that pieces of text bracketed between LRE/PDF or RLE/PDF can be contained within larger pieces of text that are themselves bracketed between LRE/PDF or RLE/PDF. This is why the "E" of LRE and RLE means "embedding". This could happen, for instance, with a Hebrew sentence containing an English phrase that itself contains an Arabic segment. In practice, such complex cases should be avoided if possible. The present design does not use more than one level of LRE/PDF or RLE/PDF, except possibly in regular expressions.
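The nesting described above can be checked mechanically. The following Python sketch is a hypothetical helper, not something the design defines: it counts how deeply LRE/RLE embeddings are nested in a string, which makes it possible to verify the "no more than one level" rule. It deliberately ignores LRO and RLO, since this design does not use them.

```python
LRE = "\u202a"  # Left-to-Right Embedding
RLE = "\u202b"  # Right-to-Left Embedding
PDF = "\u202c"  # Pop Directional Formatting


def max_embedding_depth(text):
    """Return the deepest LRE/RLE nesting level reached in text."""
    depth = 0
    deepest = 0
    for ch in text:
        if ch in (LRE, RLE):
            depth += 1
            deepest = max(deepest, depth)
        elif ch == PDF and depth > 0:
            # A PDF closes the most recent embedding; a PDF with
            # nothing open is ignored.
            depth -= 1
    return deepest
```

A Hebrew sentence embedding an English phrase, as in RLE + "ABC " + LRE + "def" + PDF + PDF, yields a depth of 2, which is the kind of case the design tries to avoid.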

Bidi Classification

Characters can be classified according to their bidi type as described in the Unicode Standard (see Bidirectional_Character_Types for a full description of the bidi types). For our purpose, we will distinguish the following types of characters:

Text Analysis

In all the structured expressions that we are addressing, some characters play a special syntactic role. We will call these characters "separators", and the pieces of text between separators "tokens". The separators vary according to the type of structured expression. Often they are punctuation marks such as colon (:), backslash (\) and full stop (.), or mathematical signs such as plus (+) or equals (=).
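As a rough illustration of the separator/token split for expressions with simple internal structure, here is a minimal Python sketch. The function name split_tokens and the convention of passing the separators as a string of single characters are assumptions made for this example; expressions with complex syntax would need a real parser instead.

```python
def split_tokens(expression, separators):
    """Split an expression into ("token", text) and ("separator", char)
    pairs, kept in logical order."""
    parts = []
    token_start = 0
    for index, ch in enumerate(expression):
        if ch in separators:
            # Emit the token that ended just before this separator, if any.
            if index > token_start:
                parts.append(("token", expression[token_start:index]))
            parts.append(("separator", ch))
            token_start = index + 1
    # Emit the trailing token, if the expression does not end on a separator.
    if token_start < len(expression):
        parts.append(("token", expression[token_start:]))
    return parts
```

For a file path, the separators would be colon, backslash and full stop, so a call such as split_tokens("c:\\dir\\file.txt", ":\\.") yields the tokens c, dir, file and txt with the separators between them.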

Our objective is that, for display, the relative progression of the tokens and separators should always follow the base text direction of the text, while each token goes LTR or RTL depending on its content, according to the Unicode Bidirectional Algorithm (UBA).

For this to happen, the following must be done:

  1. Parse the expression to locate the separators and the tokens.
  2. While parsing, note the bidi classification of characters parsed.
  3. Depending on the bidi types of the characters before a token and in that token, an LRM or an RLM may have to be added. The algorithm for this is detailed below.
  4. If the expression has an LTR base direction and the component where it is displayed has an RTL orientation, add LRE+LRM at the beginning of the expression and LRM+PDF at its end.
  5. If the expression has an RTL base direction and the component where it is displayed has an LTR orientation, add RLE+RLM at the beginning of the expression and RLM+PDF at its end.
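Steps 2, 4 and 5 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design's implementation: the function names and the "ltr"/"rtl" string arguments are hypothetical, and step 3 (adding LRM or RLM before individual tokens) is intentionally left out. The bidi types come from the standard unicodedata.bidirectional function.

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def bidi_types(expression):
    """Step 2: note the Unicode bidi type of every character."""
    return [unicodedata.bidirectional(ch) for ch in expression]


def wrap_for_component(text, base_direction, component_orientation):
    """Steps 4 and 5: bracket the expression when its base direction
    differs from the orientation of the displaying component.

    Both direction arguments are expected to be "ltr" or "rtl".
    """
    if base_direction == "ltr" and component_orientation == "rtl":
        return LRE + LRM + text + LRM + PDF
    if base_direction == "rtl" and component_orientation == "ltr":
        return RLE + RLM + text + RLM + PDF
    # Directions agree: no bracketing is needed.
    return text
```

When the base direction and the component orientation agree, the text is returned unchanged, so the function can be applied unconditionally as the last processing step.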

The original structured expression, before addition of directional formatting characters, is called lean text.

The processed expression, after addition of directional formatting characters, is called full text.
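The relation between the two forms can be illustrated by going in the reverse direction. The Python sketch below recovers lean text from full text by dropping the five control characters used by this design. The name to_lean is hypothetical, and the approach assumes that the lean text never contains any of these characters itself, which this document does not guarantee.

```python
# The five directional formatting characters this design may add:
# LRM, RLM, LRE, RLE and PDF.
ADDED_CONTROLS = "\u200e\u200f\u202a\u202b\u202c"


def to_lean(full_text):
    """Remove every added directional formatting character.

    Assumption: the original lean text contained none of these
    characters, otherwise they would be removed as well.
    """
    return "".join(ch for ch in full_text if ch not in ADDED_CONTROLS)
```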

LRM Addition (Structured Text With LTR Base Text Direction)

An LRM will be added before a token if the following conditions are satisfied:

Examples (strings in logical order where "@" represents where an LRM should be added):

   HEBREW @= ARABIC
   HEBREW @= 123

OR

Examples (strings in logical order where "@" represents where an LRM should be added):

   ARABIC NUMBER 123 @< MAX
   ARABIC NUMBER 123 @< 456

RLM Addition (Structured Text With RTL Base Text Direction)

An RLM will be added before a token if the following conditions are satisfied:

Example (string in logical order where "&" represents where an RLM should be added):

   my_pet &= dog