
Shortcomings in Arabic support - in all modern systems

Imagine you want to write:

#linuxac

How does it look if you type it in the middle of an Arabic line? It comes out as linuxac#.
See the problem? The # sign ends up on the right instead of the left! You might think you could wrap it in code, i.e. #linuxac, but that doesn’t do the job: the overall direction of the text is still RTL, even inside a code span. So what’s the fix?

The “official” fix is to put an invisible Unicode character before the # sign that says the direction of the word is LTR, like this: ‎#linuxac. This character is called LRM, i.e. Left-to-Right Mark (U+200E).

The same goes for writing C++: the + signs show up on the left, so it displays as ++C, but we want them to be part of the English word. So we put an LRM after the last +, like this: C++‎, and the word’s direction comes out right.
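Here is a minimal sketch of the LRM trick in C++, assuming the text is handled as UTF-8. The helper names `lrmBefore` and `lrmAfter` are hypothetical and only illustrate where the mark goes; a real editor would need rules for deciding when to call them.

```cpp
#include <string>

// U+200E LEFT-TO-RIGHT MARK, encoded as UTF-8 (three bytes).
const std::string LRM = "\xE2\x80\x8E";

// Hypothetical helper: put an LRM in front of a token that starts with a
// direction-neutral character, e.g. "#linuxac".
std::string lrmBefore(const std::string& token) {
    return LRM + token;
}

// Hypothetical helper: put an LRM behind a token that ends with a
// direction-neutral character, e.g. "C++".
std::string lrmAfter(const std::string& token) {
    return token + LRM;
}
```

The mark works because it is a strong left-to-right character with no visible glyph: the # or + now sits between two left-to-right characters, so it takes their direction instead of the direction of the surrounding Arabic.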

These things are supposed to be handled by the text editor: if you’re working in an editor with decent Arabic support, it should figure out on its own that you’re now typing in English and insert the appropriate marks by itself.

The problem is that these marks are invisible. If the editor is not that decent, it may put the marks in the wrong place and mess up the whole text, and because the marks are invisible, it will be very hard to fix the editor’s mistakes.

The question is: can we come up with simple rules that let the editor know where to put the appropriate marks, and remove them automatically once they’re no longer needed? And all this while giving the user full control over how the marks are placed, without the editor getting in the way?

I remember that back in my Windows days, Word used to write C++ correctly. How? I now think it was doing what I just described: putting an LRM after the end of the word, like this: C++‎

wops, embedding levels, again

Last time I thought I “got” embedding levels right.

Turns out, not really!

My mistake was making assumptions about the structure: I thought a 222 embedding run must sit inside a 11111 run, as in 11222221111, but as it turns out, it could simply be 22211. The code I posted in my previous post does not work for this case.

Man, I should’ve just read the specs from the start.

So, discard my previous code.

Btw, this fixes the bug I talked about with an LTR sentence crossing a line boundary:

[image: fix]

progress on scribus’ complex layout

It turned out there were still many issues left since my last post (one of which I blogged about in the previous post).

Big thanks to Zayed for testing and reporting issues.

There’s still much to do, but the biggest show-stopper now is when English words cross line boundaries (which is something I didn’t think about before).

[image: ltr with right align]

This could arguably be the intended behavior, depending on whether or not we want automatic paragraph direction detection.

However, the real issue underlying this is that Scribus has no “RTL” mode. There’s something similar, but it’s hidden deep in the properties box, and it reverses everything, on the character level.

The real meaning of embedding levels

As I usually do with a lot of things, I skimmed through, glossing over the details. And so I thought embedding levels were merely a coding scheme that I could ignore, and that I could just pretend an “odd” level means an RTL run. (I’m referring to my adventure with adding Arabic support (complex layout) to Scribus’s text frame.)

I was wrong.

The problem manifests itself clearly when some Arabic sample contains numbers. Because numbers run from left to right, they have an even embedding level. Consider this:

R1 ## R2

Where R1, R2 are RTL segments, and ## is a number.

You’d think this breaks the text into 3 runs: RR LL RR, but this is wrong! The whole thing is an RTL run, but it has an LTR run inside it.

Meaning, the text is to be displayed (visually) like this:

R2 ## R1

The text starts from the right, (even though it’s aligned to the left in this post).

You can’t describe this with merely linear monodirectional segments: you’ll get the wrong result. You have to have a “tree” of runs, where a run could have child runs embedded within it, so that it doesn’t break the surrounding text at the wrong place.

That’s why embedding levels are the way they are.

The above example would resolve to embedding levels (roughly) of:

11 22 11

(I say roughly because I’m not entirely sure how spaces are handled). The level for the number (sequence of digits) is 2, not 0.

So what we need to do, is reverse inner runs before reversing outer ones.

For comparison:

Edit:

I had some code here before, but it was wrong too :) so I deleted it.
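For reference, “reverse inner runs before outer ones” is roughly what rule L2 of the Unicode Bidirectional Algorithm says. The following is a minimal sketch of that rule, not the code that was deleted above. It assumes the embedding levels have already been resolved (one per character) and that the input is a single line; the function name `reorderLine` is made up for this example, and mirroring and shaping are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Reorder one line from logical to visual order, given one resolved
// embedding level per character (sketch of UBA rule L2 only).
std::u32string reorderLine(std::u32string text, const std::vector<int>& levels) {
    if (text.empty() || levels.size() != text.size()) return text;

    int highest = *std::max_element(levels.begin(), levels.end());
    int lowestOdd = highest + 1;  // stays above `highest` if no level is odd
    for (int lv : levels)
        if (lv % 2 == 1) lowestOdd = std::min(lowestOdd, lv);

    // Innermost (highest) levels first, outermost last.
    for (int level = highest; level >= lowestOdd; --level) {
        std::size_t i = 0;
        while (i < text.size()) {
            if (levels[i] < level) { ++i; continue; }
            // Find the end of the stretch whose levels are all >= level.
            std::size_t j = i;
            while (j < text.size() && levels[j] >= level) ++j;
            std::reverse(text.begin() + i, text.begin() + j);
            i = j;
        }
    }
    return text;
}
```

With Latin letters standing in for RTL letters, the logical text `ab 12 cd` with levels 1 1 1 2 2 1 1 1 first gets its level-2 stretch reversed (`ab 21 cd`) and then the whole level-1 stretch, giving `dc 12 ba`: the number reads left to right again, while R1 and R2 swap sides. A 22211 sequence works the same way without any special casing.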

Scribus Shaping

Behold:

[image: shaping]

Actually .. the code is very crude. This is done using HarfBuzz, but not to its fullest, and there are lots of holes that I still don’t know how to fill, and a bunch of work-arounds that shouldn’t really happen.

Here is one problem for instance: when a sentence contains a lam-alef (ligature), the last (x) character in its run gets repeated:

[image: wops]

Quick Update:

I think I managed to fix the issues with runs containing lam-alef, here see for yourself:

[image: la]

Scribus & GNU FriBidi

NOTE: Please do not use this as a reference or tutorial for fribidi, it contains incorrect information, I’m keeping it unchanged just for the historical record. Some stuff here is just plain wrong.

In my previous post I talked briefly about scribus’ problem.

The result I got so far depended solely on GNU FriBidi. No HarfBuzz yet.

It’s true that HarfBuzz is the library to use for text layout, but:

http://mces.blogspot.com/2009/11/pango-vs-harfbuzz.html

HarfBuzz only does shaping (….) [it] doesn’t provide:

  • An itemizer
  • A Unicode Bidirection Algorithm implementation
  • A Unicode Line Breaking implementation
  • Glyph rasterization
  • Glyph metrics information
  • etc

So it will get us the shaping and stuff, but not the bidi ordering and line breaking.

The GNU FriBidi API is quite simple, though not in an obvious way; at least if you’re studying it for the first time without prior exposure to the bidi issue and the unicode bidirectional algorithm.

The “core” of the api is the fribidi_get_par_embedding_levels function. Embedding levels are used to determine directional runs.

Here’s the setup code I used (roughly):

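    // One embedding level and one bidi type per input character.
    embeddingLevels = new FriBidiLevel[inputLength];
    FriBidiCharType *bidi_types = new FriBidiCharType[inputLength];
    fribidi_get_bidi_types (inputString, inputLength, bidi_types);
    // Pick the paragraph's base direction from its characters.
    baseDir = fribidi_get_par_direction(bidi_types, inputLength);
    // Returns the maximum level found plus one, or zero on error.
    FriBidiLevel maxLevelPlusOne = fribidi_get_par_embedding_levels(bidi_types, inputLength, &baseDir, embeddingLevels);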

I’m not entirely sure if calling fribidi_get_par_direction is actually needed, but besides that, the embedding levels allow you to determine whether a certain character is part of an RTL or LTR run. If the embedding level is even, it’s part of an LTR run; if it’s odd, it’s part of an RTL run.

    /**
        Does character at index have an RTL embedding level?
     */
    bool BidiInfo::isRtlEmbedding(int index)
    {
        return embeddingLevels[index] % 2 == 1; // odd embedding levels are part of an RTL run
    }

Then we want to get ranges for runs, so we just scan the text until the run changes, and we have the start and end of a run. The way I did that is simple: nextRun(start, limit) searches for the start of the next run, starting the search from start and ending it at limit. The usage is intended to be something like this:

    int start = 0;
    while (start < length) {
        int end = nextRun(start, length);
        // [start, end) is now a run, do something with it
        start = end;
    }
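A self-contained sketch of what such a nextRun could look like is below. It is a hypothetical stand-in, not the code from the Scribus patch: it takes the level array as an explicit argument and treats any change in embedding level as a run boundary, whereas the original may only have looked at odd versus even. The `Run` struct and `splitRuns` are likewise made up for the example.

```cpp
#include <vector>

// Hypothetical stand-in for nextRun(start, limit): returns the index of the
// first character at or after `start` whose embedding level differs from
// levels[start], or `limit` if the run extends that far.
int nextRun(const std::vector<int>& levels, int start, int limit) {
    int end = start;
    while (end < limit && levels[end] == levels[start]) ++end;
    return end;
}

// A run covers the half-open range [start, end) at a single level.
struct Run { int start; int end; int level; };

// Walk the whole text with the loop shown above and collect the runs.
std::vector<Run> splitRuns(const std::vector<int>& levels) {
    std::vector<Run> runs;
    const int length = static_cast<int>(levels.size());
    int start = 0;
    while (start < length) {
        int end = nextRun(levels, start, length);
        runs.push_back({start, end, levels[start]});
        start = end;
    }
    return runs;
}
```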

With that, we check if the run is RTL, and if so, we reverse the characters in that run to get a bidirectional display of the text. The way I did the reversing is a bit too much of a detail to be included here.

I only do this stuff after the textframe layout method has done its work, and that’s for a good reason: we have to do the reordering on a per-line basis, otherwise you get problems. And so we need to find out where lines start and end, and so what I did was “watch” the layout process as it happens, and whenever we spot a new line occurring, we record it; in other words, I injected some code everywhere I saw code that handles line breaks, which was about 4 places. This resulted of course in some duplicate code, but I tried to keep it to a minimum: 1 line.
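To illustrate why the reordering has to happen per line, here is a rough sketch. It assumes the recorded line breaks are available as a sorted list of start offsets (`lineStarts`, a hypothetical name, first entry 0, all within the text) and that there is one embedding level per character; the reordering itself follows rule L2 of the Unicode Bidirectional Algorithm rather than whatever the Scribus patch did at the time.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Logical-to-visual reordering of a single line (UBA rule L2 only).
std::u32string visualOrder(std::u32string line, const std::vector<int>& levels) {
    if (line.empty() || levels.size() != line.size()) return line;
    int highest = *std::max_element(levels.begin(), levels.end());
    int lowestOdd = highest + 1;  // no odd level means nothing to reverse
    for (int lv : levels)
        if (lv % 2 == 1) lowestOdd = std::min(lowestOdd, lv);
    for (int level = highest; level >= lowestOdd; --level) {
        std::size_t i = 0;
        while (i < line.size()) {
            if (levels[i] < level) { ++i; continue; }
            std::size_t j = i;
            while (j < line.size() && levels[j] >= level) ++j;
            std::reverse(line.begin() + i, line.begin() + j);
            i = j;
        }
    }
    return line;
}

// Cut the text at the recorded line starts and reorder each line on its own.
std::vector<std::u32string> reorderPerLine(const std::u32string& text,
                                           const std::vector<int>& levels,
                                           const std::vector<std::size_t>& lineStarts) {
    std::vector<std::u32string> lines;
    for (std::size_t k = 0; k < lineStarts.size(); ++k) {
        const std::size_t begin = lineStarts[k];
        const std::size_t end =
            (k + 1 < lineStarts.size()) ? lineStarts[k + 1] : text.size();
        std::vector<int> lineLevels(levels.begin() + begin, levels.begin() + end);
        lines.push_back(visualOrder(text.substr(begin, end - begin), lineLevels));
    }
    return lines;
}
```

Take an all-RTL text `ABCDEF` broken after the third character. Reordering per line gives `CBA` on the first line and `FED` on the second, so the start of the text stays on the first line. Reversing the whole text first and breaking afterwards would put `FED` on top, which is the kind of problem the per-line approach avoids.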

scribus, bidi, and arabic shaping

Scribus has had this bug report regarding Arabic support for five years now (it’s been open since December 2004, and now it’s December 2009): #1079

But wait, what is Scribus in the first place? Well, it’s a desktop publishing program, and it’s free (FOSS).

I’m actually not an active user of the program. I got introduced to the problem by OMLX, during the arabteam’s first programming contest. His friend Zeyad at itwadi.net had a little adventure with the problem.

Scribus uses Qt, which does support Arabic and Bidirectional text properly, but the problem lies in the textframe layout; it uses a custom layout “engine”.

This layout “engine” is mostly a monstrous 1500-line method filled with spaghetti code. No wonder that bug report has been open for 5 years.

All the pieces needed to build the support for Arabic are out there: HarfBuzz and GNU FriBidi. HarfBuzz mostly just lacks documentation; but any layout guru should be able to make use of it. It’s used in both Pango (GTK+) and Qt. It’s even used in the Linux port of Google Chrome.

I personally am not a layout guru in any way, shape, form, or sense of the word. I don’t even know the first thing about text-layout. But I know something about bidi: after all, I have created The Free Ressam (shameless self promotion). It’s a tool to do “fake” bidi and arabic-shaping at the text level; it transforms the text stream so that a layout-engine that’s not capable of bidi can be fed something that will seem as if it’s bidi. It doesn’t conform to the unicode bidirectional algorithm, and works only with Arabic, and when I did it I didn’t really know that fribidi already does a better job at it!

I tried to dig into HarfBuzz, but so far it hasn’t been so fruitful. Though, I’ve been sorta successful with GNU FriBidi. I managed to get the gist of the api (it’s documented through man pages), and my attempt to integrate it into PageItem_TextFrame::layout seems to be getting somewhere.

Bidi is kinda there, but not quite. Shaping is not there yet (actually it was there in an earlier test, but it was so buggy and hackish that I had to take it away and just focus on getting bidi right first).

Look at this picture:

[image: scribus]

The text below (where it says Qt) is the story editor: it uses a plain text Qt text area, which has all the proper support for Arabic.

The text above (where it says TextFrame) is, obviously, the text frame. You can see that Arabic runs are displayed right to left, but the letters are disjoint. That’s because I’m not playing with the glyph selection process, but it’s ok, my first priority now is fixing the bugs in the bidi part of the problem.

Two problems appear in my sample text:

Something is wrong with line breaking: the first character in the RTL run that crosses a line break is missing.

Something is wrong with the end of the text: the last few characters aren’t detected properly as RTL. This one has been killing me for a while now!! It probably has to do with text length, or maybe the new lines have \r characters which are somehow confusing the text length when I transform it from QString to FriBidiChar *. I eliminated some possibilities for the cause of this bug, but others are still open.

Also, sometimes I experience crashes with Signal #11, which I was told is a memory access error (on the scribus-dev irc channel).

So, things are still shaky, but inshalla we’re getting somewhere.

Quick Update!

Just as I was writing the last paragraph, I realized what the bugs were, and I fixed them!

Here’s an updated picture of the current state of affairs on the bidi front:

[image: bidi]

Addendum

I’m putting my changes public on github: http://github.com/hasenj/scribus/tree/hasen

I talked to andreas “avox” on the dev channel yesterday, he pointed me to a series of patches by pierre that use harfbuzz. http://bugs.scribus.net/view.php?id=4645

Those patches don’t seem related to shaping, but they do use HarfBuzz, so at least they can serve as pointers when I go back to explore HarfBuzz some more. I don’t have a log of the conversation so I forgot whether pierre was in fact working on shaping or not.

For now I still have some work to do on the bidi front: in the current state of affairs, the whole text in the text frame is treated as a single paragraph. This is bad because I’m specifying the base paragraph direction according to the first character with a hard-wired direction. So if the text has 2 paragraphs, the first Arabic and the second English, then all of the text (i.e. including the English paragraph) will be treated as having an RTL base direction, which affects how neutral characters (such as punctuation marks) are positioned.
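A rough sketch of what per-paragraph detection could look like is below. It is not FriBidi code: the `strongDir` classifier is a crude assumption for illustration (ASCII letters count as strong LTR, anything in the Hebrew and Arabic blocks as strong RTL, everything else as neutral), and paragraphs are assumed to be separated by '\n'. A real implementation would take the character types from the bidi library instead.

```cpp
#include <string>
#include <vector>

enum class Dir { LTR, RTL, Neutral };

// Crude strong-direction test, an assumption for this sketch only:
// ASCII letters are strong LTR, U+0590..U+06FF (Hebrew and Arabic blocks)
// is treated as strong RTL, and everything else is neutral.
Dir strongDir(char32_t c) {
    if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z')) return Dir::LTR;
    if (c >= 0x0590 && c <= 0x06FF) return Dir::RTL;
    return Dir::Neutral;
}

// Base direction of each '\n'-separated paragraph: the direction of its
// first strong character, falling back to LTR when it has none.
std::vector<Dir> paragraphDirs(const std::u32string& text) {
    std::vector<Dir> dirs;
    Dir current = Dir::Neutral;
    for (char32_t c : text) {
        if (c == U'\n') {
            dirs.push_back(current == Dir::Neutral ? Dir::LTR : current);
            current = Dir::Neutral;  // the next paragraph starts fresh
        } else if (current == Dir::Neutral) {
            current = strongDir(c);
        }
    }
    dirs.push_back(current == Dir::Neutral ? Dir::LTR : current);
    return dirs;
}
```

With this, an Arabic paragraph followed by an English one gets RTL for the first and LTR for the second, so the punctuation in the English paragraph is no longer positioned as if it belonged to an RTL paragraph.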