scribus, bidi, and arabic shaping

Scribus has had this bug report regarding Arabic support for five years now (it’s been open since December 2004, and now it’s December 2009): #1079

But wait, what is Scribus in the first place? Well, it’s a desktop publishing program, and it’s free (FOSS).

I’m actually not an active user of the program. I was introduced to the problem by OMLX during the arabteam’s first programming contest. His friend Zeyad at itwadi.net had a little adventure with the problem.

Scribus uses Qt, which does support Arabic and Bidirectional text properly, but the problem lies in the textframe layout; it uses a custom layout “engine”.

This layout “engine” is mostly one monstrous 1,500-line method filled with spaghetti code. No wonder that bug report has been open for five years.

All the pieces needed to build support for Arabic are out there: HarfBuzz and GNU FriBidi. HarfBuzz mostly just lacks documentation, but any layout guru should be able to make use of it. It’s used in both Pango (GTK+) and Qt. It’s even used in the Linux port of Google Chrome.

I personally am not a layout guru in any way, shape, form, or sense of the word. I don’t even know the first thing about text layout. But I know something about bidi: after all, I created The Free Ressam (shameless self-promotion). It’s a tool that does “fake” bidi and Arabic shaping at the text level: it transforms the text stream so that a layout engine that’s not capable of bidi can be fed something that will look as if it’s bidi. It doesn’t conform to the Unicode Bidirectional Algorithm, and it works only with Arabic. When I wrote it, I didn’t really know that FriBidi already does a better job!
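To make the “fake bidi” idea concrete, here is a minimal Python sketch. It is not The Free Ressam’s actual code; the function names (is_rtl, fake_bidi) are made up for illustration. It assumes an LTR base direction and simply reverses each run of strong RTL characters, so a renderer that only draws left to right shows the run in the right visual order. It ignores digits, combining marks, mirrored brackets, and everything else the real Unicode Bidirectional Algorithm handles.

```python
import unicodedata

# Bidi classes that mark a character as strongly right-to-left.
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch: str) -> bool:
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def fake_bidi(line: str) -> str:
    """Reverse each RTL run in a single line (hypothetical helper).

    Assumes an LTR base direction. Spaces between two RTL words are kept
    inside the run, so a multi-word Arabic phrase is reversed as a whole.
    """
    out = []
    i, n = 0, len(line)
    while i < n:
        if not is_rtl(line[i]):
            out.append(line[i])
            i += 1
            continue
        # Extend the run over RTL characters and the spaces between them.
        j, end = i, i
        while j < n and (is_rtl(line[j]) or line[j] == " "):
            if is_rtl(line[j]):
                end = j + 1  # the run stops after the last RTL character
            j += 1
        out.append(line[i:end][::-1])
        i = end
    return "".join(out)


# "Qt <seen lam alef meem> ok": only the Arabic word gets reversed.
sample = "Qt \u0633\u0644\u0627\u0645 ok"
print([hex(ord(c)) for c in fake_bidi(sample)])
```

The output string is in visual order, which is exactly why this approach is “fake”: the stored text no longer matches what the user typed, so searching, editing, and line breaking all get harder.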

I tried to dig into HarfBuzz, but so far it hasn’t been very fruitful. I’ve been sorta successful with GNU FriBidi, though. I managed to get the gist of the API (it’s documented through man pages), and my attempt to integrate it into PageItem_TextFrame::layout seems to be getting somewhere.

Bidi is kinda there, but not quite. Shaping is not there yet (actually it was there in an earlier test, but it was so buggy and hackish that I had to take it out and just focus on getting bidi right first).

Look at this picture:

[Screenshot: Scribus, with the text frame above and the story editor below]

The text below (where it says Qt) is the story editor: it uses a plain text Qt text area, which has all the proper support for Arabic.

The text above (where it says TextFrame) is, obviously, the text frame. You can see that Arabic runs are displayed right to left, but the letters are disjoint. That’s because I’m not touching the glyph selection process yet, but that’s OK: my first priority now is fixing the bugs in the bidi part of the problem.
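For readers wondering what shaping actually involves, here is a minimal Python sketch of the idea. It is an illustration only, not what Scribus, FriBidi, or HarfBuzz do internally: real shaping is done by the font engine on glyphs, while this sketch swaps letters for their Unicode presentation forms. The FORMS table covers just five letters, and the names FORMS, joins_forward, and shape are made up for this example. Each letter gets an isolated, final, initial, or medial form depending on whether its neighbours connect to it.

```python
# letter -> (isolated, final, initial, medial) presentation forms.
# Right-joining letters such as ALEF never connect to the letter after
# them, so they have no initial or medial form (None).
FORMS = {
    "\u0627": ("\uFE8D", "\uFE8E", None, None),          # ALEF
    "\u0628": ("\uFE8F", "\uFE90", "\uFE91", "\uFE92"),  # BEH
    "\u0633": ("\uFEB1", "\uFEB2", "\uFEB3", "\uFEB4"),  # SEEN
    "\u0644": ("\uFEDD", "\uFEDE", "\uFEDF", "\uFEE0"),  # LAM
    "\u0645": ("\uFEE1", "\uFEE2", "\uFEE3", "\uFEE4"),  # MEEM
}


def joins_forward(ch: str) -> bool:
    """True if the letter can connect to the letter that follows it."""
    forms = FORMS.get(ch)
    return forms is not None and forms[2] is not None


def shape(text: str) -> str:
    """Pick a contextual form for each letter; text is in logical order."""
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:
            out.append(ch)  # not a letter we know: pass it through
            continue
        isolated, final, initial, medial = forms
        prev_joins = i > 0 and joins_forward(text[i - 1])
        next_joins = (
            initial is not None
            and i + 1 < len(text)
            and text[i + 1] in FORMS
        )
        if prev_joins and next_joins:
            out.append(medial)
        elif prev_joins:
            out.append(final)
        elif next_joins:
            out.append(initial)
        else:
            out.append(isolated)
    return "".join(out)


# BEH ALEF BEH: initial BEH, final ALEF, then an isolated BEH,
# because ALEF does not connect to what follows it.
print([hex(ord(c)) for c in shape("\u0628\u0627\u0628")])
# ['0xfe91', '0xfe8e', '0xfe8f']
```

Note that shaping has to look at the text in logical order, so it belongs before any reordering step. The sketch also skips ligatures such as lam-alef and ignores diacritics.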

Two problems appear in my sample text:

Something is wrong with line breaking: the first character in the RTL run that crosses a line break is missing.

Something is wrong with the end of the text: the last few characters aren’t detected properly as RTL. This one has been killing me for a while now! It probably has to do with text length, or maybe the newlines contain \r characters that are somehow throwing off the text length when I convert from QString to FriBidiChar *. I’ve eliminated some possible causes of this bug, but others are still open.

Also, I sometimes get crashes with signal 11, which I was told on the scribus-dev IRC channel is a memory access error (a segmentation fault).

So, things are still shaky, but inshalla we’re getting somewhere.

Quick Update!

Just as I was writing the last paragraph, I realized what the bugs were, and I fixed them!

Here’s an updated picture of the current state of affairs on the bidi front:

[Screenshot: updated bidi rendering in the Scribus text frame]

Addendum

I’m making my changes public on GitHub: http://github.com/hasenj/scribus/tree/hasen

I talked to Andreas “avox” on the dev channel yesterday, and he pointed me to a series of patches by Pierre that use HarfBuzz: http://bugs.scribus.net/view.php?id=4645

Those patches don’t seem related to shaping, but they do use HarfBuzz, so at least they can serve as pointers when I go back to explore HarfBuzz some more. I don’t have a log of the conversation, so I can’t recall whether Pierre was in fact working on shaping or not.

For now I still have some work to do on the bidi front: in the current state of affairs, the whole text in the text frame is treated as a single paragraph. This is bad because I set the base paragraph direction from the first character that has a strong (hard-wired) direction. So if the text has two paragraphs, the first Arabic and the second English, then all of the text (including the English paragraph) is treated as having an RTL base direction, which affects how neutral characters (such as punctuation marks) are positioned.
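The per-paragraph fix can be sketched in a few lines of Python. This is only an illustration of the approach, not the Scribus patch: the function names (base_direction, paragraph_directions) are hypothetical, and it assumes paragraphs are separated by \n, \r, or \r\n. Each paragraph gets its own base direction from its first strongly directional character, which is roughly what rules P2 and P3 of the Unicode Bidirectional Algorithm prescribe.

```python
import unicodedata


def base_direction(paragraph: str, default: str = "LTR") -> str:
    """Direction of the first strong character, or default if none."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    # Only neutrals, digits, or nothing at all: fall back to the default.
    return default


def paragraph_directions(text: str):
    """Split text into paragraphs and resolve each one independently."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return [(p, base_direction(p)) for p in normalized.split("\n")]


# An Arabic paragraph followed by an English one: each gets its own
# base direction instead of inheriting RTL from the first paragraph.
text = "\u0633\u0644\u0627\u0645.\rHello, world."
for paragraph, direction in paragraph_directions(text):
    print(direction, len(paragraph))
# RTL 5
# LTR 13
```

With the directions resolved per paragraph, the trailing period of the English paragraph stays on the right, where an LTR reader expects it.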