Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode "Grapheme Break" support (user-perceived character breaks, with libunistring) #1603

Open
Explorer09 opened this issue Feb 8, 2025 · 10 comments
Labels
build system 🔧 Affects the build system rather then the user experience dependencies Pull requests that update a dependency file enhancement Extension or improvement to existing feature

Comments

@Explorer09
Copy link
Contributor

This is a discussion of a feature that would use a library.

htop often needs breaks of file paths of command lines, as well as other strings that may contain arbitrary Unicode text.

In discussions of #462 and #854, about shortening a path of a working directory for display, I think it becomes apparent that we need to support the Unicode algorithm of user-perceived character boundaries.

In Unicode, this is defined in UAX 29 section 3:
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
with a normative database:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt

Not to be confused with Unicode line breaking algorithm (UAX 14).

What are the "Grapheme Cluster Boundaries" in Unicode and why are they important for htop?

When htop needs to truncate a file path or a command line argument string, it is supposed to remove "characters" from the path string until the string fits the given terminal column width. However, Unicode characters (code points) do not always have one-to-one relationships to the "characters" that user could perceive as one unit.

Special cases can include:

  • Combining character sequences
  • Character sequences with ZWJ (Zero Width Joiner, U+200D), commonly found in emoji sequences.
  • Regional indicator symbols (a.k.a. flag emojis). These regional indicator character should be in pairs and break cannot happen within each pair.
  • Languages that use a syllable as the minimum grapheme unit. This is the least concern for htop but I mention for completeness.

I don't think people would like a series of United Kingdom flag emojis (🇬🇧🇬🇧🇬🇧🇬🇧) suddenly transform into Bulgaria flag emojis plus a letter B (🇧🇬🇧🇬🇧🇬🇧) when characters are cut at a wrong location. This is just an example.

The grapheme cluster boundary algorithm requires a database, and I'm not wishing to embed the whole database into htop. What I am suggesting is to incorporate libunistring, a library from GNU that had the database for our needs. Specifically the unigbrk.h APIs.

The license for libunistring is "LGPLv3+ or GPLv2+" (dual-licensed), and so is compatible to htop's license (GPLv2+).

@BenBE
Copy link
Member

BenBE commented Feb 8, 2025

Is using the library doable in a way that makes runtime binding to this library feasible?
Or would this have to be a compile-time configure option?

@BenBE BenBE added enhancement Extension or improvement to existing feature build system 🔧 Affects the build system rather then the user experience dependencies Pull requests that update a dependency file labels Feb 8, 2025
@Explorer09
Copy link
Contributor Author

@BenBE It's likely a configure option with a possibly of statically linking it. The issue I see is the library's soname version is not quite stable as I would expect. Example:

https://packages.debian.org/bookworm/libunistring2
https://packages.debian.org/trixie/libunistring5

@BenBE
Copy link
Member

BenBE commented Feb 8, 2025

How stable is its API? Or will we be chasing API change after API change for this lib?

@Explorer09
Copy link
Contributor Author

@BenBE The API I was considering right now is uc_graphemeclusterbreak_property(), that is, simply retrieving the "Grapheme_Cluster_Break" property from the database. The other functions can't work directly with htop's RichString so it's likely I need to build wrappers for them. (They can work with Unicode strings stored in uint8_t, uint16_t or uint32_t arrays but do not support a custom iterator that I need for RichString.)

@rubyFeedback
Copy link

Compiling libunistring (https://ftp.gnu.org/gnu/libunistring/libunistring-1.3.tar.xz) works fine for me, never had an issue, so this option would not affect me negatively (I think).

I do, however had, have one question:

  • Would it be possible to retain htop's old behaviour, e. g. via --disable-libunistring at configure-time? Being flexible may be useful here, if for some reason a user may not want or need libunistring or wishes to retain the old behaviour of htop. So in other words, whether libunistring would be optional or become mandatory. (As said, I don't mind either way, but I think it may be useful to declare this up-front.)

@Explorer09
Copy link
Contributor Author

@rubyFeedback I'm personally also reluctant to introduce an additional dependency to htop, so yes, I expect the dependency can be turned off with a configure option as you said.

Currently, I'm trying other libraries with similar functionalities, for example libgrapheme, as I discovered some issues with libunistring that look like bugs. I'll post a summary of the issues I found in a later post.

@Explorer09
Copy link
Contributor Author

I was evaluating three libraries that have the potential of supporting grapheme break detection needed for htop.

The three libraries are: ICU4C from Unicode, libunistring from GNU, and libgrapheme from suckless.org.

htop has an unusual requirement when it comes to grapheme break detection: Due to the use of ncurses, our Unicode strings are not always stored as arrays of UTF-32 characters (uint32_t []) or UTF-8 bytes (char []). Rather, our types used are cchar_t and chtype (both from curses.h). Therefore many grapheme break APIs that operate on UTF-8 arrays or UTF-32 arrays can't work with us. We largely need to roll out an iterator on our own.

I have a rough idea on how an iterator can look like. In htop's OOP style:

typedef struct GraphemeBreakScanner_ {
   size_t charPosition;
   wchar_t lastChar;
   /* internal */ graphemeBreakState;
} GraphemeBreakScanner;

void GraphemeBreakScanner_reset(GraphemeBreakScanner *this);
// "direction" may be 0 ("forward") or 1 ("backward")
// When scanning forward, returns true if a grapheme boundary happens before the just added character.
// When scanning backward, returns true if a grapheme boundary happens after the just added character.
bool GraphemeBreakScanner_scanChar(GraphemeBreakScanner *this, wchar_t ch, int direction);

// Example use
{
   GraphemeBreakScanner gbs;
   bool ret;
   GraphemeBreakScanner_reset(&gbs);
   ret = GraphemeBreakScanner_scanChar(&gbs, L'A', 0 /*forward*/); /* Returns true */
   ret = GraphemeBreakScanner_scanChar(&gbs, L'B', 0 /*forward*/); /* Returns true */

   // U+AC00 (Hangul Syllable Ga) U+11A8 (Hangul Jongseong Kiyeok)
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC00, 0 /*forward*/); /* Returns true */
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11A8, 0 /*forward*/); /* Returns false */

   // U+AC01 (Hangul Syllable Gag) U+11BA (Hangul Jongseong Sios)
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC01, 0 /*forward*/); /* Returns true */
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11BA, 0 /*forward*/); /* Returns false */

   // | A | B | Ga(LV) ~g(T) | Gag(LVT) ~s(T)
   //                                   ^^^^^ lastChar

   // Scan backwards
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC01, 1 /*backward*/); /* Returns false */
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11A8, 1 /*backward*/); /* Returns true */
   ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC00, 1 /*backward*/); /* Returns false */

   // | A | B | Ga(LV) ~g(T) | Gag(LVT) ~s(T)
   //           ^^^^^^ lastChar

   ret = GraphemeBreakScanner_scanChar(&gbs, L'B', 1 /*backward*/); /* Returns true */

   // We can "scan forward" again with a different character.
   // It works as if the string is overwritten at the character position.
   ret = GraphemeBreakScanner_scanChar(&gbs, L'\0', 0 /*forward*/); /* Returns true */

   // | A | B | '\0'
   //           ^^^^ lastChar
}

For the three libraries just mentioned:

ICU

It has high OOP abstraction, and they are "abstracted" in a way that is difficult for a simple use case of detecting grapheme boundaries. Its main use requires creating a UBreakIterator object in C (its C++ counterpart is BreakIterator class), using ubrk_open(), with a UTF-16 string prepared for the scanner. I didn't find any lower API for just grabbing the "Grapheme Break" property of a particular character. Given the complex design of this library, I give up.

libunistring

libunistring is C only, and has a simpler API to use, if the strings are coded in UTF-8, UTF-16, or UTF-32 arrays. (If your text is in an encoding such as GB 18030, you are out of luck.) It's main scanning functions such as u8_grapheme_next() just requires you to supply two pointers (start of string and end of string) and it would return you a character pointer to the next grapheme boundary.

The lower level APIs are:

Also there are bugs during my testing. u8_grapheme_next() didn't work properly. The only functions that seem to work perfectly are u8_grapheme_breaks() and u16, u32, uc versions of the same function.

libgrapheme

libgrapheme is a C-only library as well.
grapheme_next_character_break_utf8() is the main function of finding the next grapheme break of a UTF-8 string. grapheme_next_character_break() is the UTF-32 counterpart. (But I'm not using any of these.)

grapheme_is_character_break() is the closest for the use case of GraphemeBreakScanner above. It uses a state so it can detect a Regional Indicator sequence.

The catch:

  • libgrapheme doesn't yet support Unicode 16. It didn't implement the GB9c "Indic_Conjunct_Break" rule that was added in Unicode 15.1.0.
  • The state of grapheme_is_character_break() is one way only - it only scans forward. There is no backward scanning at all.

That's mostly it. The lengthly report of Unicode string and grapheme break libraries.

One more thing. I would add #1038 as related to this. The detection of grapheme breaks in Unicode can make the backspace behavior in search function or text boxes more usable for users.

@BenBE
Copy link
Member

BenBE commented Feb 28, 2025

TL;DR: SNAFU. ;-)

I already heard not to much good about ICU, thus it didn't surprise me, that ICU4U suffers similarly. For non-UTF-8 encoding you'd thus basically have to go full ICU, convert once, do things and convert back. No thanks.

With libunistring you mentioned an unstable ABI IIRC? That's not something that is suitable if you want to have easy maintainence …

So libgrapheme, even with its caveats, sounds as the least horrible way to go rn …

@Explorer09
Copy link
Contributor Author

@BenBE What surprised me is that none of the three libraries provided the low level interfaces that I consider to Keep It Simple and to Do One Thing Well. While they provide interfaces to loop over strings that are properly UTF-8 or UTF-32 encoded, the world still exists other encodings that applications need to work with (the most notable one is GB 18030, but I consider C-quoted and escaped strings, numeric character reference in HTML and Quoted-Printable to also be in this category), and so I expected a lower level interface for that.

What I mean by Doing One Thing Well is that the scanner should not require the client code to pre-convert the whole string to UTF-8 or UTF-32 (that would require a malloc and free and limit its use in, e.g. signal handlers). The ideal approach for me is that the scanner accepts one character at the time (from the client code) and track the state with it.

My main concern with libunistring is not its ABI instability, but that uc_is_grapheme_break() didn't do the thing right in the API so it's useless to scan any complex sequence.

libgrapheme grapheme_is_character_break() works but the prototype isn't ideal. Below I try to state what the ideal API could look like.

bool uc_is_grapheme_break(ucs4_t a, ucs4_t b);
bool grapheme_is_character_break(uint_least32_t cp1, uint_least32_t cp2, GRAPHEME_STATE *state);
// The behavior is undefined if you call grapheme_is_character_break() out of sequence;
// It basically requires to do
//    grapheme_is_character_break(s[0], s[1], &state);
//    grapheme_is_character_break(s[1], s[2], &state);
//    grapheme_is_character_break(s[2], s[3], &state);
// ... etc., but it's easy to call the function wrong.

// The ideal API for me:
// "tribool" allows FALSE (==0), TRUE (==1), and UNKNOWN (==-1) three states.
typedef int tribool;
bool grapheme_state_reset(GRAPHEME_STATE *state);
bool grapheme_state_set_unknown(GRAPHEME_STATE *state);
tribool grapheme_scan_forward(uint_least32_t next_cp, GRAPHEME_STATE *state);
tribool grapheme_scan_backward(uint_least32_t prev_cp, GRAPHEME_STATE *state);

@Explorer09
Copy link
Contributor Author

Update. The recent development version of libgrapheme supports Unicode 15.1.0 (commit).

And I have one minor complaint with libgrapheme's API. It uses uint_least16_t for some of its function parameters, and AFAIK, the type is not a good idea for performance. They should have been using unsigned int instead. (Local variables with register storage and function parameters, if defined with types shorter than int, then the compiler may generate extra code just for clearing upper bits in the register. Bad for performance and bad for code size. uint_least16_t (usually typedef'd as unsigned short) are better used in arrays and structs and other memory storage.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build system 🔧 Affects the build system rather then the user experience dependencies Pull requests that update a dependency file enhancement Extension or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

3 participants