-
-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode "Grapheme Break" support (user-perceived character breaks, with libunistring) #1603
Comments
Is using the library doable in a way that makes runtime binding to this library feasible? |
@BenBE It's likely a https://packages.debian.org/bookworm/libunistring2 |
How stable is its API? Or will we be chasing API change after API change for this lib? |
@BenBE The API I was considering right now is |
Compiling libunistring (https://ftp.gnu.org/gnu/libunistring/libunistring-1.3.tar.xz) works fine for me, never had an issue, so this option would not affect me negatively (I think). I do, however had, have one question:
|
@rubyFeedback I'm personally also reluctant to introduce an additional dependency to htop, so yes, I expect the dependency can be turned off with a configure option as you said. Currently, I'm trying other libraries with similar functionalities, for example libgrapheme, as I discovered some issues with libunistring that look like bugs. I'll post a summary of the issues I found in a later post. |
I was evaluating three libraries that have the potential of supporting grapheme break detection needed for htop. The three libraries are: ICU4C from Unicode, libunistring from GNU, and libgrapheme from suckless.org. htop has an unusual requirement when it comes to grapheme break detection: Due to the use of I have a rough idea on how an iterator can look like. In htop's OOP style: typedef struct GraphemeBreakScanner_ {
size_t charPosition;
wchar_t lastChar;
/* internal */ graphemeBreakState;
} GraphemeBreakScanner;
void GraphemeBreakScanner_reset(GraphemeBreakScanner *this);
// "direction" may be 0 ("forward") or 1 ("backward")
// When scanning forward, returns true if a grapheme boundary happens before the just added character.
// When scanning backward, returns true if a grapheme boundary happens after the just added character.
bool GraphemeBreakScanner_scanChar(GraphemeBreakScanner *this, wchar_t ch, int direction);
// Example use
{
GraphemeBreakScanner gbs;
bool ret;
GraphemeBreakScanner_reset(&gbs);
ret = GraphemeBreakScanner_scanChar(&gbs, L'A', 0 /*forward*/); /* Returns true */
ret = GraphemeBreakScanner_scanChar(&gbs, L'B', 0 /*forward*/); /* Returns true */
// U+AC00 (Hangul Syllable Ga) U+11A8 (Hangul Jongseong Kiyeok)
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC00, 0 /*forward*/); /* Returns true */
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11A8, 0 /*forward*/); /* Returns false */
// U+AC01 (Hangul Syllable Gag) U+11BA (Hangul Jongseong Sios)
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC01, 0 /*forward*/); /* Returns true */
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11BA, 0 /*forward*/); /* Returns false */
// | A | B | Ga(LV) ~g(T) | Gag(LVT) ~s(T)
// ^^^^^ lastChar
// Scan backwards
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC01, 1 /*backward*/); /* Returns false */
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0x11A8, 1 /*backward*/); /* Returns true */
ret = GraphemeBreakScanner_scanChar(&gbs, (wchar_t)0xAC00, 1 /*backward*/); /* Returns false */
// | A | B | Ga(LV) ~g(T) | Gag(LVT) ~s(T)
// ^^^^^^ lastChar
ret = GraphemeBreakScanner_scanChar(&gbs, L'B', 1 /*backward*/); /* Returns true */
// We can "scan forward" again with a different character.
// It works as if the string is overwritten at the character position.
ret = GraphemeBreakScanner_scanChar(&gbs, L'\0', 0 /*forward*/); /* Returns true */
// | A | B | '\0'
// ^^^^ lastChar
} For the three libraries just mentioned: ICUIt has high OOP abstraction, and they are "abstracted" in a way that is difficult for a simple use case of detecting grapheme boundaries. Its main use requires creating a UBreakIterator object in C (its C++ counterpart is libunistringlibunistring is C only, and has a simpler API to use, if the strings are coded in UTF-8, UTF-16, or UTF-32 arrays. (If your text is in an encoding such as GB 18030, you are out of luck.) It's main scanning functions such as The lower level APIs are:
Also there are bugs during my testing. libgraphemelibgrapheme is a C-only library as well.
The catch:
That's mostly it. The lengthly report of Unicode string and grapheme break libraries. One more thing. I would add #1038 as related to this. The detection of grapheme breaks in Unicode can make the backspace behavior in search function or text boxes more usable for users. |
TL;DR: SNAFU. ;-) I already heard not to much good about ICU, thus it didn't surprise me, that ICU4U suffers similarly. For non-UTF-8 encoding you'd thus basically have to go full ICU, convert once, do things and convert back. No thanks. With libunistring you mentioned an unstable ABI IIRC? That's not something that is suitable if you want to have easy maintainence … So libgrapheme, even with its caveats, sounds as the least horrible way to go rn … |
@BenBE What surprised me is that none of the three libraries provided the low level interfaces that I consider to Keep It Simple and to Do One Thing Well. While they provide interfaces to loop over strings that are properly UTF-8 or UTF-32 encoded, the world still exists other encodings that applications need to work with (the most notable one is GB 18030, but I consider C-quoted and escaped strings, numeric character reference in HTML and Quoted-Printable to also be in this category), and so I expected a lower level interface for that. What I mean by Doing One Thing Well is that the scanner should not require the client code to pre-convert the whole string to UTF-8 or UTF-32 (that would require a My main concern with libunistring is not its ABI instability, but that libgrapheme bool uc_is_grapheme_break(ucs4_t a, ucs4_t b);
bool grapheme_is_character_break(uint_least32_t cp1, uint_least32_t cp2, GRAPHEME_STATE *state);
// The behavior is undefined if you call grapheme_is_character_break() out of sequence;
// It basically requires to do
// grapheme_is_character_break(s[0], s[1], &state);
// grapheme_is_character_break(s[1], s[2], &state);
// grapheme_is_character_break(s[2], s[3], &state);
// ... etc., but it's easy to call the function wrong.
// The ideal API for me:
// "tribool" allows FALSE (==0), TRUE (==1), and UNKNOWN (==-1) three states.
typedef int tribool;
bool grapheme_state_reset(GRAPHEME_STATE *state);
bool grapheme_state_set_unknown(GRAPHEME_STATE *state);
tribool grapheme_scan_forward(uint_least32_t next_cp, GRAPHEME_STATE *state);
tribool grapheme_scan_backward(uint_least32_t prev_cp, GRAPHEME_STATE *state); |
Update. The recent development version of libgrapheme supports Unicode 15.1.0 (commit). And I have one minor complaint with libgrapheme's API. It uses |
This is a discussion of a feature that would use a library.
htop often needs breaks of file paths of command lines, as well as other strings that may contain arbitrary Unicode text.
In discussions of #462 and #854, about shortening a path of a working directory for display, I think it becomes apparent that we need to support the Unicode algorithm of user-perceived character boundaries.
In Unicode, this is defined in UAX 29 section 3:
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
with a normative database:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
Not to be confused with Unicode line breaking algorithm (UAX 14).
What are the "Grapheme Cluster Boundaries" in Unicode and why are they important for htop?
When htop needs to truncate a file path or a command line argument string, it is supposed to remove "characters" from the path string until the string fits the given terminal column width. However, Unicode characters (code points) do not always have one-to-one relationships to the "characters" that user could perceive as one unit.
Special cases can include:
I don't think people would like a series of United Kingdom flag emojis (🇬🇧🇬🇧🇬🇧🇬🇧) suddenly transform into Bulgaria flag emojis plus a letter B (🇧🇬🇧🇬🇧🇬🇧) when characters are cut at a wrong location. This is just an example.
The grapheme cluster boundary algorithm requires a database, and I'm not wishing to embed the whole database into htop. What I am suggesting is to incorporate libunistring, a library from GNU that had the database for our needs. Specifically the
unigbrk.h
APIs.The license for libunistring is "LGPLv3+ or GPLv2+" (dual-licensed), and so is compatible to htop's license (GPLv2+).
The text was updated successfully, but these errors were encountered: