Replace all CVTUTF code #567
Conversation
Force-pushed from b9c495e to 6e75be6
@LukeShu Thanks for your proposal. I agree with replacing the current implementation.
What is the plan here? Is it safe to assume you are going to merge this, and therefore that it should be fine to use this patch in Fedora to avoid legal issues?
I haven't reviewed this yet. Don't rush this.
I'm not sure what was up with those large "before" numbers, but now, being more careful about CPU throttling, background processes, and cron jobs, I'm seeing much smaller deltas (generally 1-3µs/op, or a 1-7% improvement), though I am also seeing some higher standard deviations.
For distros still shipping Ruby 2.7 packages, I've backported this to ruby-json 2.3.0 (the version bundled with Ruby 2.7.8). Parabola is currently shipping:
Feel free to grab any of those tarballs or Git tags.
The parser code seems unrelated to the replacement.
As can be seen if you look at it commit-by-commit, there was a small amount of CVTUTF code in the parser, too. Then of course the parser needed follow-up changes to match the replacement.
I did this based on manual inspection, comparing the code to my re-created history of CVTUTF at https://git.lukeshu.com/2git/cvtutf/ (created by the scripts at https://git.lukeshu.com/2git/cvtutf-make/).
I see that there is now a merge conflict.
I, Luke T. Shumaker, am the sole author of the added code. I did not reference CVTUTF when writing it. I did reference the Unicode standard (15.0.0), the Wikipedia article on UTF-8, and the Wikipedia article on UTF-16. When I saw some tests fail, I did reference the old deleted code (but a JSON-specific part, inherently not based on CVTUTF) to determine that `script_safe` should also escape U+2028 and U+2029.

I targeted simplicity and clarity when writing the code; it can likely be optimized. In my mind, the obvious next optimization is to have it combine contiguous non-escaped characters into just one call to fbuffer_append(), instead of calling fbuffer_append() for each character (see the sketch after this comment).

Regarding the use of the "modern" types `uint32_t`, `uint16_t`, and `bool`:

- ruby.h is guaranteed to give us `uint32_t` and `uint16_t`.
- Since Ruby 3.0.0, ruby.h is guaranteed to give us `bool`... but we support down to Ruby 2.3. However, ruby.h is guaranteed to give us HAVE_STDBOOL_H for the C99 stdbool.h; so use that to include stdbool.h if we can, and if not, fall back to a copy of the same `bool` definition that Ruby 3.0.5 uses with C89.
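A minimal sketch of that batching idea, under assumptions: the generator's `FBuffer` type and an `fbuffer_append(buffer, pointer, length)` call exist, while `character_must_be_escaped()` and `append_escaped()` are hypothetical helpers named here only for illustration.

```c
/* Sketch: flush runs of plain characters with one fbuffer_append()
 * call instead of appending character-by-character.
 * character_must_be_escaped() and append_escaped() are hypothetical. */
static void append_with_batching(FBuffer *out, const char *str, unsigned long len)
{
    unsigned long i, run_start = 0;

    for (i = 0; i < len; i++) {
        if (character_must_be_escaped(str[i])) {
            /* Flush the pending run of unescaped characters in one call. */
            if (i > run_start)
                fbuffer_append(out, str + run_start, i - run_start);
            append_escaped(out, str[i]);
            run_start = i + 1;
        }
    }
    /* Flush the trailing run, if any. */
    if (len > run_start)
        fbuffer_append(out, str + run_start, len - run_start);
}
```

And a sketch of the `bool` shim described in the last bullet; the C89 fallback mirrors the definition Ruby 3.0.x itself ships, but treat the exact macros as illustrative rather than a verbatim copy of the patch:

```c
/* Use C99 <stdbool.h> when ruby.h says it exists (HAVE_STDBOOL_H);
 * otherwise fall back to a C89-compatible definition of bool. */
#ifdef HAVE_STDBOOL_H
# include <stdbool.h>
#else
# ifndef __bool_true_false_are_defined
typedef unsigned char _Bool;
#  define bool  _Bool
#  define true  ((_Bool)+1)
#  define false ((_Bool)+0)
#  define __bool_true_false_are_defined 1
# endif
#endif
```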
Force-pushed from 6e75be6 to 82bfbcf
OK, updated. Sorry that took so long. I'm not sure what changed (gcc, or glibc's allocator, or Ruby's allocator), but I'm not seeing such increased variance in performance anymore. My idea for getting it down (which I've put on the …). This is using the benchmark summaries generated by #599.
Thanks for the new implementation.
From 1998 to 2007, the Unicode Consortium maintained a library called CVTUTF. In 2009, CVTUTF was removed from unicode.org, and the Unicode Consortium said that every version of CVTUTF had bugs, and that folks should use the ICU library instead.
CVTUTF was under a custom license that was not Free under the FSF's definition, not Open Source under the OSI's definition, and not GPL-compatible.
`json/ext` uses code taken-from/based-on CVTUTF. This has caused much consternation among folks who care about any of those 3 things.

So, I removed the code in `json/ext` that is based on CVTUTF, and replaced it with freshly written code. I hope that you'll find my version of `convert_UTF8_to_JSON` to be clearer and more maintainable.

I have not benchmarked it, but I do not expect a significant performance difference. If I had to guess, I'd suspect that my UTF-8 decoder is slightly slower (I use `(val & const) == const` checks in an if/else chain, where I think CVTUTF used a `char[256]` lookup table; see the sketch below), while my JSON encoder is slightly faster (I suspect that by virtue of being simpler, the compiler is better able to optimize it).

Fixes #277
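To illustrate the mask-and-compare decoder style mentioned above, here is a hedged sketch (not the PR's actual code) of classifying a UTF-8 leading byte in an if/else chain, the alternative to a `char[256]` lookup table; `utf8_sequence_length` is a name invented for this example.

```c
/* Sketch: determine a UTF-8 sequence's length from its leading byte
 * with (val & mask) == pattern checks instead of a lookup table. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: U+0000..U+007F    */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: U+0080..U+07FF    */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: U+0800..U+FFFF    */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: U+10000..U+10FFFF */
    return 0; /* continuation byte or invalid lead; caller must reject */
}
```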