Replace all CVTUTF code #567
Conversation
Force-pushed from b9c495e to 6e75be6
@LukeShu Thanks for your proposal. I agree with replacing the current implementation.
What is the plan here? Is it safe to assume you are going to merge this, and therefore that it should be fine to use this patch in Fedora to avoid legal issues?
I haven't reviewed this yet. Don't rush this.
I'm not sure what was up with those large "before" numbers, but now, being more careful about CPU throttling, background processes, and cron jobs, I'm seeing much smaller deltas (generally 1-3µs/op, or a 1-7% improvement), though I am also seeing some higher standard deviations.
For distros still shipping Ruby 2.7 packages, I've backported this to ruby-json 2.3.0 (the version bundled with Ruby 2.7.8). Parabola is currently shipping:
Feel free to grab any of those tarballs or Git tags.
The parser code seems unrelated to the replacement.
As can be seen if you look at it commit-by-commit, there was a small amount of CVTUTF code in the parser, too. Then of course the parser needed follow-up changes to match the replacement.
I did this based on manual inspection, comparing the code to my re-created history of CVTUTF at https://git.lukeshu.com/2git/cvtutf/ (created by the scripts at https://git.lukeshu.com/2git/cvtutf-make/).
I see that there is now a merge conflict.
I, Luke T. Shumaker, am the sole author of the added code. I did not reference CVTUTF when writing it. I did reference the Unicode standard (15.0.0), the Wikipedia article on UTF-8, and the Wikipedia article on UTF-16. When I saw some tests fail, I did reference the old deleted code (but a JSON-specific part, inherently not based on CVTUTF) to determine that `script_safe` should also escape U+2028 and U+2029.

I targeted simplicity and clarity when writing the code; it can likely be optimized. In my mind, the obvious next optimization is to have it combine contiguous non-escaped characters into just one call to fbuffer_append(), instead of calling fbuffer_append() for each character (see the sketch after this comment).

Regarding the use of the "modern" types `uint32_t`, `uint16_t`, and `bool`:

- ruby.h is guaranteed to give us `uint32_t` and `uint16_t`.
- Since Ruby 3.0.0, ruby.h is guaranteed to give us `bool`... but we support down to Ruby 2.3. However, ruby.h is guaranteed to give us HAVE_STDBOOL_H for the C99 stdbool.h; so use that to include stdbool.h if we can, and if not, fall back to a copy of the same `bool` definition that Ruby 3.0.5 uses with C89.
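A minimal sketch of that batching idea, under assumptions: the generator's `FBuffer` type and an `fbuffer_append(buffer, pointer, length)` call exist, while `character_must_be_escaped()` and `append_escaped()` are hypothetical helpers named here only for illustration.

```c
/* Sketch: flush runs of plain characters with one fbuffer_append()
 * call instead of appending character-by-character.
 * character_must_be_escaped() and append_escaped() are hypothetical. */
static void append_with_batching(FBuffer *out, const char *str, unsigned long len)
{
    unsigned long i, run_start = 0;

    for (i = 0; i < len; i++) {
        if (character_must_be_escaped(str[i])) {
            /* Flush the pending run of unescaped characters in one call. */
            if (i > run_start)
                fbuffer_append(out, str + run_start, i - run_start);
            append_escaped(out, str[i]);
            run_start = i + 1;
        }
    }
    /* Flush the trailing run, if any. */
    if (len > run_start)
        fbuffer_append(out, str + run_start, len - run_start);
}
```

And a sketch of the `bool` shim described in the last bullet; the C89 fallback mirrors the definition Ruby 3.0.x itself ships, but treat the exact macros as illustrative rather than a verbatim copy of the patch:

```c
/* Use C99 <stdbool.h> when ruby.h says it exists (HAVE_STDBOOL_H);
 * otherwise fall back to a C89-compatible definition of bool. */
#ifdef HAVE_STDBOOL_H
# include <stdbool.h>
#else
# ifndef __bool_true_false_are_defined
typedef unsigned char _Bool;
#  define bool  _Bool
#  define true  ((_Bool)+1)
#  define false ((_Bool)+0)
#  define __bool_true_false_are_defined 1
# endif
#endif
```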
Force-pushed from 6e75be6 to 82bfbcf
OK, updated. Sorry that took so long. I'm not sure what changed (gcc, or glibc's allocator, or Ruby's allocator), but I'm not seeing such increased variance in performance anymore. My idea for getting it down (which I've put on the …). This is using the benchmark summaries generated by #599.
Thanks for the new implementation.
From 1998 to 2007, the Unicode Consortium maintained a library called CVTUTF. In 2009, CVTUTF was removed from unicode.org, and the Unicode Consortium said that every version of CVTUTF had bugs, and that folks should use the ICU library instead.
CVTUTF was under a custom license that was not Free under the FSF's definition, not Open Source under the OSI's definition, and not GPL-compatible.
`json/ext` uses code taken-from/based-on CVTUTF. This has caused much consternation among folks who care about any of those 3 things.

So, I removed the code in `json/ext` that is based on CVTUTF, and replaced it with freshly written code. I hope that you'll find my version of `convert_UTF8_to_JSON` to be clearer and more maintainable.

I have not benchmarked it, but I do not expect a significant performance difference. If I had to guess, I'd suspect that my UTF-8 decoder is slightly slower (I use `(val & const) == const` checks in an if/else chain, where I think CVTUTF used a `char[256]` lookup table; see the sketch below), while my JSON encoder is slightly faster (I suspect that by virtue of being simpler, the compiler is better able to optimize it).

Fixes #277
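To illustrate the mask-and-compare decoder style mentioned above, here is a hedged sketch (not the PR's actual code) of classifying a UTF-8 leading byte in an if/else chain, the alternative to a `char[256]` lookup table; `utf8_sequence_length` is a name invented for this example.

```c
/* Sketch: determine a UTF-8 sequence's length from its leading byte
 * with (val & mask) == pattern checks instead of a lookup table. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: U+0000..U+007F    */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: U+0080..U+07FF    */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: U+0800..U+FFFF    */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: U+10000..U+10FFFF */
    return 0; /* continuation byte or invalid lead; caller must reject */
}
```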