This is a special third kind of structure found in Unicode, besides
singletons and ranges.
This dramatically reduces the number of explicit singletons in the
lookup tables.
Also, I changed the awk-script so that it can sort trivial
translations as well, breaking down the LOC even more.
The binary size of tr dropped from 67K to 51K.
Previously, the to*rune function would have to jiggle with two
arrays, and it somehow evaded me that it is actually way simpler
to just add another entry to the arrays if needed.
Binary size goes slightly down, e.g. tr statically linked against
musl: 68072 -> 67688
Behind the scenes though the conversion should be a bit faster and,
more importantly, the scary case-conversion function is simplified
and easier to understand.
It also drops nearly half the LOC in upperrune.c and lowerrune.c.
Interface and function as proposed by cls.
The reasoning behind this function is that cls expressed his
interest to keep memory allocation out of libutf, which is a
very good motive.
This simplifies the function a lot and should also increase the
speed a bit, but the most important factor here is that there's
no malloc anywhere in libutf, making it a lot smaller and more
robust with a smaller attack-surface.
Look at the paste(1) and tr(1) changes for an idiomatic way to
allocate the right amount of space for the Rune-array.
Interface as proposed by cls, but internally rewritten after a few
considerations.
The code is much shorter and to the point, aligning itself with other
standard functions. It should also be much faster, which is not bad.
This optimizes the binary size for each tool that uses these functions.
Previously, if a program just used one single function, maybe even a
one-liner, it would statically compile in all lookup-tables, bloating
the binary by up to 20K.
All these changes are derived from a local libutf where I do the
primary changes. So I hope that I can merge these things into libutf
sooner or later, as discussed on the ml.
This basically means that we now have an autogenerating typecheck
and case-conversion tool.
Don't freak out when you see the added LOC. Given we now have
an additional mapping to the uppercase-characters, some ranges got
"lost" and have to be written literally by the generating awk-script.
The runetypebody.h was generated by myself using my modified version
of mkrunetype.awk and I'll push the changed version as soon as this
has been discussed on the ml.
If you worry about speed, consider, that bsearch is just the right
tool for this job and can even handle a long array like this.
This is a useful behavior if you want to reorder the lines,
because otherwise you might end up with originally two lines
on one, e.g.
$ echo -ne "foo\nbar" | sort
barfoo