Previously, the string-length was limited to BUFSIZ, which is an
obvious deficiency.
Now the buffer only needs to be as long as the minimal string
length the user specifies.
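A minimal sketch of the idea, not the actual sbase code: only minlen
bytes are buffered, and once a run reaches that length it is flushed
and the rest of the run is written straight through. The name
print_strings() is hypothetical, and the sketch handles printable
ASCII only (no UTF-8, no t-flag).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: write every run of at least minlen printable
 * ASCII bytes from in to out, one run per line. At most minlen bytes
 * are ever buffered. Returns the number of runs written, or -1 if the
 * buffer cannot be allocated.
 */
long
print_strings(FILE *in, FILE *out, size_t minlen)
{
	char *buf;
	size_t len = 0;  /* bytes currently held in buf */
	int inrun = 0;   /* nonzero once the current run has been flushed */
	long nruns = 0;
	int c;

	if (minlen == 0)
		minlen = 1;
	if (!(buf = malloc(minlen)))
		return -1;

	while ((c = getc(in)) != EOF) {
		if (c < 0x80 && isprint(c)) {
			if (inrun) {
				/* run is already long enough: pass through */
				putc(c, out);
			} else {
				buf[len++] = (char)c;
				if (len == minlen) {
					fwrite(buf, 1, len, out);
					inrun = 1;
					len = 0;
					nruns++;
				}
			}
		} else {
			/* run ended: terminate it if printed, else drop it */
			if (inrun)
				putc('\n', out);
			inrun = 0;
			len = 0;
		}
	}
	if (inrun)
		putc('\n', out);
	free(buf);
	return nruns;
}
```

Calling print_strings(stdin, stdout, 4) would mimic a bare strings
invocation with a minimal length of 4 under these assumptions.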
I added UTF-8-support, because that's how POSIX wants it and there
are cases where you need this. It doesn't add ELF-barf compared to
the previous implementation.
The t-flag is also pretty important for POSIX-compliance, so I added
it.
The only trouble previously was the a-flag, but given that POSIX
leaves undefined what the a-flag actually does, we set it as default
and don't care about parsing ELF-headers, which has already
turned out to be a security issue in GNU binutils[0].
[0]: http://lcamtuf.blogspot.ro/2014/10/psa-dont-run-strings-on-untrusted-files.html
This is a particularly interesting program.
I managed to implement everything according to POSIX except the way
octal escapes are specified in the standard, which differs yet again
from the format demanded for tr(1).
This not only confuses people, it also adds unnecessary cruft
for no real gain.
So in order to be able to use unescape() easily and for consistency,
I used our initial format \o[oo] instead of \0[ooo].
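To illustrate the \o[oo] format, here is a minimal sketch of an
octal-only unescaper. It is not the unescape() from sbase; the name
unescape_octal() and its behaviour for other backslash sequences
(copied unchanged) are assumptions made for this example.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for an unescape routine: rewrites
 * s in place, replacing each backslash followed by one to three octal
 * digits with the byte it encodes (values above 0377 are truncated to
 * their low 8 bits). Everything else is copied unchanged.
 * Returns the length of the result, which may contain NUL bytes.
 */
size_t
unescape_octal(char *s)
{
	char *r = s, *w = s;
	int i, val;

	while (*r) {
		if (r[0] == '\\' && r[1] >= '0' && r[1] <= '7') {
			val = 0;
			r++; /* skip the backslash */
			for (i = 0; i < 3 && *r >= '0' && *r <= '7'; i++)
				val = val * 8 + (*r++ - '0');
			*w++ = (char)(val & 0xff);
		} else {
			*w++ = *r++;
		}
	}
	*w = '\0';
	return (size_t)(w - s);
}
```

With this reading, the input \0123 yields a newline (\012) followed
by a literal 3, whereas the \0[ooo] format would consume all three
digits after the leading zero and yield the single byte 0123 ('S').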
The POSIX specification marks UTF-8 support for %c as optional.
Given how well-developed libutf has become, doing this here was more
or less trivial, putting us yet again ahead of the competition.
and mark it as finished in the README.
Specifically, add a small section on the compression flags, which
are basically an infected GNU limb which should be removed from
the face of the earth as soon as possible.
The algorithm had some areas which had potential for improvement.
This should make cmp(1) faster.
There have been changes to behaviour as well:
1) If argv[0] and argv[1] are the same, cmp(1) returns Same.
2) POSIX specifies the format of the difference-message to be:
"%s %s differ: char %d, line %d\n", file1, file2,
<byte number>, <line number>
However, as cmp(1) operates on bytes, not characters, I changed
it to
"%s %s differ: byte %d, line %d\n", file1, file2,
<byte number>, <line number>
This is one example where the standard just keeps the old format
for backwards-compatibility. As this is harmful, this change
makes sense for the sake of consistency (and because we take
the difference between char and byte very seriously in sbase, as
opposed to GNU coreutils).
The manpage has been annotated, reflecting the second change, and
sections shortened where possible.
Thus I marked cmp(1) as finished in README.
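For illustration, a minimal sketch of how the byte and line numbers
in the message above can be obtained. This is not the cmp(1) code
from sbase; first_difference() and its return values are
hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: compare two streams byte by byte.
 * On return, *byte and *line hold the 1-based position reached.
 * Returns 0 if the streams are identical, 1 if they differ at that
 * position, 2 if one stream ended before the other.
 */
int
first_difference(FILE *a, FILE *b, size_t *byte, size_t *line)
{
	int ca, cb;

	*byte = 1;
	*line = 1;
	for (;; (*byte)++) {
		ca = getc(a);
		cb = getc(b);
		if (ca == cb) {
			if (ca == EOF)
				return 0;
			if (ca == '\n')
				(*line)++;
			continue;
		}
		return (ca == EOF || cb == EOF) ? 2 : 1;
	}
}
```

A caller would print the result with
printf("%s %s differ: byte %zu, line %zu\n", ...), using %zu instead
of %d because the counters are size_t here.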
Use size_t for all counts, fix the manpage and refactor the code.
Here's yet another place where GNU coreutils fail:
sbase:
$ echo "GNU/Turd sucks" | wc -cm
15
coreutils:
$ echo "GNU/Turd sucks" | wc -cm
15 15
Take a bloody guess which behaviour is correct[0].
[0]: http://pubs.opengroup.org/onlinepubs/009604499/utilities/wc.html
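The cited synopsis lists -c and -m as alternatives, so only one of
the two counts is meaningful per invocation. A minimal sketch of how
the two counts differ on UTF-8 input, assuming the input is valid
UTF-8 (count_chars() is a hypothetical name, not an sbase or libutf
function):

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the characters (code points) in the n
 * bytes at s, assuming valid UTF-8. Every byte that is not a
 * continuation byte (10xxxxxx) starts a new character. The byte
 * count (-c) is simply n.
 */
size_t
count_chars(const char *s, size_t n)
{
	size_t i, chars = 0;

	for (i = 0; i < n; i++)
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			chars++;
	return chars;
}
```

For pure ASCII input both counts agree, which is why the example
above shows 15 either way; they diverge as soon as a multibyte
character appears.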
and mark it as finished in the README.
Previously, it would only parse octal mode strings. Given
we have the parsemode()-function in util.h anyway, why not
also use it?
and mark it as finished in the README.
This is another example showing how broken the GNU coreutils are:
$ echo -e "äää\tüüü\tööö" | gnu-expand -t "5,10,20"
äää    üüü    ööö
$ echo -e "äää\tüüü\tööö" | sbase-expand -t "5,10,20"
äää  üüü  ööö
This is due to the fact that they are still not UTF-8-aware and
actually see "ä" as two separate characters, expanding the "äää" with
4 spaces to the tab stop at column 10.
The correct way however is to expand the "äää" with 2 spaces to the
tab stop at column 5.
One can only imagine how this silently breaks a lot of code around
the world.
WHAT WERE THEY THINKING?
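A minimal sketch of column counting that respects UTF-8, assuming
valid input and one column per code point (no wide or combining
characters). This is not the expand(1) code from sbase;
expand_utf8() is a hypothetical helper working on strings instead of
streams.

```c
#include <stddef.h>

/*
 * Hypothetical helper: expand tabs in the UTF-8 string in, writing
 * the result to out (capacity cap bytes, NUL-terminated). stops holds
 * nstops ascending tab stop columns; past the last stop a tab becomes
 * a single space. Columns advance once per code point, not per byte.
 * Returns the number of bytes written, or (size_t)-1 if out is too
 * small.
 */
size_t
expand_utf8(const char *in, const size_t *stops, size_t nstops,
            char *out, size_t cap)
{
	size_t col = 0, len = 0, i, target;

	if (cap == 0)
		return (size_t)-1;
	for (; *in; in++) {
		if (*in == '\t') {
			target = col + 1; /* default: a single space */
			for (i = 0; i < nstops; i++) {
				if (stops[i] > col) {
					target = stops[i];
					break;
				}
			}
			while (col < target) {
				if (len + 1 >= cap)
					return (size_t)-1;
				out[len++] = ' ';
				col++;
			}
		} else {
			if (len + 1 >= cap)
				return (size_t)-1;
			out[len++] = *in;
			if (*in == '\n')
				col = 0;
			else if (((unsigned char)*in & 0xC0) != 0x80)
				col++; /* continuation bytes add no column */
		}
	}
	out[len] = '\0';
	return len;
}
```

With the stops 5, 10 and 20, the three-character "äää" ends at
column 3 and is padded with 2 spaces, matching the behaviour
described above.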
which we are not planning to include into sbase.
What's left to discuss is how we're going to handle them in the
tools (dump usage() or silently ignore them).
Now you can specify a multibyte-delimiter to cut, which should
definitely be possible for the end-user (Fuck POSIX).
Looking at GNU coreutils' cut(1)[0], which basically ignores the difference
between characters and bytes as well as the -n option, and which is bloated
as hell, one has to wonder why they are still the default. This is insane!
Things like this personally keep me motivated to make sbase better
every day.
[0]: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/cut.c;hb=HEAD
NSFW! You have been warned.
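A minimal sketch of field splitting with a multibyte delimiter, not
the cut(1) code from sbase; nth_field() is a hypothetical helper.
Plain strstr() is enough here because a valid UTF-8 sequence can
never match in the middle of another character in valid UTF-8 text.

```c
#include <string.h>

/*
 * Hypothetical helper: locate field n (1-based) of line, where fields
 * are separated by delim, an arbitrary non-empty byte string such as
 * a multibyte UTF-8 character. Stores the field length in *len and
 * returns a pointer to its start, or NULL if there is no such field.
 */
const char *
nth_field(const char *line, const char *delim, size_t n, size_t *len)
{
	size_t dlen = strlen(delim);
	const char *end;

	if (dlen == 0 || n == 0)
		return NULL;
	for (;;) {
		end = strstr(line, delim);
		if (--n == 0) {
			*len = end ? (size_t)(end - line) : strlen(line);
			return line;
		}
		if (!end)
			return NULL;
		line = end + dlen; /* skip the whole delimiter */
	}
}
```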
One major milestone is to have the sbase-tools supporting UTF-8.
Tools like cut(1) with the -n flag don't make sense otherwise.
And while the GNU coreutils cut(1) blatantly ignores such an
important aspect, we will not tolerate this madness and mark it
as a TODO in the main README.
Since most tools inherently support UTF-8 anyway, this just concerns
tools which manipulate text or search in it in special ways.
and mark it as finished in README.
One small rationale on the way the manpage is set up: the coreutils
manpage does not lend itself to use as a quick reference guide,
whereas I wrote this manpage to be short and concise with regard
to the information the advanced user needs.
No one needs to explain what an octal number is. That's not part of
the scope of this manpage.
Also, nobody wants to read a block of text just to find out how
to build an octal mode string.
to mark tools considered finished.
Finished doesn't mean work has stopped on these, but that these
programs are in a satisfying state according to the current suckless
coding practices. This includes having
1) a mandoc manpage
2) clean code
In most cases, 1) was the failing criterion. So in the interest of
finishing more tools, well-written mandoc manpages are very much
appreciated if you want to contribute.
Get rid of it for now as it is not really widely used. We can do
a simple implementation when the time comes.
Remove the table from README because it is not easy to edit unless
you use emacs.
We seem to have problems building individual tools across various
make implementations. If anyone can step up and fix this we will
remove the dependency on GNU make.