Fixing C strings

8

spc476 4 months ago

A downside of using struct str everywhere is that the compiler can’t check printf strings. … But this only works on functions that have the same signature as printf so it doesn’t work on my implementation. Overall, I think this is an acceptable tradeoff because format strings are easier to reason about than arbitrary code and all possible issues are localized in calls to print functions.

At $PREVIOUS_JOB, a team member left for employment elsewhere, and I was tasked with taking over his code. He had needlessly wrapped syslog() (with a function called Syslog() with the same parameters, and our code would only ever run on Unix). When I removed the wrapper and fixed all the call sites (easy enough), boy, did I find errors. The compiler knew to check the format string for syslog(), but not of the wrapper. [1] I’m a proponent of having the compiler find bugs so you don’t have to.

[1] It didn’t help things that all the format strings were #defined elsewhere in the code, so all I initially saw was:

Syslog(BAD_MESSAGE37,x,foo->bar);

16
waddlesplash edited 4 months ago
While in your particular case it obviously made sense to just drop the wrapper and use the real method, if one ever does want the compiler to check printf-format strings that are being passed to a function not named printf, you can use __attribute__ ((format (__printf__, (a), (b)))). For example:
```
extern void debug_printf(const char *format, ...) __attribute__ ((format (__printf__, 1, 2)));
```
declares a function debug_printf that takes a printf format string in argument #1 and the parameters to it starting in argument #2. So, both in the case you describe here as well as the one in the article, this attribute could be used to eliminate the “tradeoff” and keep the compiler type checking.

There’s also another, related attribute called format_arg which tells the compiler that any string passed to the specified argument will cause a “similar” format string to be returned. The GCC documentation notes this can be used for translations: e.g. if one declared a translate function this way, then
```
printf(translate("There are %d objects."), count);
```
will result in the compiler checking the string inside the call to translate against the printf invocation.
1. 3
  
  spc476 edited 4 months ago
  
  True, and I knew about that, but not the other developer. Also, while we did use GCC (on Linux) and clang (on Mac OS-X), we also had to compile for Solaris using their native C compiler. I don’t think it supports __attribute__().
  
  Edited to add: yes, I know, conditional compilation and #defines can work around this. I just wanted to mention that not all C compilers support __attribute()__.

6

spc476 4 months ago

And a bit unrelated to the post, but this statement:

For my current project, correctness and safety have higher priority than performance

and this code sample:

struct result com_write(u16 port, struct str str)
{
    if (!str.dat || str.len <= 0)
        return result_error(EINVAL);

    while (str.len--) {
        while (!(inb(port + OFFSET_LINE_STATUS) & LINE_STATUS_TX_READY))
            ;
        outb(port, *str.dat++);
    }

    return result_ok();
}

I know, defensive coding is still in style, but it hides bugs! Why are you passing a NULL string? How can the size be less than 0? (oh, the sz field is signed? Okay, I might not agree with that, but whatever). Better (in my opinion) is this:

void com_write(u16 port, struct str str)
{
    assert(port    == VALID_PORT);
    assert(str.dat != NULL);
    assert(str.len >= 0);

    while (str.len--) {
        while (!(inb(port + OFFSET_LINE_STATUS) & LINE_STATUS_TX_READY))
            ;
        outb(port, *str.dat++);
    }
}

With that, any call with what I’m calling “an illegal string” (NULL, or the length is negative) is caught and you can track down why the function was called with a bad string. And this checks that the port is a valid port (well, the assert() might be different, but it conveys the idea). Now that’s there no error possible, you don’t have to return a status code. [1]

[1] One other possible change—if you can’t write to the port in a given amount of time, return a timeout value.

6

robn 4 months ago

Standard asserts are compiled out of production code, so your version has no checks at all. But even if you didn’t compile them out, their usual action is to crash the program. And this post is describing an embedded system, so who knows if that crash can even yield useful information. And what does that mean in production. Does the machine we’re controlling explode?

The author’s version has the same basic runtime checks, and returns error if they’re wrong. The caller can then take appropriate action, maybe logging something, or drop to a diagnostic mode, or enter a fail safe state.

I like asserts, use ‘em all the time. They don’t make errors impossible though; we’re just making a choice about what happens next.
1. 6
  
  spc476 4 months ago
  
  Any my point still stands––how is an invalid string being passed to the routine? That’s a bug. Checking and returning an error is, in my opinion, hiding the bug. If I had to argue, okay, keep the asserts, and add back in the error check, waste the space for code that might never be called. The code is tested, right?
  1. 11
    
    robn 4 months ago
    
    “How is an invalid string being passed to the routine?” is exactly the right question. In this case, I don’t know either, because I don’t know what context this function is operating in.
    
    In your version, with assertions compiled out, this function crashes if called with str.dat is NULL when str.len is non-zero. With assertions, it also crashes, just slightly earlier.
    
    That leads to at least two follow on questions.
    
    Is it possible for this function to be called with NULL? I have no idea in this case, because I don’t know the surrounding program. If all inputs are hardcoded, or generated inside the program, or validated first, then maybe not!
    
    Is a crash the best way of responding? If you have crash reporting/logging infrastructure, perhaps it is. But still, if this was part of a loop processing discrete units of incoming work, and that loop already has some error response facility, perhaps its better to record the error and move on to the next unit of work.
    
    I work in filesystems in the kernel. Every bit of data coming through may be tainted. Even stuff I read back off disk that I wrote there myself may be damaged. Pretty much every codepath has the ability to report an error to someone. When my code crashes, it takes your entire machine down with it. So it’s nearly always better for me to err on the side of a graceful error return even for an “impossible” situation.
    
    Neither method does or doesn’t hide bugs. It’s entirely how to respond to it. Your approach might be right in this particular situation. I have no idea. I don’t see how you can either. Given that, I think its a bold call to say the error checking is unnecessary.

6

Cloudef 4 months ago

I don’t write much C anymore since the introduction of Zig language. But I prefer the SDS approach to C strings as they are compatible with C strings [1]

1: https://github.com/antirez/sds

5

kornel 4 months ago

This could have been so much simpler and more ergonomic if C had a slice type.

12

snej 4 months ago

Just STOP USING C for chrissake. Especially if “ correctness and safety have higher priority than performance.” Every other language has better/safer string types.

16
kivikakk edited 4 months ago
Sometimes you have to.

edit: pre-empting the usual things:
- The author is working an a bare-metal environment. Do you know there’s a language that’s going to be completely appropriate for their use case?
- Let’s say there is! (It looks like x86-64, but knowing nothing about the rest of the environment, we can’t actually say for sure.) But maybe at work they deal with use cases where there isn’t. Are they not allowed to practice their chops?
- And if they are, are they not allowed to write about it publicly? This kind of response is shit. Do better.
1. 9
  
  olegkovalov 4 months ago
  
  Massive upvote for this wise comment. Shouting „just use another language” is a dumb advice in 99% cases.
2. 5
  
  snej 4 months ago
  
  Fair enough. I was in a grumpy mood last night and someone should have taken my iPad away.
  
  But apropos your first point: Yes, C++. Arduino, one of the most popular SDKs for bare metal environments all the way down to 8-bit microcontrollers, uses a subset of C++, they just don’t call it that. It’s pretty easy to build an ergonomic and safe slice-like string class in C++, even without any of the rest of the STL.
  1. 2
    
    kivikakk 4 months ago
    
    Frankly, so was I! Sorry. Life is just hard lately.
    
    (And apropos your apropos: we can’t know it’s appropriate for their precise use case, even if it might appear to us to be! And like I said in a later comment, there’s plenty of times when it’s just adding complexity — not adding C++ to a FreeRTOS code base etc etc. Sometimes we just want to get on with it.)
3. 5
  
  david_chisnall 4 months ago
  
  Even C++ is better than C because it lets you build abstractions. There are very few embedded platforms that have a C compiler but not a C++ one (PIC24, for example, thought their C compiler isn’t really C either, it’s a weird C-like dialect). In most cases, where you have a working C toolchain, you also have a working C++ toolchain. And that lets you create a string type where any operation that modifies the length always modifies the field that tracks the length.
  
  Having written an RTOS in C++ for small embedded devices (we need a minimum of around 10 KiB of code + data memory, more practically 64 KiB if you want to do useful things and 256 KiB if you want a network stack with TLS), I would absolutely not use C for any low-level task.
  
  If you have a large C codebase, you can incrementally move it to C++. Start using smart pointers for ownership, for example, and use .get() for interoperability with existing APIs. You can define a struct where the method are all wrapped in #ifdef __cplusplus and the layout is compatible between both languages and use them directly in the C and via methods in C++ and gradually reduce the use of C. You can make this conversion incrementally, by just changing individual files from C to C++ and adding the necessary casts, then adopting C++ idioms per function.
  1. 4
    
    nikki93 edited 4 months ago
    
    TBH, the way that C++ idioms tend to hide what’s going on has made things less clear and caused issues in my experience (looking at things that came up in a team environment on large projects). One of the gotchas that was run into is how member destructors fire in reverse order, but move constructors fire in forward order (so interdependencies between fields were causing issues in the move case because they were implemented assuming reverse destructor calls). The developer involved ended up just giving up and using manual new / delete because you can at least read and see what it’s (supposed to be) doing. There’s various little things like that. And in theory “safer abstractions” can be built but in practice you get things like std::string_view.
    
    I think the better design than C++ would be C with linear types with explicit consume calls (a better way to make sure you don’t forget to free or double free, vs. implicit hidden frees from destructors) and with pointer arithmetic and raw array index etc. disallowed except in functions / blocks marked trusted / unchecked. Because of the amount of decision overhead you get having to figure out which of the C++ many ways to do something you should use, it overall feels like C’s design is actually better even without the improvements I mentioned…
    
    Beyond those things what I’d like to see is actual functional correctness mechanisms – I do think that the Dafny style of assertions, function contracts and predicate transformers fits the imperative C style well, but that design space is definitely quite open and will be interesting to explore.
  2. 2
    
    dfawcus 4 months ago
    
    The problem with that is that in a corporate environment, there are a lot of folks who believe they know C++ when they don’t, and there are not tools in place to enforce “safe and sensible” subsets.
    
    Given the sharp tool that is accidental unrestricted use of any of C++, these folks who are able to mess up with C, and make an even bigger mess using C++.
    
    So in that scenario, sticking with C is actually safer.
    
    One can create abstractions, it is just that one can not usually make the compiler check them fully. One may gain that ability to check stuff with C++, but the downside is often too great.
  3. kivikakk 4 months ago
    
    [Comment removed by author]
4. 3
  
  mort 4 months ago
  
  And maybe the situation is such that they could use another language, and the situation at work is such that they could have used another language, but they still use C at work because it’s a huge old project and rewriting it all in Rust is a monumental investment that they haven’t done yet.
  
  In short, I agree.

2

jrwren 4 months ago

2

malxau 4 months ago

In all the code I wrote, there is not at single null-terminated string

I’ve been doing similar things, but the issue is calling into the OS when it expects a null-terminated string. That leads to two classes of strings: length counted string with a null terminator for the OS, and length counted strings without a null terminator that can be used internally within the program. That distinction ends up being the footgun.

3

mikejsavage edited 4 months ago

In my experience the only place where this happens is interacting with the filesystem, which is already slow so it’s fine to copy into a static 64k buffer/allocate a new string and null terminate at the boundary and use spanny strings everywhere else.
2

dfawcus edited 4 months ago

A common approach with such counted string schemes, is to allocate the buffer 1 larger than required, and always ensure there is a ‘\0’ 1 beyond the length.

Then there is only one type of string, always having a ‘\0’. While it works for output, it isn’t so useful for input.

(I vaguely recall that DJB’s version of this scheme, as used in qmail, actually shoved a ‘z’ after the counted length)