all 45 comments

[–]SonOfMrSpock 30 points31 points  (26 children)

Don't we call these Pascal strings?

[–]TwoIsAClue 29 points30 points  (23 children)

Pascal strings literally are what made C strings look like something not completely insane in the 70s; 255 is not an acceptable maximum length.

What we really needed was to bite the bullet and accept the complexity of a dedicated type with a variable-size header. Is the first byte 0? Empty string, as short as you like. Is the first bit 0? Your string starts at byte 1 and it's up to 127 bytes long. Otherwise the first 4 bytes after zeroing the first bit are the length and the rest of your string follows.

[–]PeaSlight6601 5 points6 points  (4 children)

Why not use the same kind of encoding as we do for UTF-8 to encode the length?

[–]MokoshHydro 11 points12 points  (3 children)

Performance.

[–]PeaSlight6601 4 points5 points  (2 children)

Seems a bit silly.

For one, you need a performant UTF-8 decoder anyway to actually handle the strings, and secondly, chasing small performance gains is how we got into this situation in the first place.

It seems better in my mind to use the UTF-8 approach everywhere and hope that CPUs implement special instructions to decode it.

[–]MokoshHydro 8 points9 points  (1 child)

UTF-8 is great for exchange. When we need to perform individual character operations, it becomes a nightmare. That's not a "small penalty". The CPU won't help much, because we'll get alignment issues, etc. Lemire made some low-level optimizations for UTF-8 using SIMD.

Basically, string implementation is always a tradeoff between fast concatenation and fast symbol iteration.

[–]PeaSlight6601 3 points4 points  (0 children)

But knowing a string's length is often about exchange: how much do I need to copy from one buffer to another?

I think part of the problem here is that C strings mean two different things: exchange between libraries and programs running in the same address space, and operations within a single program in a single address space.

If you are in a single address space and want to use a wide format, then fine: just use a simple struct of length and buffer. But in practice everything tends to leak out of your address space in time, so you want those strings to be UTF-8 in a lot of instances.

If the string itself is UTF-8, I don't see why it's so bad for the length to be encoded that way as well.

[–]SonOfMrSpock 2 points3 points  (0 children)

I was just surprised there was no mention of Pascal strings in the article.

Sure, 255 was not enough, but Delphi 2 had "long Pascal strings" with a 32-bit size header in 1996. AFAIK, there is no support for any type of Pascal string, or your variable-size-header strings, in any C standard to this day. IDK why.

[–]alphaglosined 2 points3 points  (4 children)

No need for a variable-size header; it makes things harder, especially for C as an ABI.

A pair of length and pointer, a slice, is all that is needed. It uses struct ABI conventions.

Sadly, none of the many proposals, even ones from the early '90s, ever got accepted.

[–]TwoIsAClue 4 points5 points  (3 children)

The thing is that in the '70s, when C was created, memory was so terribly limited that using 4 bytes (assuming a length field of 2, just to put a number on the board; it might've been even more) for every string was problematic. Null-terminated strings weren't even conceived in C; they came from the assembly world of the time.

Of course, a couple of Moore's law iterations later the length + pointer to data was the obvious solution in all but the most pathological of cases, but by then C and its strings were everywhere already.

This is just an example of why we should strive to leave this technology behind if at all possible; a lot of the design choices in C made sense in the '70s but are terrible ideas now.

[–]alphaglosined 2 points3 points  (2 children)

Yeah historically it made sense.

But we don't need to drop C to get over this limitation, we just have to add slices to C.

And the reason it hasn't been done is entirely political at this point.

[–]TwoIsAClue 0 points1 point  (1 child)

In my opinion we absolutely need to drop C in every serious context ASAP outside of legacy projects/architectures and FFIs.

The elephant in the room is memory safety, but nowadays C is missing a whole load of basic and obviously useful features, is weighted down by the need to keep backwards compatibility, and a lot of the choices made in its design haven't endured the test of time.

FWIW, I really don't think C strings have stuck around because of politics. It very likely is the result of people following the path of least resistance and using what's already available rather than making their own incomplete replacement that cannot be used anywhere else.

[–]alphaglosined 4 points5 points  (0 children)

Slices have had multiple proposals made for C over the years.

I am serious when I say it is entirely political that C hasn't got them.

Note the dates.

https://www.bell-labs.com/usr/dmr/www/vararray.pdf

https://digitalmars.com/articles/C-biggest-mistake.html

[–]evaned 3 points4 points  (11 children)

What we really needed was to bite the bullet and accept the complexity of a dedicated type with a variable-size header. Is the first byte 0? Empty string, as short as you like. Is the first bit 0? Your string starts at byte 1 and it's up to 127 bytes long. Otherwise the first 4 bytes after zeroing the first bit are the length and the rest of your string follows.

My gut reaction to this is that it's probably beneficial to have the size of the string alongside the pointer, in a larger string "object", rather than in the header of the string data itself.

Modern string implementations in performance-centered languages do this, alongside the small-string optimization, which makes that almost a no-brainer anyway.

[–]Ameisen 3 points4 points  (8 children)

So, there is a downside to this, which is related to the fact that these two definitions actually define subtly different things:

static const char* const foo = "bar";

static const char foo[] = "bar";

If you have a 4, 8, 12, or 16 byte object defining a size and pointer (effectively C++'s std::string_view):

  • It is a performance malus with ABIs like Win64 that won't pass that structure in registers.
  • It requires an additional dereference to get to the character data.
  • It is mainly useful for APIs where the string may come from anywhere - within a library or application, storing the size with your character data is almost always better, but that complicates things if you need to support both a view and an inline string - worst case, the inline strings get converted to views anyway.
  • Extending the previous, inline strings have less size overhead, and potentially less performance overhead. You read the size, then the character data is sequentially after it: incredibly cache-friendly.
  • Can end up larger than expected due to alignment issues.
  • It effectively doubles the overhead associated with potentially having allocated a char array.

In my low-alloc low-latency APIs, I support both string structures and inline strings and use C++ templates and enable_if/concept to simplify implementation, with special handling for rvalue refs.

[–]XtremeGoose 2 points3 points  (3 children)

We're talking about designing C, right? This is the '60s; Win64 doesn't exist yet, so I'd imagine calling conventions would have been designed around fat pointers in this alternative history.

I don't think cache locality matters if you're, say, iterating over the string because you'd almost certainly just hold that in a register. And lots of operations on strings only care about the length, not the actual data so you'll be avoiding indirection in lots of cases.

There's a reason modern string types are designed this way.

[–]evaned 0 points1 point  (1 child)

We're talking about designing C, right? This is the '60s; Win64 doesn't exist yet

I'd kind of say... "no", actually.

TFA is talking about the approach that he's taking with the project that he's writing now, without (at least direct) use of the cstdlib string functions. His choices are not bound by what C chose to do in the '60s, and Win64 does exist now.

[–]Ameisen -1 points0 points  (0 children)

Looking at the PDP-7 architecture, I feel like using a NUL char to indicate the end of a string was absolutely the right decision. They could have possibly gotten away with an 18-bit length, but the ISA really doesn't make doing that 'nice'. It doesn't have GPRs, and this would all largely be handled directly via memory access (cycle time was 1.75 μs). You really don't want to have been fetching a different address/offset for every loop iteration in this case, as that would halve the operation's speed, if not worse (as reading character data and reading length data would have been competing for the memory buffer register).

[–]Ameisen 0 points1 point  (0 children)

We're talking about designing C, right? This is the '60s; Win64 doesn't exist yet, so I'd imagine calling conventions would have been designed around fat pointers in this alternative history.

I mean, we can knock ourselves out:

I could ask some people I know about the specifics of how the PDP-7 was used for procedure calls.

It doesn't have many instructions, and it only has a handful of registers - and they're not general-purpose registers (I believe that the PDP-11 introduced those to the PDP line).

If you look at the PDP-7's architecture, it makes sense that they used a NUL to indicate the end of a string, since that's largely how the CPU worked.

I don't think cache locality matters if you're, say, iterating over the string because you'd almost certainly just hold that in a register.

How do you think that the data ends up in the register (and it almost certainly is not in a register, at least on x86)?

The length value may be in-register depending on ABI, as may the data pointer. You will have to read from memory to actually access the data pointer's data, though. On x86, that includes some fun L1/L2 cache interactions.

There's a reason modern string types are designed this way.

Because it's the ideal layout for general-purpose mutable strings. There are cases where other layouts have benefits.

And lots of operations on strings only care about the length, not the actual data so you'll be avoiding indirection in lots of cases.

Depending on use-case, you either have more, the same number, or fewer indirections.

If you're only using length, "fat" strings are better, though even better is to then adopt an SoA architecture (say, for interned string storage).

[–]evaned 1 point2 points  (3 children)

It is a performance malus with ABIs like Win64 that won't pass that structure in registers.

I will admit this is a downside... but my response to that would be to use a better ABI. If you're defining your own string type anyway, then you've got control over the calling convention of the functions that use it.

It requires an additional dereference to get to the character data.

I wonder if we're talking past each other here, because it's not true for the picture I have in my head.

In fact, it saves a dereference when all you need is the size, though admittedly I'm not convinced that has enough value on its own to make a difference, as long as you can tell empty from non-empty without a dereference.

(I do strongly suspect that empty/non-empty without a dereference is important enough that it would justify "my" design, but if you use a null pointer to represent an empty string then you don't need "my" design to achieve that goal.)

I will point out though that something like a += b to destructively append a string will, barring a reallocation, do the above: use the size of a without reading the contents at the start of the string a. That operation would benefit from "my" design.

You read the size, then the character data is sequentially after it: incredibly cache-friendly.

I claim that cache friendliness actually goes to "my" design, to the extent "you can access the size without reading the initial part of the string data" has practical value. (And to the extent it doesn't, they're basically equivalent.) You need the pointer in either case, and with the size next to the pointer it will be readily available in "my" design as well.

I will point out that every current C++ standard library implementation uses a design similar to what I describe: the string object itself holds not just the pointer-to-data but also the size and current buffer capacity, and that object doubles as an SSO buffer. libstdc++ even moved to that design, away from one that has the string object itself as just a pointer pointing to a length-prefixed buffer (length and capacity). (They dropped COW at the same time, but I think that design choice is largely orthogonal.) From what I can tell, Rust's String object is basically the same as that, though with no SSO (which surprises me). I'm very inclined to believe that with the engineering effort that goes into these performance-oriented languages, they have reasonably good evidence that a fat object is a better design for your default, go-to string type vs. a pointer to a struct.

[–]Ameisen 0 points1 point  (2 children)

I will admit this is a downside... but my response to that would be to use a better ABI. If you're defining your own string type anyway, then you've got control over the calling convention of the functions that use it.

MSVC has no ABI you can specify for this case. Both Win64 and vectorcall will push this on the stack. There's no reliable way to avoid the stack on Win64. Maybe if you lie to it and make the compiler think that it's a vector value, it will pass it in an XMM register...? Only with vectorcall though.

This is a known problem: https://quuxplusone.github.io/blog/2021/11/19/string-view-by-value-ps/

LTCG/LTO may avoid this problem if the function ends up inlined, or if the function is defined in a header.

In fact, it saves a dereference when all you need is the size

Only if the structure is already in a register.

My design

Your design is largely identical to most string views, including most implementations of std::string_view and Unreal's FStringView.

I will point out though that something like a += b to destructively append a string will, barring a reallocation, do the above: use the size of a without reading the contents at the start of the string a. That operation would benefit from "my" design.

It will then dereference the pointer regardless in order to write b to it, or at least &a[len], which may have already been in a cache line.

I will point out that every current C++ standard library implementation uses a design similar to what I describe

Yes, as I said, though your design matches string_view more than string, though it is a mutable one.

They dropped COW at the same time, but I think that design choice is largely orthogonal.

The standard doesn't specify how std::string must be implemented.

Copy-on-write was indirectly forbidden due to changes in required iterator semantics, namely changing when they are allowed to be invalidated. Copy-on-write implementations generally violate that, as non-const operator[] would invalidate the string. This was largely changed to better-support concurrency (and as neither non-const operator[], nor the other functions they specified, have a way to know if the data is being mutated). There are also concurrency benefits to mutable, non-CoW "fat" strings.

Their design change in libstdc++ was absolutely due to the iterator invalidation changes. This also resulted in the Committee now being afraid to break the (non-specified) ABI.

As far as I can tell, the standard doesn't prohibit the string being inline, though I'd need to investigate further.

they have reasonably good evidence that a fat object is a better design for your default, go-to string type vs. a pointer to a struct.

For a mutable string, inline strings aren't really advantageous - their benefit comes as owned, immutable strings.

Especially if you're passing it by reference anyways - then it's effectively free to access relative to containing a pointer.

In the end, it all depends on access patterns, though. An array of inline strings (yes, an array of these is complicated, as traversing requires the size - it's not quite an array) where all you do is check the size is very cache-unfriendly, though you could always have a separate, matching array of sizes for just that purpose (effectively mimicking SoA).

On the flipside, if I am using the character data, "fat" strings jump me all over memory since those pointers are highly-unlikely to be sequential.

Inline strings, and fancy collections of them, are not rare in systems where you are storing immutable strings (like interning).

[–]evaned 1 point2 points  (1 child)

MSVC has no ABI you can specify for this case. Both Win64 and vectorcall will push this on the stack. There's no reliable way to avoid the stack on Win64. Maybe if you lie to it and make the compiler think that it's a vector value, it will pass it in an XMM register...?

I did make that work, and you could do something like pack a pointer+size into an __m128i (or add capacity, and/or SSO, in a second __m128i)... but you then need a few other SSE instructions to extract the relevant values and such. I suspect this winds up not worth it.

That is unfortunate, and I didn't realize MSVC wouldn't have a better way to handle it. I'm pretty surprised by that, actually. (Not surprised it's not default, because Microsoft, but I would have thought this was an important enough thing to get.)

(I will say in my defense that TFA is talking about writing an embedded project; I suspect they're not using MSVC.)

In fact, it saves a dereference when all you need is the size

Only if the structure is already in a register.

OK, fair enough, but it would be in L1 cache, which (at least on x86 derivatives) has pretty comparable speed to registers, by my understanding.

(And if you're using GCC or Clang, it may well be in a register.)

If you're in a situation where the object itself (handle, pointer, whatever) isn't in L1, then no design is going to help you.

My [sic] design

Your design is largely identical to most string views, including most implementations of std::string_view and Unreal's FStringView.

... and the one in TFA.

Yes, I realize that. That's why I put scare quotes around "my" every time I said it, scare quotes you dishonestly omitted from that "quote" for some reason.

I will point out though that something like a += b to destructively append a string will, barring a reallocation, do the above: use the size of a without reading the contents at the start of the string a. That operation would benefit from "my" design.

It will then dereference the pointer regardless in order to write b to it, or at least &a[len], which may have already been in a cache line.

I'm not sure what you're trying to get at here. With "my" design, it wouldn't need to read or write right at the pointer's address, only at the end. Because the new size isn't stored at the pointer.

(And we're back in the design we had before -- if your string object itself is out of cache/registers, then you've lost anyway.)

The standard doesn't specify how std::string must be implemented.

I would argue that fact makes the observation that three different teams (GNU, LLVM, Microsoft/Dinkumware) all converged on a similar fat-object design (over a pointer to a structure) even more compelling that that's the "right" option for a generic string data structure.

I will point out that every current C++ standard library implementation uses a design similar to what I describe

Yes, as I said, though your design matches string_view more than string, though is a mutable one.

Even in my first comment I was talking about SSO, which doesn't match string_view, and in the followup I mentioned Rust's String which is the buffer-owning string object as well as std::string.

I will admit though that I've been playing fast and loose with whether capacity is present in the fat object... but that's because I don't view it as super important or interesting. I'd say my arguments apply in both cases.

In the end, it all depends on access patterns, though.

Sure, I don't claim that "my" design is going to be better in all circumstances. But at the same time, if you're writing a generic string data structure you have to make some choice that hopefully represents the best tradeoffs across a wide range of access patterns.

My claim is that I think the evidence points toward that being a fat object, not a pointer to a structure.

[–]Ameisen -1 points0 points  (0 children)

I did make that work, and you could do something like pack a pointer+size into an _m128i ...

You can force the compiler to generate the instructions itself like so, though the GPR/MEM->SIMD->GPR conversions are likely worse than using the stack - particularly for view creation.

unsigned __int64 get_length(stringy_view) PROC       ; get_length, COMDAT
        movq    rax, xmm0
        ret     0
unsigned __int64 get_length(stringy_view) ENDP       ; get_length

$T1 = 0
__$ReturnUdt$ = 32
data$ = 40
length$ = 48
stringy_view get_view(char *,unsigned __int64) PROC       ; get_view, COMDAT
$LN6:
        sub     rsp, 24
        mov     QWORD PTR $T1[rsp], r8
        mov     rax, rcx
        mov     QWORD PTR $T1[rsp+8], rdx
        movups  xmm0, XMMWORD PTR $T1[rsp]
        movups  XMMWORD PTR [rcx], xmm0
        add     rsp, 24
        ret     0
stringy_view get_view(char *,unsigned __int64) ENDP       ; get_view

The ideal would be to get it passed in two GPRs, but I cannot think of a way to do that automatically.

I will say in my defense that TFA is talking about writing an embedded project; I suspect they're not using MSVC.

I wrote and compiled an entire bootloader (multiboot and EFI) and a small kernel using MSVC :/. Was a PITA, though.

have pretty comparable speed to registers by my understanding.

L1 is at least 2-4× slower than a register, and can be worse than that depending on circumstances.

Note that the L1 cache is also susceptible to things like false sharing, where registers are not.

I take advantage of the L1 cache in certain systems (VeMIPS assumes that the 32 32-bit register file fits neatly into a single L1 cache line).

Since this is on the stack, it is indeed very likely to be resident in-cache, though.

Yes, I realize that. That's why I put scare quotes around "my" every time I said it, scare quotes you dishonestly omitted from that "quote" for some reason.

Because I dishonestly wrote my entire comment on my phone with an arm suffering from dishonestly severe tendonitis, and the Reddit mobile app on Android is dishonestly terrible.

I'm not sure what you're trying to get at here. With "my" design, it wouldn't need to read or write right at the pointer's address, only at the end. Because the new size isn't stored at the pointer.

That only matters if the string is longer than 64 bytes (the normal cache line size on x86). The cache operates at that granularity and alignment.

So, if it's <= 60 bytes, then size+data fits into a single cache line. Otherwise, it does not. In yours, it has to make sure that size is in-cache (if it's on the stack, then it is likely within a cache line or two of the other stack variables, depending on offset). It will then need to make sure that the cache lines straddling where you're writing are in-cache, as x86 at least (x86 has a... unique cache architecture) writes to the L1 and L2 cache and then propagates that write to memory later (unless you specify that you want non-temporal stores using, say, movnti - there's no guarantee that this is a benefit, though, depending again on usage/access patterns). This should still be pretty fast, but it's still faster if all the data is already present.

This is largely a very, very-reduced version of arrays-of-structs vs structs-of-arrays.

And we're back in the design we had before -- if your string object itself is out of cache/registers, then you've lost anyway.

Yup. I'm just in the habit of designing specifically for that case or for the specific case that it's not - but the vast majority of strings that I encounter fit neatly into a cache line with 4 bytes of size.

I would argue that fact makes the observation that three different teams (GNU, LLVM, Microsoft/Dinkumware) all converged on a similar fat-object design (over a pointer to a structure) even more compelling that that's the "right" option for a generic string data structure.

Well, yes. It's ideal for general-case mutable strings. There are very specific circumstances - that happen to be more common in certain fields - where inline strings are more optimal.

For immutable strings that own their data, it's significantly more tricky. The general case is probably still 'good enough', but it really depends on what exactly you're doing.

C++ doesn't have an immutable string type with ownership semantics.

Even in my first comment I was talking about SSO

I apologize, I must have missed that, or I got mixed up with the general discussion in this post where people were largely posting C structs that effectively were string views.

My claim is that I think the evidence points toward that being a fat object, not a pointer to a structure.

I don't disagree, I just want to point out the cases where it's not optimal. They're more common than you'd expect.

For a full mutable string, the ABI issue doesn't matter as people pass those by const reference anyways. It's mainly an issue for views - which is why I was approaching the issue from the view standpoint, and also from the immutable string standpoint.


ED: I just realized that I forgot to mark get_view as __vectorcall. That improves the resultant assembly somewhat:

$T1 = 0
data$ = 32
length$ = 40
stringy_view get_view(char *,unsigned __int64) PROC       ; get_view, COMDAT
$LN6:
        sub     rsp, 24
        mov     QWORD PTR $T1[rsp], rdx
        mov     QWORD PTR $T1[rsp+8], rcx
        movups  xmm0, XMMWORD PTR $T1[rsp]
        add     rsp, 24
        ret     0
stringy_view get_view(char *,unsigned __int64) ENDP       ; get_view

[–]PeaSlight6601 0 points1 point  (1 child)

The one benefit of doing it this way is that only a single object needs to be transferred between libraries and functions.

If you are packing many strings into a single table, it may make more sense to separate the metadata and the data like that.

[–]evaned 1 point2 points  (0 children)

The one benefit of doing it this way is that only a single object needs to be transferred between libraries and functions.

I mean, that's not the only benefit... it wasn't even the benefit I was considering, which was cache locality.

I'm admittedly not sold on this when it comes to size, as long as you can distinguish empty from non-empty without a dereference. But SSO is generally helpful, and unless you want SSO to apply only to extremely short strings (no more than 8 bytes), then you can overlap SSO storage with the size field at no memory cost.

[–]nerd4code 0 points1 point  (1 child)

No, Pascal strings are more sensible, because the length is inline with the string data; unfortunately in the standard case this limits you to 255 chars per string.

In C, you’d preferably

typedef size_t Str_Len;
typedef struct {
    Str_Len len;
    char c[];
} Str;

using a flexible array member for Pascal format.

[–]SonOfMrSpock 2 points3 points  (0 children)

Well, 255-char Pascal strings are history; they still exist, but only for backward compatibility. Delphi/Free Pascal have had long strings for at least two decades.

[–]zhivago 3 points4 points  (0 children)

The actual safety improvement here is to make strings immutable and well formed.

Although it's a bit embarrassing that they couldn't figure out const properly and have to rely on documentation for this. :)

The rest is effectively mandatory memoization.

[–]matthewt 0 points1 point  (0 children)

My first experience with such an implementation was djb's substdio as used in qmail.

I am terribad at writing C, and patching qmail (a couple decades ago now) is probably the only time I've worked on C code and spent more time dealing with logic bugs than segfault bugs (all self-inflicted in both cases, mind; see "terribad" ;).

[–]jacobb11 0 points1 point  (8 children)

You don't mention a string deallocation function. I suspect that's because one cannot be correctly implemented given the definition of STR, but either way it's a pretty glaring omission.

[–]pseudomonica 2 points3 points  (6 children)

I mean, only str_buf is owning, and as long as you're keeping a pointer to the arena around, that should be trivial to deallocate

[–]jacobb11 0 points1 point  (5 children)

only str_buf is owning

I don't think so. That's possible because the memory management is unspecified, but... When the library copies a str_buf into a str it probably copies the characters so that further changes to the str_buf do not affect the str.

[–]pseudomonica 2 points3 points  (4 children)

If this was inspired by rust, then str in this library represents a &str — just one that’s mutable, for whatever reason

[–]jacobb11 -1 points0 points  (3 children)

Possible, unspecified, and error prone. Rust without a borrow checker is a memory-management nightmare.

[–]pseudomonica 3 points4 points  (2 children)

I mean, this is C. You were going to need to manually track what was and wasn't owning anyway, and using a convention like "str is non-owning, str_buf is owning" is sensible

[–]jacobb11 -1 points0 points  (1 child)

Agree to disagree. I'm not using a C string library that separates ownership from strings and expects me to track it manually. But you do you.

[–]pseudomonica 2 points3 points  (0 children)

The C standard library does that. All of the functions that accept a char const* don’t care whether or not it’s owning, and functions like strerror are even worse! They return a char* and the spec is all “yeah don’t worry about deallocating this, just make sure no one else calls strerror while you’re using that char* and you should be fine”

(That function is not thread safe, ofc)

[–]seamsay 0 points1 point  (0 children)

Looks to me like STR is only meant to be used for string literals, but other than that I don't see any reason deallocation would be an issue. In fact, if I'm reading the post correctly, it seems that (de)allocation is explicitly meant to be handled outside of the str and str_buf interfaces.

[–]todo_code -5 points-4 points  (2 children)

I'm sorry, but if you create wrappers and code to work with null-terminated strings the same way you did for your non-null-terminated C strings, you will end up with exactly the same level of safety you quoted.

[–]InfinitePoints 11 points12 points  (1 child)

The point is that you only access the struct fields using helper functions, and the helper functions do bounds checks/similar for you.

As long as you never access the inner fields directly, the only memory safety issue left is use after free from borrowing stringbufs, but solving that would require a borrow checker.

[–]todo_code -5 points-4 points  (0 children)

You are missing the point. If you make a helper wrapper set for null-terminated vs non-terminated, you get the same thing...

[–]TimeSuck5000 -4 points-3 points  (3 children)

Just use C++ and std::string. We don’t need to be worrying about a few bytes here or there unless it’s an embedded system or something.

[–]stianhoiland 9 points10 points  (2 children)

Did you ahem read the article? At least checks notes three sentences of it?