Add PyBytes_Join() function #36

vstinner · 2024-07-23T15:06:53Z

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

Similar to sep.join(iterable) in Python.

sep must be a Python bytes object.

iterable must be an iterable object yielding objects that implement the buffer protocol.

On success, return a new bytes object. On error, set an exception and return NULL.
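For reference, the Python-level behaviour that the proposed function is described as mirroring can be sketched as follows. This is only an illustration of `bytes.join()` with buffer-protocol items; the variable names are made up for the example, and it does not call the C function itself.

```python
# Python-level behaviour that PyBytes_Join() is described as mirroring.
sep = b", "

# bytes, bytearray and memoryview all implement the buffer protocol.
parts = [b"alpha", bytearray(b"beta"), memoryview(b"gamma")]
joined = sep.join(parts)
print(joined)  # b'alpha, beta, gamma'

# A str item does not implement the buffer protocol, so the join fails.
error = None
try:
    sep.join([b"alpha", "beta"])
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError
```

The result is always a new bytes object, whatever the types of the items.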

UPDATE: Don't accept sep=NULL.

This differs from PyUnicode_Join(NULL, iterable), which treats a NULL separator as a space (' '). This PyUnicode_Join() behavior is not documented. The PyUnicode_Join() documentation only says:

Join a sequence of strings using the given separator and return the resulting Unicode string.
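The undocumented NULL behaviour can be observed from Python through `ctypes`. This is a minimal sketch, assuming a CPython interpreter where `ctypes.pythonapi` exposes the C API; declaring the first argument as `c_void_p` is just a trick so that `None` is passed as a NULL pointer.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
join = ctypes.pythonapi.PyUnicode_Join
# c_void_p for the separator lets Python's None stand for a C NULL pointer.
join.argtypes = [ctypes.c_void_p, ctypes.py_object]
join.restype = ctypes.py_object

words = ["spam", "eggs", "ham"]
with_null = join(None, words)  # equivalent to PyUnicode_Join(NULL, words)
print(repr(with_null))  # 'spam eggs ham'
```

The NULL separator gives the same result as `" ".join(words)`, not `"".join(words)`.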

encukou · 2024-07-24T09:44:04Z

Looks good to me! I agree that b' ' doesn't make much sense as a default for bytes.

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it, and PyBytes_Join docs should mention that the default is different.

vstinner · 2024-07-24T12:28:59Z

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

I agree. I don't think that it would be a good idea to change the behavior, because the function has existed in Python forever (e.g. it exists in Python 2.7 with NULL treated as a space).

malemburg · 2024-07-24T12:48:39Z

FYI: This originates from the default for string.join() (the module function) in Python 2. The default separator was a blank. It's been the default for PyUnicode_Join() ever since the API was added to Python.

Today, I would not allow for this corner case anymore, though. Passing in NULL as first argument is bound to mask potential errors in code.

picnixz · 2024-07-25T10:27:36Z

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

Isn't it possible to change its behaviour using the proper deprecation process? (Or is it impossible because it's part of the stable ABI, so we cannot touch it like this?)

encukou · 2024-07-25T11:09:02Z

It's possible, yes. But it would mean that everyone who uses it this way needs to update their code. It would be quite cruel of us to do it without a very good reason.

picnixz · 2024-07-25T11:21:24Z

Wouldn't the following be legitimate reasons: 1) it's not documented, 2) it's something introduced for Python 2, 3) it would be inconsistent with PyBytes_Join?

encukou · 2024-07-25T11:59:21Z

None of those are reasons to change it.
A reason to change it would be, as Mark said, that passing NULL can mask potential errors in code. IMO, that on its own is a rather weak reason to break any working code.

It doesn't really matter if it's documented or not. If the docs are missing, people read the code. Or they definitely did in the past. (Also see PEP-387: “Note that if something is not documented at all, it is not automatically considered private.”)
If it's been around since Python 2, there are decades of code that use it and would need to change. This is an argument against changing it.
If we want consistency, we should look at the newly added function: PyBytes_Join should use ASCII space by default, or not allow NULL at all.
But, we seem to agree that defaulting to b'' is worth the inconsistency. Bytes and text are different beasts; the default separator can be different too.

vstinner · 2024-07-27T19:42:28Z

All usages of the current private _PyBytes_Join() in the Python code base are with an empty bytes string, so IMO it's worth it and convenient to accept NULL and treat NULL as an empty bytes string. It avoids having to "create" the empty bytes string singleton, handle errors, etc.

malemburg · 2024-07-28T09:17:27Z

To avoid the same error masking issue as with PyUnicode_Join(), I'd suggest not using NULL as the default parameter, but instead using a separate macro PY_BYTES_EMPTY or perhaps even an interned and immortal singleton Py_BYTES_EMPTY (haven't checked whether we already have something like this).

malemburg · 2024-07-28T09:59:58Z

Had a look... we already have something like this in form of bytes_get_empty() in bytesobject.c. It's just not exposed in the public API.

picnixz · 2024-07-28T11:13:01Z

You can just do Py_GetConstant(Py_CONSTANT_EMPTY_BYTES) as well. This is how we publicly expose singletons.

malemburg · 2024-07-28T12:05:23Z

Thanks for mentioning this. I wasn't aware of that new API: https://docs.python.org/3.14/c-api/object.html#c.Py_GetConstant

Unfortunately, this returns a strong reference, so you'd still have to manage the object's reference count instead of just doing PyBytes_Join(Py_EMPTY_BYTES, iterator).

There is Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES), which could be used instead, but that seems very verbose for such a simple and common parameter value.

serhiy-storchaka · 2024-07-29T10:34:00Z

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

encukou · 2024-07-29T12:31:48Z

Or have the sep=NULL branch check PyErr_Occurred(), and raise a stern SystemError.
(I'm assuming the main error to worry about is using a return value of a Py* function without error checking.)

PyUnicode_Join could do that as well.

malemburg · 2024-07-29T12:56:37Z

Both solutions sound like a good alternative approach. Petr's version would even solve the potential issue with PyUnicode_Join().

My concern is mostly about passing in NULL as the first parameter, since you would normally pass in the object you want to work on as this parameter. A forgotten NULL check could then easily result in the join function doing its job and leaving a dangling error around, which would then show up at some later point in the execution of the program, which is really hard to debug. I've run into such issues too often to not pay close attention to this anymore.

While this can be an issue with other parameters as well, the first one is special, since working on NULLs is rather uncommon 😄
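The failure mode described here, and the PyErr_Occurred() guard suggested earlier in the thread, can be modelled in pure Python. This is a toy sketch only: `FakeInterpreter`, `pending_error`, `make_separator`, `join_lenient` and `join_guarded` are hypothetical names standing in for the C-level error indicator and the join functions, and `None` stands in for NULL.

```python
class FakeInterpreter:
    """Toy stand-in for the C-level error indicator (not a real CPython API)."""

    def __init__(self):
        self.pending_error = None

    def make_separator(self, raw):
        # Hypothetical helper: fails like a C function that sets an
        # exception and returns NULL (modelled here as None).
        if not isinstance(raw, str):
            self.pending_error = TypeError("separator must be str")
            return None
        return raw

    def join_lenient(self, sep, items):
        # Legacy behaviour: a missing separator silently becomes a space.
        if sep is None:
            sep = " "
        return sep.join(items)

    def join_guarded(self, sep, items):
        # Proposed guard: refuse to run if an error is already pending.
        if sep is None and self.pending_error is not None:
            raise SystemError("join called with NULL separator while an error is set")
        return self.join_lenient(sep, items)


interp = FakeInterpreter()
sep = interp.make_separator(42)                # fails; the result is not checked
masked = interp.join_lenient(sep, ["a", "b"])  # succeeds, the error stays pending

try:
    interp.join_guarded(sep, ["a", "b"])
    guarded_raised = False
except SystemError:
    guarded_raised = True
```

In the lenient version the join returns a plausible value and the original error surfaces only later; the guarded version fails at the call site instead.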

vstinner · 2024-07-29T16:58:20Z

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

I like this approach.

vstinner · 2024-07-30T12:57:26Z

I propose to:

Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string
Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set
Don't accept NULL in PyBytes_Join()

picnixz · 2024-07-30T13:03:38Z

I like those suggestions. When you say "don't accept NULL in PyBytes_Join", do you mean a simple assert? For PyUnicode_Join, do you mean calling PyErr_BadInternalCall() or having something more explicit (i.e., a better message)?

malemburg · 2024-07-30T14:00:46Z

I propose to:

Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string

Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set

Don't accept NULL in PyBytes_Join()

Sounds good.

vstinner · 2024-07-30T15:48:33Z

I mean PyErr_BadInternalCall() yes, raise SystemError.

picnixz · 2024-07-30T16:02:18Z

To summarize:

a. Make a PR where you call PyErr_BadInternalCall when calling PyUnicode_Join with a NULL.
b. Handle Py_None as being equivalent to "". In particular, we don't have a special casing for a whitespace " " anymore.

Should this change be backported to 3.12 and 3.13 as well without notice? Or should it only be a 3.14 change?

a. Update your PR for PyBytes_Join to call PyErr_BadInternalCall if NULL is passed as a separator.
b. Update your PR to accept Py_None as being equivalent to b"".

serhiy-storchaka · 2024-07-30T17:02:21Z

I think that in this case we may add a SystemError with a more specific error message (similar to those raised when a C-implemented function returns non-NULL with an error set). It can also be chained with the original exception. But this is an implementation detail.

I would prefer to add special references for empty str and bytes to the public C API, but using Py_None is the second best option. Definitely better than using NULL with different semantics.

After adding _PyLong_Zero and _PyLong_One I thought about adding corresponding global constants for empty string, bytes object, tuple, etc, but did not have enough use cases for them. Since then, evolution has gone in a different direction, _PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

encukou · 2024-07-30T21:42:45Z

It's all personal opinions now. As for me, I don't like using one Python object to stand in for another.

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'd prefer any of:

Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.
Using NULL to mean no separator (with the PyErr_Occurred() check, it's not that dangerous)
A separate one-arg function, like PyBytes_ConcatIterable

picnixz · 2024-07-31T09:15:45Z

I would prefer to add special references for empty str and bytes to the public C API,

If we can, I would also prefer it. Returning an empty string or using an empty string might be common enough (for instance search for PyUnicode_FromString("")) and it would reflect the usage in the code (like, you'll read the code and translate it in your head as "".join(...) for instance and not None.join(...)).

We seem to have #define emptystring (PyObject *)&_Py_SINGLETON(bytes_empty) for code objects (bytes_empty is in _Py_static_objects, but there does not seem to be a PyUnicodeObject containing the empty string). So maybe we could do the same for an empty string? (Or just expose the macro itself in a clearer way.)

erlend-aasland · 2024-07-31T21:28:21Z

I'm fine with either Py_None or Petr's first alternative:

Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.

vstinner · 2024-08-01T14:29:09Z

Accepting NULL is causing too much trouble:

Maybe passing NULL was not the intent, but the result of a failing function call. I don't want to add a PyErr_Occurred() check in this case; the worst errors don't even set an exception.
PyUnicode_Join(NULL, iterable) uses a space rather than an empty string.

I prefer to abandon the NULL idea at this point.

vstinner · 2024-08-01T14:31:01Z

@encukou:

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'm not aware of any existing C API doing that, so maybe Py_None is a bad idea here, especially because getting an empty bytes string became cheap and easy (Py_GetConstantBorrowed) in Python 3.13.

vstinner · 2024-08-01T14:33:27Z

Ok, let's vote on the simple API: sep must always be a Python bytes object (it cannot be NULL, it cannot be Py_None).

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

Similar to sep.join(iterable) in Python.
sep must be a Python bytes object.
iterable must be an iterable object yielding objects that implement the buffer protocol.
On success, return a new bytes object. On error, set an exception and return NULL.
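As a rough illustration of this contract, here is a hypothetical pure-Python model. `bytes_join_model` is a made-up name, `None` stands in for both NULL and Py_None, and the exception type raised for a bad separator is an assumption, since the proposal only says that an exception is set.

```python
def bytes_join_model(sep, iterable):
    """Hypothetical pure-Python model of the proposed PyBytes_Join() contract."""
    if not isinstance(sep, bytes):
        # Covers None as well (standing in for NULL and Py_None).
        # The exact exception type is an assumption of this sketch.
        raise TypeError(f"sep must be bytes, not {type(sep).__name__}")
    # Items may be any objects implementing the buffer protocol.
    return sep.join(iterable)


chunks = [b"GET", bytearray(b"/index.html"), memoryview(b"HTTP/1.1")]
request_line = bytes_join_model(b" ", chunks)
concatenated = bytes_join_model(b"", chunks)  # empty separator must be explicit

try:
    bytes_join_model(None, chunks)
    none_rejected = False
except TypeError:
    none_rejected = True
```

Callers that only want concatenation pass an explicit empty bytes object; there is no default separator.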

Vote:

malemburg · 2024-08-01T19:54:36Z

If we go with the above proposal, please add a macro to return a borrowed reference to the empty bytes constant (= Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)) and similarly for the empty Unicode constant.

vstinner · 2024-08-05T11:59:08Z

Ping @mdboom and @zooba for the vote ;-)

zooba · 2024-08-05T17:01:16Z

_PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

Less ready for a C API, true, but more ready for a generic native API (that can support languages other than C), as well as more ready for a thread-aware API. It was a worthwhile change.

please add a macro to return a borrowed reference to the empty bytes constant

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

picnixz · 2024-08-06T08:02:39Z

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

Could we do it for all known constants to be consistent? There are multiple places where empty str/bytes are being returned, and some files use local helpers for that. I think we can have a PR only for this (namely, implement a correspondence between constants and macros and remove those local helpers).

zooba · 2024-08-06T15:54:06Z

Provided there are no name conflicts, sure. Macros are cheap, and I believe all of these constants are already immortal/true-constant, which means there's no likely future where refcounting will actually matter.

We do want to deprecate functions that return borrowed references, as they make refcounting very complicated. But these constants are effectively tagged pointers now rather than live objects (the refcount is still writable, but properly-built extensions will leave it alone, and they are interpreter- and thread-agnostic), so whether strong or borrowed isn't a big deal.

vstinner · 2024-08-26T17:30:19Z

@mdboom: You didn't vote yet in #36 (comment) - what's your call on this API?

vstinner · 2024-08-27T08:43:52Z

The C API Working Group adopted the PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable) API, where sep must be a bytes object (cannot be NULL). I'm closing the issue and will update the PR python/cpython#121646.

vstinner mentioned this issue Jul 27, 2024

gh-121645: Add PyBytes_Join() function python/cpython#121646

Merged

vstinner closed this as completed Aug 27, 2024

This was referenced Sep 3, 2024

gh-123660: Internal macros for accessing empty string/bytes singletons python/cpython#123643

Closed

[pycore] internal macros for accessing empty string/bytes and other singletons python/cpython#123660

Closed
