Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PyBytes_Join() function #36

Closed
vstinner opened this issue Jul 23, 2024 · 35 comments
Closed

Add PyBytes_Join() function #36

vstinner opened this issue Jul 23, 2024 · 35 comments

Comments

@vstinner
Copy link

vstinner commented Jul 23, 2024

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

Similar to sep.join(iterable) in Python.

sep must be Python bytes object.

iterable must be an iterable object yielding objects that implement the buffer protocol.

On success, return a new bytes object. On error, set an exception and return NULL.

UPDATE: Don't accept sep=NULL.


It's different than PyUnicode_Join(NULL, iterable) which treats NULL separator as a whitespace (' '). This PyUnicode_Join() behavior is not documented. The PyUnicode_Join() documentation only says:

Join a sequence of strings using the given separator and return the resulting Unicode string.

@encukou
Copy link
Collaborator

encukou commented Jul 24, 2024

Looks good to me! I agree that b' ' doesn't make much sense as a default for bytes.

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it, and PyBytes_Join docs should mention that the default is different.

@vstinner
Copy link
Author

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

I agree. I don't think that it would be a good idea to change the behavior because the function exists since forever in Python (ex: it exists in Python 2.7 with NULL treated as a whitespace).

@malemburg
Copy link

FYI: This originates from the default for string.join() (the module function) in Python 2. The default separator was a blank. It's been the default for PyUnicode_Join() ever since the API was added to Python.

Today, I would not allow for this corner case anymore, though. Passing in NULL as first argument is bound to mask potential errors in code,

@picnixz
Copy link

picnixz commented Jul 25, 2024

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

Isn't it possible to change its behaviour using the proper deprecation process? (or is it impossible because it's part of the stable ABI that we cannot touch it like this?)

@encukou
Copy link
Collaborator

encukou commented Jul 25, 2024

It's possible, yes. But it would mean that everyone who uses it this way needs to update their code. It would be quite cruel of us to do it without a very good reason.

@picnixz
Copy link

picnixz commented Jul 25, 2024

Woulnd't be the following be legitimate reasons: 1) it's not documented 2) it's something introduced for Python 2, 3) it would be inconsistent with PyBytes_Join?

@encukou
Copy link
Collaborator

encukou commented Jul 25, 2024

None of those are reasons to change it.
A reason to change it would be, as Mark said, that passing NULL can mask potential errors in code. IMO, that on its own is a rather weak reason to break any working code.

  1. It doesn't really matter if it's documented or not. If the docs, are missing, people read the code. Or they definitely did it in the past. (Also see PEP-387: “Note that if something is not documented at all, it is not automatically considered private.”)
  2. If it's been around since Python 2, there are decades of code that use it and would need to change. This is an argument against changing it.
  3. If we want consistency, we should look at the newly added function: PyBytes_Join should use ASCII space by default, or not allow NULL at all.
    But, we seem to agree that defaulting to b'' is worth the inconsistency. Bytes and text are different beasts; the default separator can be different too.

@vstinner
Copy link
Author

vstinner commented Jul 27, 2024

All usages of the current private _PyBytes_Join() in the Python code base are with an empty bytes string, so IMO it's worth it and convenient to accept NULL and treat NULL as an empty bytes string. It avoids having to "create" the empty bytes string singleton, handle errors, etc.

@malemburg
Copy link

malemburg commented Jul 28, 2024

To avoid the same error masking issue as with PyUnicode_Join() I'd suggest to not use NULL as default parameter, but instead a use separate macro PY_BYTES_EMPTY or perhaps even an interned and immortal singleton Py_BYTES_EMPTY (haven't checked whether we already have something like this).

@malemburg
Copy link

Had a look... we already have something like this in form of bytes_get_empty() in bytesobject.c. It's just not exposed in the public API.

@picnixz
Copy link

picnixz commented Jul 28, 2024

You can just do Py_GetConstant(Py_CONSTANT_EMPTY_BYTES) as well. This is how we publicly expose singletons.

@malemburg
Copy link

Thanks for mentioning this. I wasn't aware of that new API: https://docs.python.org/3.14/c-api/object.html#c.Py_GetConstant

Unfortunately, this returns a strong reference, so you'd still have the ref count manage the object instead of just doing PyBytes_Join(Py_EMPTY_BYTES, iterator)

There is Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES), which could be used instead, but that seems very verbose for such a simple and common parameter value.

@serhiy-storchaka
Copy link

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

@encukou
Copy link
Collaborator

encukou commented Jul 29, 2024

Or have the sep=NULL branch check PyErr_Occurred(), and raise a stern SystemError.
(I'm assuming the main error to worry about is using a return value of a Py* function without error checking.)

PyUnicode_Join could do that as well.

@malemburg
Copy link

Both solutions sound like a good alternative approach. Petr's version would even solve the potential issue with PyUnicode_Join().

My concern is mostly about passing in NULL as the first parameter, since you normally would pass in the object you want to work on as this parameter. A forgotten NULL check could then easily result in the join function doing it's job and leaving a dangling error around which would then show up at some later point in the execution of the program - which is really hard to debug. I've run into such issues too often to not pay close attention to this anymore.

While this can be an issue with other parameters as well, the first one is special, since working on NULLs is rather uncommon 😄

@vstinner
Copy link
Author

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

I like this approach.

@vstinner
Copy link
Author

I propose to:

  • Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string
  • Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set
  • Don't accept NULL in PyBytes_Join()

@picnixz
Copy link

picnixz commented Jul 30, 2024

I like those suggestions. When you say "don't accept NULL in PyBytes_Join", do you mean a simple assert? For PyUnicode_Join, do you mean calling PyErr_BadInternalCall() or having something more explicit (i.e., a better message)?

@malemburg
Copy link

I propose to:

  • Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string
  • Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set
  • Don't accept NULL in PyBytes_Join()

Sounds good.

@vstinner
Copy link
Author

I mean PyErr_BadInternalCall() yes, raise SystemError.

@picnixz
Copy link

picnixz commented Jul 30, 2024

To summarize:

  1. a. Make a PR where you call PyErr_BadInternalCall when calling PyUnicode_Join with a NULL.
    b. Handle Py_None as being equivalent to "". In particular, we don't have a special casing for a whitespace " " anymore.

Should this change be backported to 3.12 and 3.13 as well without notice? Or should it only be a 3.14 change?

  1. a. Update your PR for PyBytes_Join to call PyErr_BadInternalCall if NULL is passed as a separator.
    b. Update your PR to accept Py_None as being equivalent to b"".

@serhiy-storchaka
Copy link

I think that in this case we may add a SystemError with more specific error message (similar to these that are raised when C implemented function returns non-NULL with an error set). It can also be chained with the original exception. But this is an implementation detail.

I would prefer to add special references for empty str and bytes to the public C API, but using Py_None is the second best option. Definitely better than using NULL with different semantic.

After adding _PyLong_Zero and _PyLong_One I thought about adding corresponding global constants for empty string, bytes object, tuple, etc, but did not have enough use cases for them. Since then, evolution has gone in a different direction, _PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

@encukou
Copy link
Collaborator

encukou commented Jul 30, 2024

It's all personal opinions now. As for me, I don't like using one Python object to stand in for another.

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'd prefer any of:

  • Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.
  • Using NULL to mean no separator (with the PyErr_Occurred() check, it's not that dangerous)
  • A separate one-arg function, like PyBytes_ConcatIterable

@picnixz
Copy link

picnixz commented Jul 31, 2024

I would prefer to add special references for empty str and bytes to the public C API,

If we can, I would also prefer it. Returning an empty string or using an empty string might be common enough (for instance search for PyUnicode_FromString("")) and it would reflect the usage in the code (like, you'll read the code and translate it in your head as "".join(...) for instance and not None.join(...)).

We seem to have #define emptystring (PyObject *)&_Py_SINGLETON(bytes_empty) for code objects, (bytes_empty is in _Py_static_objects but there does not seem to have a PyUnicodeObject containing the empty string). So we could do the same for an empty string maybe? (or just expose the macro itself in a clearer way).

@erlend-aasland
Copy link

I'm fine with either Py_None or Petr's first alternative:

Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.

@vstinner
Copy link
Author

vstinner commented Aug 1, 2024

Accepting NULL is causing too much trouble:

  • Maybe passing NULL was not the intent, but the result of a failing function call. I don't want to add PyErr_Occurred() in this case, worst errors don't even set an exception.
  • PyUnicode_Join(NULL, iterable) uses a space rather than an empty string.

I prefer to abandon the NULL idea at this point.

@vstinner
Copy link
Author

vstinner commented Aug 1, 2024

@encukou:

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'm not aware of any existing C API doing that, so maybe Py_None is a bad idea here, especially because getting an empty bytes string became cheap and easy (Py_GetConstantBorrowed) in Python 3.13.

@vstinner
Copy link
Author

vstinner commented Aug 1, 2024

Ok, let's vote on the simple API: sep must always be a Python bytes object (it cannot be NULL, it cannot be Py_None).

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

  • Similar to sep.join(iterable) in Python.
  • sep must be Python bytes object.
  • iterable must be an iterable object yielding objects that implement the buffer protocol.
  • On success, return a new bytes object. On error, set an exception and return NULL.

Vote:

@malemburg
Copy link

If we go with the above proposal, please add a macro to return a borrowed reference to the empty bytes constant (= Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)) and similarly for the empty Unicode constant.

@vstinner
Copy link
Author

vstinner commented Aug 5, 2024

Ping @mdboom and @zooba for the vote ;-)

@zooba
Copy link

zooba commented Aug 5, 2024

_PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

Less ready for a C API, true, but more ready for a generic native API (that can support languages other than C), as well as more ready for a thread-aware API. It was a worthwhile change.

please add a macro to return a borrowed reference to the empty bytes constant

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

@picnixz
Copy link

picnixz commented Aug 6, 2024

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

Could we do it for all known constants to be consistent? there are multiple places where empty str/bytes are being returned and some files use local helpers for that. I think we can have a PR only for this (namely implement a correspondence between constants and macros and remove those local helpers).

@zooba
Copy link

zooba commented Aug 6, 2024

Provided there are no name conflicts, sure. Macros are cheap, and I believe all of these constants are already immortal/true-constant, which means there's no likely future where refcounting will actually matter.

We do want to deprecate functions that return borrowed references, as they make refcounting very complicated. But these constants are effectively tagged pointers now rather than live objects (the refcount is still writable, but properly-built extensions will leave it alone, and they are interpreter- and thread-agnostic), so whether strong or borrowed isn't a big deal.

@vstinner
Copy link
Author

@mdboom: You didn't vote yet in #36 (comment) - what's your call on this API?

@vstinner
Copy link
Author

The C API Working Group adopted PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable) API where sep must be a bytes object (cannot be NULL). I close the issue and I will update the PR python/cpython#121646.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants