Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: smithy spuriously dropping valid configurations #2012

Closed
h-vetinari opened this issue Aug 5, 2024 · 33 comments
Closed

BUG: smithy spuriously dropping valid configurations #2012

h-vetinari opened this issue Aug 5, 2024 · 33 comments
Labels

Comments

@h-vetinari
Copy link
Member

h-vetinari commented Aug 5, 2024

This has been one of the most painful debugging sessions in a long time - the bug went away with random debug prints at random points in the code, even if those prints were long after the (presumed) point of failure.

The situation is as follows, wanting to add another configuration to clang-win-activation, like so:

 CLANG_VERSION:
   - 16.0.6
   - 17.0.6
   - 18.1.8
+  - 19.1.0.rc1

the combination of conda-smithy 3.37.2 and conda-build 24.7.1 would spuriously drop the builds for 16.0.6, rather than doing something purely additively. Adding a 5th line would bring back all 5 variants, but 4 somehow means only 3 variants get created. There's also some zipping involved, so I put the exact commit I was rerendering (over and over and over) here.

I ultimately managed to get this to render correctly without debug prints (on top of 3.37.2 & main) with the following diff:

--- a/conda_smithy/configure_feedstock.py
+++ b/conda_smithy/configure_feedstock.py
@@ -983,7 +983,7 @@ def _render_ci_provider(

         config = conda_build.config.get_or_merge_config(
             None,
-            exclusive_config_file=forge_config["exclusive_config_file"],
+            exclusive_config_file=deepcopy(forge_config["exclusive_config_file"]),
             platform=platform,
             arch=arch,
         )

Since this happens at the interface between smithy and conda-build, I cannot tell who is responsible here; whether conda-build incorrectly mutates state that's passed in, or whether smithy incorrectly assumes that state will not be mutated.

It's also possible that my diagnosis is still not correct, and that the deepcopy just happen to randomly enforce some collision/synchronization that's not there otherwise, but doesn't (fully) get rid of the bug yet (like the print statements before; sidenote, there was often - but not always - a difference that pprint.pprint would remove the bug, but bare print wouldn't).

@h-vetinari h-vetinari added the bug label Aug 5, 2024
@h-vetinari
Copy link
Member Author

@beckermr, any thoughts about this one? Would that patch make sense in your eyes? TBH I don't see how this could affect things (in the sense that forge_config["exclusive_config_file"] is just the filename, not the content), but at least it was 100% reproducible.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Did you fix the pythonhashseed when debugging?

@h-vetinari
Copy link
Member Author

Didn't need to - every run was perfectly reproducible in itself, even though there seemed to be no rhyme or reason which specific debug change caused it.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Well as we all know computers are not magic. Deep copying a string should do nothing.

Can you try a cast via "str(...)"?

Are we sure that dict entry is a string and not a Path object?

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

The debug prints fixing things versus not also seems like a red herring.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

To directly answer you, that patch makes no sense to me at all.

@h-vetinari
Copy link
Member Author

Well as we all know computers are not magic.

To that I counter "Any sufficiently advanced technology is indistinguishable from magic." 😛

IOW, just because it's reproducible doesn't meant it isn't (close to) magic. I wish python had a way to opt into pure functions... mutating self-referential state creates mind-bending puzzles. 😑

To directly answer you, that patch makes no sense to me at all.

Fair enough, I did point out in the OP that that's a distinct possibility (and that was even before I understood that the copied variable is just a string).

Can you try a cast via "str(...)"?

With

--- a/conda_smithy/configure_feedstock.py
+++ b/conda_smithy/configure_feedstock.py
@@ -983,7 +983,7 @@ def _render_ci_provider(

         config = conda_build.config.get_or_merge_config(
             None,
-            exclusive_config_file=forge_config["exclusive_config_file"],
+            exclusive_config_file=str(forge_config["exclusive_config_file"]),
             platform=platform,
             arch=arch,
         )

on top of 3.37.2, things fail again 🙃

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Yeah that's just crazy town right there. Whatever is happening it is subtle.

I don't think we should patch just yet only because I don't think we've actually found the bug.

Hrmmmmm.

@h-vetinari
Copy link
Member Author

The plot thickens further... in conda-forge/ctng-compilers-feedstock#148, even the bot gets it wrong (and the above patch doesn't help).

I parked a branch with that commit on my fork. The issue is that the rerender sets the c_stdlib_version to 2.17 universally in the variant configs, even though the CBC clearly sets one member of the key to 2.12.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Yeah this version always fails for me on the original issue (even with the patch)

PYTHONHASHSEED=100 CONDA_SMITHY_LOGLEVEL=debug conda-smithy rerender 

This version works without the patch above

PYTHONHASHSEED=4 CONDA_SMITHY_LOGLEVEL=debug conda-smithy rerender 

We should always try the hashseed with issues like this even if things appear stable.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

The value is not coming out of the function _conda_build_api_render_for_smithy in the list of metas so the bug is pretty low-level.

@h-vetinari
Copy link
Member Author

h-vetinari commented Aug 6, 2024

Yeah, just got something similar. I put the following debug statements in (among others),

hidden now because I was looking at the wrong variable
--- a/conda_smithy/configure_feedstock.py
+++ b/conda_smithy/configure_feedstock.py
@@ -1094,6 +1128,9 @@ def _render_ci_provider(
             if os.path.exists(_recipe_cbc + ".conda.smithy.bak"):
                 os.rename(_recipe_cbc + ".conda.smithy.bak", _recipe_cbc)

+        print("individual configs")
+        print(sorted(list({el["c_stdlib_version"] for meta in metas for el in meta[0].config.variants})))
+
         # render returns some download & reparsing info that we don't care about
         metas = [m for m, _, _ in metas]

@@ -1134,6 +1171,9 @@ def _render_ci_provider(
         fancy_platforms = []
         unfancy_platforms = set()

+        print("merged configs")
+        pprint.pprint(sorted(list({el["c_stdlib_version"] for metas in metas_list_of_lists for meta in metas for el in meta.config.variants})))
+
         configs = []
         for metas, platform, arch, enable, upload in zip(
             metas_list_of_lists,

and the result was

individual configs
['2.17']
individual configs
['2.17']
individual configs
['2.17']
individual configs
['2.17']
individual configs
[]                     <---- !
individual configs
['12']
merged configs
['12', '2.17']         <--- missing 2.12

Or maybe I was on the wrong track there, in the sense that 2.12 should have been one of he 2.17's... 🤷

@h-vetinari
Copy link
Member Author

h-vetinari commented Aug 6, 2024

FWIW, I had tried the clang-win one with conda-build 24.5 out of curiosity (seeing that I just ran into a regression w.r.t. CBC-handling in 24.7 today), but the result was the same.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

The bug is here: https://github.com/conda-forge/conda-smithy/blob/main/conda_smithy/configure_feedstock.py#L895

For whatever reason, conda-build is marking the missing variant as skipped.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

The original recipe has a jinja2 if statement around the skip:

  {% if clang_major|int == 16 and cl_minor|int >= 40 %}
  # VS 17.10 requires clang>=17
  skip: true
  {% endif %}

Maybe we convert that to selectors and try again?

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

the default jinja2 values are also suspicious:

{% set CLANG_VERSION = "16.0.6" %}
{% set CL_VERSION = "19.29" %}

@h-vetinari
Copy link
Member Author

Maybe we convert that to selectors and try again?

The problem is I didn't manage to make that combination work with selectors...

the default jinja2 values are also suspicious:

why?

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

So for selectors, this works fine, but is ofc hard coded:

skip: true  # [CLANG_VERSION == "16.0.6" and CL_VERSION == "19.40.33808"]

The ordering of when conda-build does the selectors vs jinja2 and it fills in values for jinja2 vars that are missing is a mystery to me. We shouldn't have to set defaults and yet we have to a lot. (edit to something less speculative)

@h-vetinari
Copy link
Member Author

We shouldn't have to set defaults and yet we have to a lot.

AFAIU that's primarily due to the linter, which doesn't run per target, but only once without any config, and so would fail if the jinja variables being used aren't set.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Adding a single random selector on some line like this

# [CLANG_VERSION and CL_VERSION]

causes things to work again.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Also smithy fails to render without the defaults, ending in an error inside conda-build. So it is not only the linter that needs the defaults.

@h-vetinari
Copy link
Member Author

The original recipe has a jinja2 if statement around the skip:

FWIW, the ctng-compilers recipe has no such jinja-skip. Perhaps the issues are separate though.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

I think the easiest path forward is to change the CBC to zip the version parts you need into the variant variables and use them in the selectors.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

One common thing is that both recipes set jinja2 default values that are also set in the variant configs. I wonder if that is causing wires to cross.

@h-vetinari
Copy link
Member Author

h-vetinari commented Aug 6, 2024

I think the easiest path forward is to change the CBC to zip the version parts you need into the variant variables and use them in the selectors.

As a temporary workaround or as a longer-term fix? I'd be fine with the former, not excited about the latter (not least because this used to work, and it's a legitimate use of the various mechanisms).

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Can you find a combination of current smithy and old conda-build where this does work? That would help eliminate sources of bugs.

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

@h-vetinari here is a minimal reproducer

conda/conda-build#5445

This confirms the bug is in conda-build, not smithy.

@h-vetinari
Copy link
Member Author

Thank you so much for chasing down that bug and coming up with a fix already! 🙏😊

@h-vetinari
Copy link
Member Author

One thing it does confirm though is that the issue in ctng-compilers is a separate one; I tried a local install of conda/conda-build#5447, and it still incorrectly bumps the c_stdlib_version. 😑

@beckermr
Copy link
Member

beckermr commented Aug 6, 2024

Oh boy yay fun! We'll have to keep digging.

@h-vetinari
Copy link
Member Author

h-vetinari commented Aug 6, 2024

I tried to reduce to an example like you did in conda/conda-build#5445, but I still see 2.12 in the conda rerender . output, even though it doesn't show in the variant configs if I run smithy. 🤔

So it looks like this is at least partly smithy's fault... Indeed, I had been looking at the wrong variable above; if I do the same with cross_target_stdlib_version, I get:

individual configs
['12', '2.12', '2.17']
individual configs
['12', '2.12', '2.17']
individual configs
['12', '2.12', '2.17']
individual configs
['12', '2.12', '2.17']
individual configs
[]                        # presumably this is osx, which we're skipping
individual configs
['12']
merged configs
['12', '2.12', '2.17']
Misdiagnosis

Chasing this down a bit further, we seem to be dropping cross_target_stdlib_version somewhere in this block:

# handle grouping from zip_keys for everything in conform_dict
zip_key_groups = []
if "zip_keys" in squished_variants:
zip_key_groups = squished_variants["zip_keys"]
if zip_key_groups and not isinstance(zip_key_groups[0], list):
zip_key_groups = [zip_key_groups]
zipped_configs = []
top_level_dimensions = []
for key in top_level_keys:
if key in accounted_for_keys:
# remove the used variables from the collection of all variables - we have them in the
# other collections now
continue
if any(key in group for group in zip_key_groups):
for group in zip_key_groups:
if key in group:
accounted_for_keys.update(set(group))
# create a list of dicts that represent the different permutations that are
# zipped together. Each dict in this list will be a different top-level
# config in its own file
zipped_config = []
top_level_config_dict = OrderedDict()
for idx, variant_key in enumerate(squished_variants[key]):
top_level_config = []
for k in group:
if k in top_level_keys:
top_level_config.append(
squished_variants[k][idx]
)
top_level_config = tuple(top_level_config)
if top_level_config not in top_level_config_dict:
top_level_config_dict[top_level_config] = []
top_level_config_dict[top_level_config].append(
{k: [squished_variants[k][idx]] for k in group}
)
# merge dicts with the same `key` if `key` is repeated in the group.
for _, variant_key_val in top_level_config_dict.items():
squished_dict = merge_list_of_dicts(variant_key_val)
zipped_config.append(squished_dict)
zipped_configs.append(zipped_config)
for k in group:
del squished_variants[k]
break
else:
# dimension slice is just this one variable, all other dimensions keep their variability
top_level_dimensions.append(
[{key: [val]} for val in squished_variants[key]]
)
del squished_variants[key]

IOW, after that block squished_variants does not contain that key anymore (except as part of the zip specification)

OrderedDict([('openmp_ver', ['4.5']),
             ('docker_image', ['quay.io/condaforge/linux-anvil-cos7-x86_64']),
             ('binutils_version', ['2.40']),
             ('channel_sources',
              ['conda-forge/label/sysroot-with-crypt,conda-forge']),
             ('libgomp_ver', ['1.0.0']),
             ('c_stdlib', ['sysroot']),
             ('channel_targets', ['conda-forge main']),
             ('c_stdlib_version', ('2.17',)),
             ('cdt_name', ('cos7',)),
             ('zip_keys',
              [['c_stdlib_version', 'cdt_name'],
               ['gcc_version', 'gcc_maj_ver', 'libgfortran_soname'],
               ['cross_target_stdlib_version',
                'cross_target_stdlib',
                'triplet',
                'old_triplet',
                'cross_target_platform']])])

whereas before that block it looks like:

OrderedDict([('openmp_ver', ['4.5']),
             ('target_platform', ['linux-s390x']),
             ('docker_image', ['quay.io/condaforge/linux-anvil-cos7-x86_64']),
             ('binutils_version', ['2.40']),
             ('channel_sources',
              ['conda-forge/label/sysroot-with-crypt,conda-forge']),
             ('libgomp_ver', ['1.0.0']),
             ('c_stdlib', ['sysroot']),
             ('channel_targets', ['conda-forge main']),
             ('gcc_version', ('12.4.0', '13.3.0', '14.1.0')),
             ('gcc_maj_ver', ('12', '13', '14')),
             ('libgfortran_soname', ('5', '5', '5')),
             ('c_stdlib_version', ('2.17',)),
             ('cdt_name', ('cos7',)),
             ('cross_target_stdlib_version',
              ('2.17', '2.12', '2.17', '12', '2.17')),                       <--- SEE HERE
             ('old_triplet',
              ('s390x-conda-linux-gnu',
               'x86_64-conda_cos6-linux-gnu',
               'aarch64-conda_cos7-linux-gnu',
               'x86_64-w64-mingw32',
               'powerpc64le-conda_cos7-linux-gnu')),
             ('cross_target_platform',
              ('linux-s390x',
               'linux-64',
               'linux-aarch64',
               'win-64',
               'linux-ppc64le')),
             ('triplet',
              ('s390x-conda-linux-gnu',
               'x86_64-conda-linux-gnu',
               'aarch64-conda-linux-gnu',
               'x86_64-w64-mingw32',
               'powerpc64le-conda-linux-gnu')),
             ('cross_target_stdlib',
              ('sysroot', 'sysroot', 'sysroot', 'm2w64-sysroot', 'sysroot')),
             ('zip_keys',
              [['c_stdlib_version', 'cdt_name'],
               ['gcc_version', 'gcc_maj_ver', 'libgfortran_soname'],
               ['cross_target_stdlib_version',
                'cross_target_stdlib',
                'triplet',
                'old_triplet',
                'cross_target_platform']])])

@h-vetinari
Copy link
Member Author

h-vetinari commented Aug 7, 2024

Nevermind, that was a misdiagnosis, the bug happens later still. After the block I mentioned, the correct config is still found in zipped_configs:

more aimless investigating
[[OrderedDict([('cross_target_stdlib_version', ['12']),
               ('cross_target_stdlib', ['m2w64-sysroot']),
               ('triplet', ['x86_64-w64-mingw32']),
               ('old_triplet', ['x86_64-w64-mingw32']),
               ('cross_target_platform', ['win-64'])]),
  OrderedDict([('cross_target_stdlib_version', ['2.17']),
               ('cross_target_stdlib', ['sysroot']),
               ('triplet', ['powerpc64le-conda-linux-gnu']),
               ('old_triplet', ['powerpc64le-conda_cos7-linux-gnu']),
               ('cross_target_platform', ['linux-ppc64le'])]),
  OrderedDict([('cross_target_stdlib_version', ['2.12']),                          <-----
               ('cross_target_stdlib', ['sysroot']),
               ('triplet', ['x86_64-conda-linux-gnu']),
               ('old_triplet', ['x86_64-conda_cos6-linux-gnu']),
               ('cross_target_platform', ['linux-64'])]),
  OrderedDict([('cross_target_stdlib_version', ['2.17']),
               ('cross_target_stdlib', ['sysroot']),
               ('triplet', ['s390x-conda-linux-gnu']),
               ('old_triplet', ['s390x-conda-linux-gnu']),
               ('cross_target_platform', ['linux-s390x'])]),
  OrderedDict([('cross_target_stdlib_version', ['2.17']),
               ('cross_target_stdlib', ['sysroot']),
               ('triplet', ['aarch64-conda-linux-gnu']),
               ('old_triplet', ['aarch64-conda_cos7-linux-gnu']),
               ('cross_target_platform', ['linux-aarch64'])])],
 [OrderedDict([('gcc_version', ['14.1.0']),
               ('gcc_maj_ver', ['14']),
               ('libgfortran_soname', ['5'])]),
  OrderedDict([('gcc_version', ['12.4.0']),
               ('gcc_maj_ver', ['12']),
               ('libgfortran_soname', ['5'])]),
  OrderedDict([('gcc_version', ['13.3.0']),
               ('gcc_maj_ver', ['13']),
               ('libgfortran_soname', ['5'])])]]

@h-vetinari
Copy link
Member Author

UGH, the solution is: the rerender works, and I managed to confuse myself in a cross-compilation recipe. What was happening is that we were doing the first rerender since we dropped cos6, and so the c_stdlib_version coming in from the global pinning got bumped (but not the cross_target_stdlib_version that's specified in the CBC).

Sorry about the hassle here. I think we can close this since we've identified the error for clang-win-activation, realised my error for ctng-compilers. That leaves just conda/conda-build#5443 of the recent compiler-related rendering regressions unresolved 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants