Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add common engsys breaking changes doc #5938

Merged
merged 2 commits into from
Jun 14, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 104 additions & 0 deletions doc/common/breaking_changes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
This document contains a list of known breaking changes in the engineering system tools (mainly `eng/common` directories) and available workarounds. Typically this takes the form of infrastructure changes that are not backwards compatible with older versions of the engineering system configs/scripts/tools, mainly due to dependency changes in our test agent images.
Copy link
Contributor

@konrad-jamrozik konrad-jamrozik Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I immediately have a question: how these breaking changes are surfaced to me if I make them? It would help me here if you could perhaps link to some examples; if you link to builds, then of course it would be good to retain them indefinitely (I do it all the time).

If the answer is "it depends on the tool that broke" then perhaps each breaking change should have examples, or at least some elaboration how I can verify the breaking change occurred. Like, if you share an error message snippet, I still need to know where I have to look for it. In fact, you did this in at least one place below:

This error may occur in pipelines that call git sparse-checkout without the --no-cone flag set.

Still, would be nice to link to one known occurrence of it.

Also, I understand the idea is that ideally we would put this doc under some aka.ms url and link to it from relevant errors? In general, what is the process of sharing this doc with people?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to include the offending error messages so if people search for it they will find this doc. I don't intend that people would start from this doc and then inspect their branch, but rather that it can be a troubleshooting tool when one sees these errors, which ideally are very rare. If people have an old branch that they need to release from, then they should attempt to fully update our common directory via git checkout main eng/common. If for some reason that's not possible, then they can attempt to cherry pick any commits they are missing from this page.

Typing this out, I think that I should update the top of this doc to instruct people to first try git checkout main eng/common, but it also makes the diff for their hotfix branch a lot larger, which may not be desired.

I don't want to have to worry about build retention so I've been trying to include enough of the log message and a description to make it clear what the context is.

Also, I understand the idea is that ideally we would put this doc under some aka.ms url and link to it from relevant errors? In general, what is the process of sharing this doc with people?

Usually these errors come out of tooling/dependencies we don't control and we don't know which errors we'll see in advance. The original intent of this doc was to make it easier for us to help people who come to the channel with build breaks, since we can just link them to a section on the markdown page. In an ideal world, this doc would be as small as possible and used as little as possible. We want to keep our pipelines backwards compatible, but we've been a bit unlucky recently.

Copy link
Contributor

@konrad-jamrozik konrad-jamrozik Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read what you wrote, but I still have a little bit of hard time conceptualizing the workflow here. The doc header says:

This document contains a list of known breaking changes in the engineering system tools (mainly eng/common directories) and available workarounds. Typically this takes the form of infrastructure changes that are not backwards compatible with older versions of the engineering system configs/scripts/tools, mainly due to dependency changes in our test agent images.

Plus here you wrote:

So I guess the doc title is somewhat misleading. It's not actually listing "Breaking Changes" but rather "Changes that are required for things to continue working." Not sure how I should title that (maybe "Known Backwards Compatibility Issues and Fixes" ???)

Basically I have problems answers the following questions:

  1. when, exactly, this doc is applicable? (Scope)
  2. if it is applicable, how will I (as a member of EngSys team) remember that I need to consult this doc? (Discoverability)

I am hoping we have a better answer to 1. than "potentially to any tooling or process owned by EngSys" and better answer to 2. than "You just need to memorize it is there and recall it while troubleshooting anything we own".

Regarding 1., you also wrote:

The original intent of this doc was to make it easier for us to help people who come to the channel with build breaks, since we can just link them to a section on the markdown page. In an ideal world, this doc would be as small as possible and used as little as possible. We want to keep our pipelines backwards compatible, but we've been a bit unlucky recently.

Sooo, is it about some pipelines? Can we scope-down/specify-down this file to something like "Troubleshooting guide for EngSys CI/CD build failures" and say something like "Consult this file if one of the pipelines from the following set fails [description of the pipeline set]" ?

Regarding 2.: maybe we could somehow hook to all relevant pipelines with some kind of workflow/decorator like "If this pipeline fails, then as part of error message, output link to aka.ms/[thisFile]" ? Not sure how technically feasible this is. If we have docs explaining our CI/CD setup, then perhaps we could add links there too, like "Troubleshooting CI/CD issues".

Re this:

I've been trying to include the offending error messages so if people search for it they will find this doc

So the expectation here is they will do Azure-org-wide search using GitHub for the error message snippet? I think a lot of folks won't do that.

In any case, not a blocker of course, just some thoughts how to make sure this guidance gets discovered and applied when appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to think of this as just-in-time documentation. As in we add things to it when we discover folks are hitting an issue that we know we have fixed in main and then we point folks to those instructions for the next person that might hit this. Trying to discover what may or may not break is not an exercise I think we need to be in which is why I like simply adding as needed.

As a member of EngSys we will simply have to know from experience or to search when we are seeing a particular issue.

Copy link
Contributor

@konrad-jamrozik konrad-jamrozik Apr 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: use our ADO wiki for such just-in-time docs. Less friction to incrementally improve.

Copy link
Contributor

@konrad-jamrozik konrad-jamrozik Apr 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to delete the corresponding breaking changes doc in the wiki to avoid duplication?
We could instead link from the breaking changes doc here to the wiki.

From what I understood docs in repo should be tightly coupled to the stuff they are documenting (like READMEs), and wiki is better suited for documenting cross-cutting concerns, so stuff like in this document.

My intuition about searching stuff is the opposite - folks generally understand that wiki is the place for docs to be, and first place to go to. Repository search is for next level, when one wants to dig into the code. Plus I am concerned about people knowing they need to search exactly this repository - how will they know? (in other words: which set is bigger, the set of SDK devs that don't know about the wiki, or set of SDK devs that don't know about the fact there are documents in this repo they need to look for). Or maybe there is some widespread culture of doing global search on Azure GitHub org that I am not aware of?

I also understood, from what Wes said, we don't really care about discoverability here, we just need to know about this doc. Then this removes main hypothetical downside of it being in a wiki.

But again - this is just me thinking aloud about the problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are fair points in favor of using the wiki for this doc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weshaggard thoughts on Konrad's response? You originally wanted this taken out of the wiki.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree discoverability is going to be an issue in either place and likely we will aways have to point people to it. However, I much prefer things live closer to where they have impact and if they can be in the public they should be there to potentially help the community. In this case these apply to eng/common changes so I'd prefer they live here, and it also potentially helps the community if they hit similar issues they can see the fixes we did. So I would still keep them in the MD file in the repo here and just remove the wiki entry or leave a link to the MD file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻


Table of Contents
=================

* [Credscan - You must install or update .NET to run this application.](#credscan---you-must-install-or-update-net-to-run-this-application)
* [Workaround](#workaround)
* [fatal: specify directories rather than patterns](#fatal-specify-directories-rather-than-patterns)
* [Workaround](#workaround-1)


## Credscan - You must install or update .NET to run this application.

This error comes from the "CredScan running" step in the Compliance stage:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the Compliance stage

This is a bit vague to me because I am missing context. Compliance stage of what? Some build? Of which pipeline? An example would be usfeful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, in this one it's pretty consistent which azure pipeline step this comes from. I will link to the config and add a screenshot.


```
##[error]You must install or update .NET to run this application.
```

On 4/5/2023, dotnet 3.1 was removed from the 1es managed images used for Azure Pipelines runs. Older versions of credscan required dotnet 3.1.

### Workaround

Fix made and synced to all repositories from https://github.com/Azure/azure-sdk-tools/pull/5918

To fix this issue, update your branch with the latest changes to `eng/common` in main, or cherry pick the following commit:

```
# azure-sdk-for-java
git cherry-pick 2f9cdd9319f1867707b5872f12d97d5cdbbb077a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you are being explicit to make this a simply copy and past but perhaps we just link to the tools PR and tell folks to find the sync merge. That would be more generic and isn't difficult to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern there is if you have to stack breaking changes on top of each other for an older branch, it could get pretty annoying to open up multiple sync PRs. It doesn't take me more than a minute to get all the cherry pick commits when I make a breaking change, since I already have the sync PRs open. But I can also see if we don't want the doc to get huge that it makes more sense to use the generic instruction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very worried about it at this point but something we might consider if we end up with more of these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think it's user friendly enough and it's going to create more work for us helping people. For example, I'm currently adding another breaking change referencing #5608. This PR has so many comments that the actual sync PR links are auto-hidden by the github UI. Then the user has to go and make sure to find the merge commit from the PR, not the actual PR commits. I think it's pretty likely we're going to have to help people troubleshoot either finding the right PR or the right commit, so it's going to be less effort to put the right commits into the markdown from the beginning. That way people can just click a link and copy a single command and there's less room for error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@konrad-jamrozik konrad-jamrozik Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, I'm currently adding another breaking change referencing #5608.

@benbp What did I break? 😖

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@konrad-jamrozik actually the opposite, you fixed something just in time (updating the codeowners tool version to the one built on net 6) because net 5 seems to have been removed from our build agents.

Copy link
Member Author

@benbp benbp Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess the doc title is somewhat misleading. It's not actually listing "Breaking Changes" but rather "Changes that are required for things to continue working." Not sure how I should title that (maybe "Known Backwards Compatibility Issues and Fixes" ???)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I updated the description to hopefully clarify what is meant by "breaking changes" and I added collapsible sections (nice find @weshaggard!).


# azure-sdk-for-js
git cherry-pick 35e20cecff34765bbb102377a9db6328d158104e

# azure-sdk-for-net
git cherry-pick 9f5d8f81b25cbec8a3b7d71dea71d7875843edd5

# azure-sdk-for-python
git cherry-pick 8506be4e8db37b197c77a557fb0a2b342c1074dc

# azure-sdk-for-go
git cherry-pick e9c77a5f47d5858f0b84d5965c9fa0db22548c6a

# azure-sdk-for-android
git cherry-pick eaa0814173137f897a1c28835a3cbe29705e221d

# azure-sdk-for-ios
git cherry-pick 0f0eab74bf3e30c5c591122ae6227274df376eef

# azure-sdk-for-c
git cherry-pick 20cb4f73ef885838938d45e142867151d3136356

# azure-sdk-for-cpp
git cherry-pick d55e2906077b6edcbb42d51ad2becd0492f1ffcf

# azure-sdk-tools
git cherry-pick 9d27383bfaf56d28e94f71537eb04d4f1ec38e3f
```

## fatal: specify directories rather than patterns

Error messages like:
- `fatal: specify directories rather than patterns`
- `##[error]ENOENT: no such file or directory, stat '/mnt/vss/_work/1/s/eng/common/scripts/job-matrix/Create-JobMatrix.ps1`

This error may occur in pipelines that call `git sparse-checkout` without the `--no-cone` flag set. After git v2.37.0 was released the default behavior of sparse checkout was changed to `cone` mode (see [docs](https://github.blog/2022-06-27-highlights-from-git-2-37/#tidbits)). Git was auto-updated on our pipeline agent VM images, so any pipelines run with engsys source code from before 7/11/2022 (e.g. old release branches) may fail with this error.

### Workaround

The core fix that was synced to all repositories is located here: https://github.com/Azure/azure-sdk-tools/pull/3606

To fix this issue, cherry pick the commit that updates the sparse checkout command to use `--no-cone` mode.

```
# azure-sdk-for-java
git cherry-pick 853649891196aa9ac3080d69081901341cce4332

# azure-sdk-for-js
git cherry-pick f51e095156c0a6d45969b118f3c0367777742a8f

# azure-sdk-for-net
git cherry-pick ae4171f5d65223082406776be23482d4353629bd

# azure-sdk-for-python
git cherry-pick 7b206165e15e384c414aad4ade46df5716e0553b

# azure-sdk-for-go
git cherry-pick 9401c5ea4b6be25a7f7b2ef2ded53c097bca903b

# azure-sdk-for-android
git cherry-pick c800fc5bf1e84442eb5e6222a0c80b679d7d8241

# azure-sdk-for-ios
git cherry-pick 2908f149c075736de45e8eb6379381ebfbac334c

# azure-sdk-for-c
git cherry-pick 42c94ef95a363baf8d1d27cb84cae8e100dda2d5

# azure-sdk-for-cpp
git cherry-pick 07a56bc5e3e728fbbcf37a84bb9d37fec1ee7efc

# azure-sdk-tools
git cherry-pick ae4171f5d65223082406776be23482d4353629bd
```