Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet][EPM] Save installed package assets in ES #83391

Merged
merged 22 commits into from
Dec 7, 2020

Conversation

jfsiii
Copy link
Contributor

@jfsiii jfsiii commented Nov 15, 2020

Summary

Store package assets (from Registry or local upload) in Elasticsearch. Related to proposal issue & document

  • New epm-packages-assets saved objects are stored on .kibana index, like our existing saved object epm-packages
  • Asset id is uuid v5 based on the package name, package version & file path. See 1974324
  • Add a list of IDs of all the installed assets, to epm-packages saved object. Like the existing installed_ properties. Example from a test
Mapping for new Saved Object

mappings: {
properties: {
package_name: { type: 'keyword' },
package_version: { type: 'keyword' },
install_source: { type: 'keyword' },
asset_path: { type: 'keyword' },
media_type: { type: 'keyword' },
data_utf8: { type: 'text', index: false },
data_base64: { type: 'binary' },
},
},

Additional property on existing epm-packages Saved Object

package_assets: {
type: 'nested',
properties: {
id: { type: 'keyword' },
type: { type: 'keyword' },
},
},

I don't think the saved object changes are strictly required. It can be removed without changing much about how things work

  • Pros:
    - Preserves accurate record of the assets added at installation time. Separates what assets are currently available for package-version from what was installed. They should be the same, but things happen.
    - Avoids a query to get the installed assets before operating on them
  • Cons:
    - size/noise? Could be tens or hundreds of ids
    - migration?

More details

When are saved objects added?
During installation, after all other actions have succeeded, just before marking the save object as installed, we commit all the files from the package to ES

const packageAssetRefs: PackageAssetReference[] = packageAssetResults.saved_objects.map(
(result) => ({
id: result.id,
type: ASSETS_SAVED_OBJECT_TYPE,
})
);

When are documents removed from the index?

In the removeInstallation function which is called in response to a DELETE /api/fleet/epm/packages/pkgkey

await removeArchiveEntries({ savedObjectsClient, refs: installation.package_assets });

or a failed package (re-)installation

await removeInstallation({ savedObjectsClient, pkgkey, callCluster });

How are we using these assets?
We're not, currently. Here's an example showing how we could update getFileHandler to check for local assets before reaching out to the Registry if we wished. It's not DRY, but it does work

const esDocRoot = `http://elastic:changeme@localhost:9200/${PACKAGE_ASSETS_INDEX_NAME}/_doc`;
const escapedDocId = encodeURIComponent(`${pkgName}-${pkgVersion}/${filePath}`);
const esRes = await fetch(`${esDocRoot}/${escapedDocId}`);
const esJson = await esRes.json();
if (esJson.found) {
  const asset: PackageAsset = esJson._source;
  const body = asset.data_utf8 || Buffer.from(asset.data_base64, 'base64');
  return response.ok({
    body,
    headers: {
      'content-type': asset.media_type,
      // should add our own `cache-control` header here
      // kibana default is prevents caching: `private, no-cache, no-store, must-revalidate`
      // https://github.com/elastic/kibana/issues/83631
    },
  });
}

Checklist

updated tests to include new saved object output, no tests added yet

@jfsiii jfsiii changed the title Initial pass at saving package assets in ES [Fleet][EPM] Initial pass at saving package assets in ES Nov 16, 2020
@jfsiii jfsiii marked this pull request as ready for review November 16, 2020 15:00
@jfsiii jfsiii requested a review from a team November 16, 2020 15:00
@jfsiii jfsiii self-assigned this Nov 16, 2020
@jfsiii jfsiii added release_note:skip Skip the PR/issue when compiling release notes v7.11.0 v8.0.0 labels Nov 16, 2020
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 16, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

path: { type: 'text' },
media_type: { type: 'keyword' },
data_utf8: { type: 'text' },
data_base64: { type: 'binary' },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with two fields because I thought we wanted binary type for the binary assets vs text but perhaps one field is possible/desirable?

package_version: { type: 'keyword' },
install_source: { type: 'keyword' },
path: { type: 'text' },
media_type: { type: 'keyword' },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also considered mime_type. Happy to hear any comments / suggestions.

@nchaulet
Copy link
Member

@jfsiii How are we going to consume these assets later? Do we want to use a generated id like pkgName:pkgVersion:path ?

@jfsiii
Copy link
Contributor Author

jfsiii commented Nov 16, 2020

How are we going to consume these assets later? Do we want to use a generated id like pkgName:pkgVersion:path ?

@nchaulet good question, perhaps we should discuss in #83426 so others see & comment?

I believe combining the pkgName, pkgVersion, path, and installSource properties should give only one result.

Copy link
Member

@nchaulet nchaulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@jfsiii jfsiii force-pushed the write-to-es-strawman branch from 19b9365 to cfdb0c9 Compare November 17, 2020 18:05
@jfsiii
Copy link
Contributor Author

jfsiii commented Nov 19, 2020

@elasticmachine merge upstream

@neptunian
Copy link
Contributor

I left a comment #83426 (comment) to see if there was some concern about having so many saved objects. If not, I'm happy for this to go in.

@jfsiii jfsiii force-pushed the write-to-es-strawman branch from 579f355 to 5ae363a Compare November 20, 2020 18:16
@jfsiii jfsiii requested review from a team as code owners November 20, 2020 18:16
@jfsiii
Copy link
Contributor Author

jfsiii commented Nov 21, 2020

@elasticmachine merge upstream

@joshdover
Copy link
Contributor

  • Saved object or plain ES documents?

    • Chose plain ES documents, but no strong feelings or data either way. Should be trivial to update the implementation.
  • Use a new index or existing one?

    • Chose a new public index. Chose public out of spirit of transparency, but perhaps it should be system or private?
  • What are the ramifications of a new index? Particularly a private or system index.

    • Do we need to follow any rules, notify other teams, etc
    • Ownership & access privileges on that new index?

The biggest cons to not using the .kibana SavedObjects index:

  • No import/export support
  • No spaces support (unsure if needed in this case?)
    • This also means no built-in RBAC support. Authorization rules would have to be custom enforced by the Fleet API endpoints, rather than using our built-in privileges model.
  • No system index protection
    • This can be accomplished but may require extra work in ES to support it. System indices help prevent accidental breakages on our indices by users. This helps eliminate an entire class of bugs that our customers run into.
  • Additional permissions need to be granted to the kibana_system role in Elasticsearch. For most customers, this is not an issue, but for any customers that use a custom role, this could be a problem. From my understanding, the main use case for a custom role was for usage with our "legacy multi-tenancy" feature which allowed users to have multiple Kibana installations on a single cluster. We are phasing this feature out in 8.0, so I don't believe this is a concern.

Without a strong justification, I lean heavily towards using the .kibana index rather than a new index. The only concern I can think of right now is going to be around document size due to the potentially large sizes of these files. Long ago we decided to store PDFs and PNGs created by reporting in a separate index. I'm not sure if this was also done because of the file size concerns or not. @kobelb do you have any context here?

@jfsiii
Copy link
Contributor Author

jfsiii commented Nov 30, 2020

@joshdover Thanks for the quick and helpful reply re: .kibana index. Does that mean you'd also suggest biasing towards using Saved Objects vs ES documents?

The initial commit used Saved Objects out of habit, but was switched to ES documents because I'm not familiar with the tradeoffs. Getting migrations seemed like a Good Thing ™️ but I didn't know about the impact that number (typically hundreds or thousands but could be tens of thousands) or size of documents. Distribution of the file sizes is shown in https://docs.google.com/document/d/18XoS6CSl9UxxPPBt9LXuJngf1Jv-4tl3jY6l19U1yH8/edit?ts=5fc4f129#heading=h.qnioeo56ltp4 if that helps. Let me know if I can provide any more context

@kobelb
Copy link
Contributor

kobelb commented Nov 30, 2020

The only concern I can think of right now is going to be around document size due to the potentially large sizes of these files. Long ago we decided to store PDFs and PNGs created by reporting in a separate index. I'm not sure if this was also done because of the file size concerns or not. @kobelb do you have any context here?

You're correct that we decided to store the PDFs and PNGs created by Reporting in a separate index because of file size concerns. Currently, Reporting uses a new index per week to allow users to implement their own retention policy for how long to keep these reports around.

In general, I think we should be defaulting to using saved-objects to store Kibana-specific assets until there's a substantial reason to not do so. #80912 includes some of the historic reasons for why saved-objects haven't been used, but if it's possible to address the underlying deficiencies in saved-objects, we should do so instead of resorting to a custom index.

How large of documents are we talking about for packages?

If we do decide to go the custom index route, it should be prefixed by a .. https://github.com/elastic/dev/issues/1525 has a lot of the context around this prescription.

@joshdover
Copy link
Contributor

Does that mean you'd also suggest biasing towards using Saved Objects vs ES documents?

Yes!

Getting migrations seemed like a Good Thing ™️ but I didn't know about the impact that number (typically hundreds or thousands but could be tens of thousands) or size of documents.

Migrations only get applied to objects that have an actual migration function registered, so this should not really be an issue, even with 10k objects.

Distribution of the file sizes

Most of these look mostly small (it appears 75% of the packages are less than 500kb). Are there any known reasons these will get larger in the future?

Also this text from the proposal:

Or “why not store only the compressed/zip file?” Primarily because the zip file is a binary asset and Elasticsearch stores text. Despite the name, the ‘binary’ type in Elasticsearch is still a string. The archive would have to be converted to a Base64 encoded string which increases the size by about 30%. So a 1MB zip becomes a 1.3MB string and a 15MB zip becomes a nearly 20MB string.

Have we actually verified that a base64 encoded zip file is larger than the unencoded strings of all the individual files? It's possible that the compression on the zip outweighs the 30% increase from base64 encoding.

From a long-term maintenance standpoint, I think it's better if we can limit the number of documents in Elasticsearch to just the zip'd packages. This will make removing & reinstalling any corrupted packages much simpler for our support team in the case that something breaks. I also think the runtime of unzipping a 1MB file is quite small and I don't imagine this to be a common operation (maybe I'm wrong?) that we need to optimize for.

@jfsiii
Copy link
Contributor Author

jfsiii commented Nov 30, 2020

Getting migrations seemed like a Good Thing ™️ but I didn't know about the impact that number (typically hundreds or thousands but could be tens of thousands) or size of documents.

Migrations only get applied to objects that have an actual migration function registered, so this should not really be an issue, even with 10k objects.

I could have worded this more clearly. I meant that migrations seemed like a good reason to use them but, migrations aside, I didn't know if the number & size of objects we'd be creating would be a concern. Doesn't seem so.

Most of these look mostly small (it appears 75% of the packages are less than 500kb). Are there any known reasons these will get larger in the future?

That's size of the individual files. Here are the current archive sizes

screenshot Screen Shot 2020-11-30 at 3 16 29 PM

The distribution might say the same, but I expect we will see an increase in packages at the larger end as we open to the public and get different types of integrations (e.g. example logs

Have we actually verified that a base64 encoded zip file is larger than the unencoded strings of all the individual files? It's possible that the compression on the zip outweighs the 30% increase from base64 encoding.

In my mind it's less about total size store on disk as it is minimum & median request. It's less taxing to transfer & process several smaller strings vs one large one. Especially because (de)serialization is CPU-intensive and often blocking. In my experience with web performance, time spent in JSON.parse/stringify and other string operations is a primary culprit.

This will make removing & reinstalling any corrupted packages much simpler for our support team in the case that something breaks

This should always be possible via a single action, whether an http call, a "delete by" query, or action in the UI.

I also think the runtime of unzipping a 1MB file is quite small and I don't imagine this to be a common operation (maybe I'm wrong?) that we need to optimize for.

We'll use the local assets on the package details page to avoid hitting the registry for installed assets. We don't have to have them stored individually to do this. We could simply keep the unzipped buffers in memory as we do now. It's more about what I mentioned above about keeping the transaction small.

@joshdover
Copy link
Contributor

You're correct that we decided to store the PDFs and PNGs created by Reporting in a separate index because of file size concerns. Currently, Reporting uses a new index per week to allow users to implement their own retention policy for how long to keep these reports around.

One difference here is that Reporting outputs could potentially grow indefinitely, whereas package assets are more finite. It seems unlikely that a user would install tens of thousands of packages (or that we'd ever even have that many in our registry). This makes file size potentially less of a concern, but I think we should verify it.

Could we do a simple test where we create 10,000 of these objects in the .kibana index with 1MB of zeros and measure the performance of Kibana and some basic queries (_search, GET /.kibana/_doc/<id>, etc.)? We have some load tests we could run to see if there is a performance impact compared to master.

In my mind it's less about total size store on disk as it is minimum & median request. It's less taxing to transfer & process several smaller strings vs one large one. Especially because (de)serialization is CPU-intensive and often blocking. In my experience with web performance, time spent in JSON.parse/stringify and other string operations is a primary culprit.

Makes sense, and I haven't spent time looking at how Fleet processes this data. If we store the decompressed output of files, wouldn't we still need to parse the JSON on the responses from ES? Even so, your point about minimum & median request times checks out. I thought the size distribution charts were the entire packages, not the files. I agree that decompressing & parsing 10MB+ during a request is not going to be a good experience.

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 3, 2020

@elasticmachine merge upstream

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 4, 2020

@elasticmachine merge upstream

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 6, 2020

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Distributable file count

id before after diff
default 46897 47662 +765

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 374.9KB 375.2KB +311.0B

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/development-plugin-saved-objects.html#_mappings

id before after diff
epm-packages 15 18 +3
epm-packages-assets - 8 +8
total +11

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@paul-tavares paul-tavares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM John. Thanks again for teh changes to the endpoint generator

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 7, 2020

How large of documents are we talking about for packages?

@kobelb I missed this question. I believe I addressed it in comments to @joshdover but here's some data on them

Zip files (not stored by this PR) ~50% from 100-200kB. ~30% 500kB-1MB. (1) 10MB file (aws)

Individual Zip file sizes (log scale)
Distribution of Zip file sizes

Individual assets (stored by this PR). ~80% under 5kB. ~90% under 10kb. ~5% over 100kB. ~1% over 500kB. Only 30 files or 0.3% over 1MB

Distribution of asset sizes (log scale)
Asset sizes w_images base64 encoded (log scale)

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 7, 2020

@joshdover I didn't get a chance to run the load tests with 10,000 Saved Objects, but I did run them with every currently available package installed. That resulted in 1951 saved objects

$ curl --user elastic:changeme "localhost:5601/cwl/api/saved_objects/_find?type=epm-packages-assets&per_page=5000" | jq '.total'
1951

This command

$ curl --user elastic:changeme "localhost:5601/cwl/api/saved_objects/_find?type=epm-packages-assets&per_page=5000" | jq '.saved_objects[] | .attributes | {name:.package_name, version:.package_version, source: .install_source, asset_path: .asset_path}' 

shows the assets installed (only name, version & path) https://gist.github.com/jfsiii/9b2c25fe204a7ebc1912692ed11c13e3

Here are the results of the load tests. They look quite similar to me
master.zip
pr.zip

@jfsiii
Copy link
Contributor Author

jfsiii commented Dec 7, 2020

I'm merging this so it's available for @skh, @neptunian, and others to use before Feature Freeze.

I believe I've addressed all the feedback from @kobelb, @joshdover, @legrego and others but please comment if there's anything remaining you'd like addressed. We'll add more tests and perhaps instrumentation in follow up PRs.

@jfsiii jfsiii merged commit 81a340e into elastic:master Dec 7, 2020
jfsiii pushed a commit that referenced this pull request Dec 7, 2020
## Summary
Store package assets (from Registry or local upload) in Elasticsearch. Related to proposal [issue](#83426) & [document](https://docs.google.com/document/d/18XoS6CSl9UxxPPBt9LXuJngf1Jv-4tl3jY6l19U1yH8)

 * New `epm-packages-assets` saved objects are stored on `.kibana` index, like our existing saved object `epm-packages`
 * Asset id is uuid v5 based on the package name, package version & file path. See 1974324
 * Add a list of IDs of all the installed assets, to `epm-packages` saved object. Like the existing `installed_` properties.  [Example](https://github.com/elastic/kibana/pull/83391/files#diff-fa07cac51b6a49bf1e4824bc2250c9a77dac6c7d6b0a56020f559ef1ff9be25fR491-R512) from a test

<details><summary>Mapping for new Saved Object</summary>

https://github.com/elastic/kibana/blob/37f7b6ded747edb5cc487661b801c5e1c0a102a7/x-pack%2Fplugins%2Ffleet%2Fserver%2Fsaved_objects%2Findex.ts#L329-L339
</details>

<details><summary>Additional property on existing <code>epm-packages</code> Saved Object</summary>

https://github.com/elastic/kibana/blob/c4f27ab25715a225c710727a37f5f105364e2f72/x-pack/plugins/fleet/server/saved_objects/index.ts#L306-L312

 I don't think the saved object changes are strictly required. It can be removed without changing much about how things work

- Pros: 
      - Preserves accurate record of the assets added at installation time. Separates what assets are currently available for package-version from what was installed. They _should_ be the same, but things happen.
      - Avoids a query to get the installed assets before operating on them
- Cons:
      - size/noise? Could be tens or hundreds of ids
      - migration?
</details>

### More details

**When are saved objects added?**
During installation, after all other actions have succeeded, just before marking the save object as installed, we commit all the files from the package to ES

https://github.com/elastic/kibana/blob/37f7b6ded747edb5cc487661b801c5e1c0a102a7/x-pack%2Fplugins%2Ffleet%2Fserver%2Fservices%2Fepm%2Fpackages%2F_install_package.ts#L193-L198

**When are documents removed from the index?**

In the `removeInstallation` function which is called in response to a `DELETE /api/fleet/epm/packages/pkgkey`

https://github.com/elastic/kibana/blob/37f7b6ded747edb5cc487661b801c5e1c0a102a7/x-pack%2Fplugins%2Ffleet%2Fserver%2Fservices%2Fepm%2Fpackages%2Fremove.ts#L72

or a failed package (re-)installation

https://github.com/elastic/kibana/blob/bf068739acce044ac27902253e8fc31df621f081/x-pack%2Fplugins%2Ffleet%2Fserver%2Fservices%2Fepm%2Fpackages%2Finstall.ts#L145




**How are we using these assets?**
We're not, currently. Here's an example showing how we could update [`getFileHandler`](https://github.com/elastic/kibana/blob/514b50e4c2d7a3be79d77e73838ff57b6cf1304a/x-pack%2Fplugins%2Ffleet%2Fserver%2Froutes%2Fepm%2Fhandlers.ts#L101) to check for local assets before reaching out to the Registry if we wished. It's not DRY, but it does work

```typescript
const esDocRoot = `http://elastic:changeme@localhost:9200/${PACKAGE_ASSETS_INDEX_NAME}/_doc`;
const escapedDocId = encodeURIComponent(`${pkgName}-${pkgVersion}/${filePath}`);
const esRes = await fetch(`${esDocRoot}/${escapedDocId}`);
const esJson = await esRes.json();
if (esJson.found) {
  const asset: PackageAsset = esJson._source;
  const body = asset.data_utf8 || Buffer.from(asset.data_base64, 'base64');
  return response.ok({
    body,
    headers: {
      'content-type': asset.media_type,
      // should add our own `cache-control` header here
      // kibana default is prevents caching: `private, no-cache, no-store, must-revalidate`
      // #83631
    },
  });
}
```

### Checklist
_updated tests to include new saved object output, no tests added yet_
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
install_started_at: { type: 'date' },
install_version: { type: 'keyword' },
install_status: { type: 'keyword' },
install_source: { type: 'keyword' },
},
},
},
[ASSETS_SAVED_OBJECT_TYPE]: {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to add this new type to allSavedObjectTypes so that your users are granted access to these new saved objects? It's probably not strictly necessary since you still require superuser privileges for Fleet, and superusers can do everything regardless of what you specify here

const allSavedObjectTypes = [
OUTPUT_SAVED_OBJECT_TYPE,
AGENT_POLICY_SAVED_OBJECT_TYPE,
PACKAGE_POLICY_SAVED_OBJECT_TYPE,
PACKAGES_SAVED_OBJECT_TYPE,
AGENT_SAVED_OBJECT_TYPE,
AGENT_EVENT_SAVED_OBJECT_TYPE,
ENROLLMENT_API_KEYS_SAVED_OBJECT_TYPE,
];

jfsiii pushed a commit that referenced this pull request Dec 15, 2020
* Revert "[Fleet][EPM] Save installed package assets in ES (#83391)"

This reverts commit 81a340e.

* Revert 00c2e96

Co-authored-by: Kibana Machine <[email protected]>
jfsiii pushed a commit that referenced this pull request Dec 15, 2020
* Revert "[Fleet][EPM] Save installed package assets in ES (#83391)"

This reverts commit 81a340e.

* Revert 00c2e96

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.11.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants