Add support for .tgz files in GeoIpDownloader #70725

Merged
merged 20 commits into from
Mar 29, 2021

Conversation

probakowski
Contributor

We have to ship COPYRIGHT.txt and LICENSE.txt files alongside .mmdb files for legal compliance. Infra will pack these in a single .tgz (gzipped tar) archive provided by the GeoIP database service.
This change adds support for that format to GeoIpDownloader and DatabaseRegistry.

@probakowski probakowski added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.13.0 labels Mar 23, 2021
@probakowski probakowski requested a review from martijnvg March 23, 2021 12:02
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Mar 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@probakowski
Contributor Author

@elasticmachine update branch

@probakowski probakowski mentioned this pull request Mar 23, 2021
15 tasks
Member

@martijnvg martijnvg left a comment

Looks good. Left a few questions.


import java.util.Objects;

public class TarEntry {
Member

Maybe we should use a library that provides tar ball support? Like plexus-archiver or Apache commons-compress?

Contributor Author

I thought about it, but both libraries add quite a big footprint (commons-compress over 600kB, plexus over 200kB) just for this one use case.
I think tar is simple enough to parse ourselves, at least for basic cases, and I don't expect anything crazy from infra. That said, I can reconsider if you are really against creating our own solution here.

Member

I just realized that plexus-archiver actually depends on commons-compress. I agree it is a lot of additional bytes just for extracting a tarball, and the code itself isn't complex. On the other hand, about half of this PR is about adding support for extracting a tarball, and compared to the size of all our other dependencies (all jdk libraries and the jdk), I think about 600kB for commons-compress is just a small part of our total distribution.

I did find a lightweight library called jtar that adds support for just tar files, but that project doesn't seem very active and is maintained by basically a single developer.

I think I'm ok with maintaining this tar code ourselves, but it should be moved to the ingest-geoipv2 module, and it should be explained that it is not a general tar library and is only meant for extracting the tar files provided by the infra service.

Contributor Author

I moved it to the geoip module and made it simpler by skipping everything that is not needed for our use case.
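
For readers following this thread, here is a rough sketch of what "just enough tar" can look like. It is not the TarInputStream from this PR: the class name `MiniTarReader` and its `readAll` method are hypothetical, it assumes Java 11+ (for `InputStream.readNBytes(int)`), and it only relies on the standard 512-byte tar header layout (name at offset 0, octal size at offset 124, type flag at offset 156). Checksums, long names, and PAX extensions are deliberately ignored.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not the class added in this PR. */
final class MiniTarReader {
    private static final int BLOCK = 512;

    /** Reads every regular-file entry into memory, keyed by entry name. */
    static Map<String, byte[]> readAll(InputStream in) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        byte[] header = new byte[BLOCK];
        while (in.readNBytes(header, 0, BLOCK) == BLOCK) {
            if (isAllZero(header)) {
                break; // an all-zero block marks the end of the archive
            }
            String name = field(header, 0, 100);
            // The size field holds octal digits, terminated by NUL or space.
            long size = Long.parseLong(field(header, 124, 12).trim(), 8);
            byte type = header[156];

            byte[] data = in.readNBytes(Math.toIntExact(size));
            if (data.length != size) {
                throw new EOFException("truncated entry: " + name);
            }
            // Entry data is padded with zeros up to the next 512-byte boundary.
            int padding = (int) ((BLOCK - size % BLOCK) % BLOCK);
            in.readNBytes(padding);

            // '0' (or NUL in very old archives) means a regular file; skip the rest.
            if (type == '0' || type == 0) {
                files.put(name, data);
            }
        }
        return files;
    }

    /** Decodes a NUL-terminated ASCII header field. */
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.US_ASCII);
    }

    private static boolean isAllZero(byte[] block) {
        for (byte b : block) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }
}
```

For a .tgz, the same method can be fed a `java.util.zip.GZIPInputStream` wrapped around the download stream. Buffering whole entries in memory is a simplification that would not suit large .mmdb files; a streaming design like the one discussed above avoids that.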

@probakowski probakowski requested a review from martijnvg March 25, 2021 19:40
Member

@martijnvg martijnvg left a comment

LGTM

* {@link InputStream} with very basic support for tar format, just enough to parse archives provided by GeoIP database service from Infra.
* This class is not suitable for general purpose tar processing!
*/
class TarInputStream extends FilterInputStream {
Member

👍

if (name.startsWith(databaseName)) {
    Files.copy(is, databaseTmpFile, StandardCopyOption.REPLACE_EXISTING);
} else {
    Files.copy(is, geoipTmpDirectory.resolve(databaseName + "_" + name), StandardCopyOption.REPLACE_EXISTING);
Member

The removeStaleEntries() method doesn't remove these files, but I think that is ok.
These are small files and each update overwrites them. During node startup the
geoip tmp dir is purged, and removeStaleEntries() is only invoked if at some
point in time we stop distributing databases by default or custom dbs are shipped
by a third party database webserver.

Contributor Author

Yes, that was exactly my thinking. I think we can leave them behind, as they are small and don't interfere with the geoip processor itself.
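
The entry handling discussed in this thread can be sketched in isolation as follows. This is only an illustration built from the snippet under review: the class `EntryRouting` and its methods `targetFor` and `store` are hypothetical names, while the branching on `name.startsWith(databaseName)` and the `databaseName + "_" + name` file name come from the quoted diff.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration only; names are not from the PR. */
final class EntryRouting {

    /**
     * Picks the destination for one archive entry: the database itself goes to
     * its tmp file, anything else (for example LICENSE.txt) is stored next to
     * it with the database name as a prefix.
     */
    static Path targetFor(String entryName, String databaseName,
                          Path databaseTmpFile, Path geoipTmpDirectory) {
        if (entryName.startsWith(databaseName)) {
            return databaseTmpFile;
        }
        return geoipTmpDirectory.resolve(databaseName + "_" + entryName);
    }

    /** Copies the entry content to its destination, overwriting any older copy. */
    static Path store(InputStream content, String entryName, String databaseName,
                      Path databaseTmpFile, Path geoipTmpDirectory) throws IOException {
        Path target = targetFor(entryName, databaseName, databaseTmpFile, geoipTmpDirectory);
        Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

Because `REPLACE_EXISTING` is used, a later update simply overwrites the previous license and copyright files, which is why leaving them outside removeStaleEntries() does not cause them to accumulate.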

@probakowski
Contributor Author

@elasticmachine run elasticsearch-ci/1

@probakowski probakowski merged commit b025f51 into elastic:master Mar 29, 2021
@probakowski probakowski deleted the geoi-tar branch March 29, 2021 10:46
probakowski added a commit to probakowski/elasticsearch that referenced this pull request Mar 29, 2021
# Conflicts:
#	modules/ingest-geoip/src/internalClusterTest/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderIT.java
#	modules/ingest-geoip/src/test/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTests.java
probakowski added a commit that referenced this pull request Mar 29, 2021
* Add support for .tgz files in GeoIpDownloader (#70725)

Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team v7.13.0 v8.0.0-alpha1
5 participants