
Initial service commit #1

Merged · 2 commits merged into main on Aug 4, 2022

Conversation

@simonj2 (Contributor) commented Aug 2, 2022

What

Initial commit of the search service.

Sorry for the large commit, but this allowed me to scope out the proposal better.

Contains:

  • Product document definition
  • Bulk import script
  • API definition
  • API implementation (partially implemented)
  • Docker, Elasticvue configuration
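The bulk import script itself isn't shown in this thread, but as a rough sketch of the idea (function and index names here are hypothetical, not taken from the PR), an Elasticsearch `_bulk` request body pairs one action line with one document line per product:

```python
import json

def bulk_payload(products, index="openfoodfacts"):
    """Build the newline-delimited body for Elasticsearch's _bulk API:
    one action line followed by one document line per product."""
    lines = []
    for product in products:
        # Action line: index this document under its product code.
        lines.append(json.dumps({"index": {"_index": index, "_id": product["code"]}}))
        # Document line: the product itself.
        lines.append(json.dumps(product))
    # The _bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"

payload = bulk_payload([{"code": "123", "product_name": "Coffee"}])
```

In practice the official Python client's bulk helpers would handle this framing, but the wire format is the two-line pattern above.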

README.md Outdated

Make sure your environment is configured:
```commandline
export ELASTIC_PASSWORD=PASSWORD_HERE
```
Contributor:
Could you also suggest sensible defaults for the other variables? Or provide an env file maybe?
e.g. for MEM_LIMIT, STACK_VERSION, CLUSTER_NAME, ES_PORT etc.

Contributor:

```commandline
export STACK_VERSION=8.3.3
export CLUSTER_NAME=elasticcluster
export MEM_LIMIT=1g
export ES_PORT=7777
```

seem to be enough to launch the Docker containers.

Contributor Author:

Absolutely - there was an .env file but it wasn't committed due to the .gitignore. I'll add this now.

Note: I found that 2 GB of memory led to better performance, so I'll put that in. It still works well with 1 GB if we're memory constrained.
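Putting the values from this thread together, the committed .env file would presumably look something like this (the password is a placeholder, and MEM_LIMIT reflects the 2 GB suggestion above):

```commandline
ELASTIC_PASSWORD=PASSWORD_HERE
STACK_VERSION=8.3.3
CLUSTER_NAME=elasticcluster
MEM_LIMIT=2g
ES_PORT=7777
```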

README.md Outdated

To start the server:
```console
uvicorn api:app --reload
```
Contributor:

Sorry, I'm a bit lost, as unfortunately I know little Docker and Python. Where do I need to run uvicorn? In one of the Docker containers (elasticvue?), or in a Python virtual env outside of Docker, where I should install everything listed in requirements.txt?

If you could put a bit more detail in the Helpful commands section for newbies like me, it would be very helpful indeed! :)

Our python and docker expert is @alexgarel , but he's on vacation until August 16th.

Contributor Author:

Absolutely! I've now structured this so that the search service runs in Docker. I've also added commands both for using Docker and for running locally.

Note: as part of this I moved everything into an app directory, so the PR diff looks bigger than it is.

@stephanegigandet (Contributor):

That looks awesome, thank you very much @simonj2 !

The CSV export is very partial, so at some point it will be better to switch to the MongoDB export.

There are a few things that we'll need to think about at some point. In particular, taxonomized fields like the categories are not stored with searchable textual names. The list of categories is in categories_tags, an array with entries like "en:coffee". There is a field "categories" which can be in any language and should not be used for search.

One solution could be to include values for one language (e.g. for a French product, if there's categories_tags: "en:coffee", we also include categories: "café"). Then search would work for one language only (not so good for products sold in countries with multiple languages, like Belgium or Canada).

For language specific fields like product_name, we also have entries like product_name_en, product_name_fr etc. (product_name contains a copy of the value for the main language of the product).
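Under that flat layout, a lookup helper (hypothetical, not part of this PR) would prefer the language-suffixed field and fall back to the main-language copy:

```python
def get_localized(product, field, lang):
    # Prefer the language-suffixed field (e.g. product_name_fr); fall back
    # to the bare field, which holds the product's main-language value.
    return product.get(f"{field}_{lang}") or product.get(field)

product = {"product_name": "Coffee", "product_name_fr": "Café"}
french = get_localized(product, "product_name", "fr")  # "Café"
german = get_localized(product, "product_name", "de")  # falls back to "Coffee"
```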

@simonj2 (Contributor Author) commented Aug 2, 2022

Thanks for the feedback @stephanegigandet ! I've incorporated the comments you had about the code.

> The CSV export is very partial, so at some point it will be better to switch to the MongoDB export.

Makes sense - will look at this in the future.

> There are a few things that we'll need to think about at some point. In particular, taxonomized fields like the categories are not stored with searchable textual names. The list of categories is in categories_tags, an array with entries like "en:coffee". There is a field "categories" which can be in any language and should not be used for search.

> One solution could be to include values for one language (e.g. for a French product, if there's categories_tags: "en:coffee", we also include categories: "café"). Then search would work for one language (not so good for products sold in countries with multiple languages, like Belgium or Canada).

> For language specific fields like product_name, we also have entries like product_name_en, product_name_fr etc. (product_name contains a copy of the value for the main language of the product).

Interesting! I think this comes back to the schema discussion. Fields aren't discoverable, documented, or consistent (i.e., sometimes the language is embedded in the value, as in en:coffee, and sometimes it's in the field name, as in product_name_fr).

A unified schema for data representation could solve this, i.e., every field could be stored as a map of language code to value:

```json
{
  "product_name": {
    "en": "Quarter Pounder",
    "fr": "Royale with Cheese"
  }
}
```

This could then be returned to the clients directly, and a convenience API could be offered to return the correct value if a language parameter is provided. The same logic could apply for tags (filtering if a language parameter is provided).
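As a sketch of that convenience API (names here are hypothetical), resolving a lang-to-value map with a fallback to a default language is a one-liner:

```python
def resolve(field, lang, default_lang="en"):
    # Return the requested language's value from a lang -> value map,
    # falling back to the default language if it's missing.
    return field.get(lang, field.get(default_lang))

product_name = {"en": "Quarter Pounder", "fr": "Royale with Cheese"}
fr_name = resolve(product_name, "fr")  # "Royale with Cheese"
de_name = resolve(product_name, "de")  # falls back to "Quarter Pounder"
```

Because the fallback logic lives in one place, new fields added to the schema would get it for free, which is the main argument over per-field consolidation logic.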

I could write logic to do things like:

  • Pull the product_name_fr field into the product_name field in the response if a lang parameter is passed in with fr
  • Consolidate the tags at index time (ie as you suggest by using one language)

But I think that's probably fragile: new fields will break this, and we don't make our API consumers' lives any easier by consolidating fields.

Perhaps let's see how the schema work lands, then it could be incorporated here quite easily?

@stephanegigandet (Contributor) left a comment:

It looks very good to me, thank you!

@simonj2 merged commit b82776d into main on Aug 4, 2022
@teolemon mentioned this pull request on Aug 15, 2022