
Pyfunc in Practice: Creative Applications of MLflow Pyfunc in Machine Learning Projects #86

Merged
merged 16 commits into mlflow:main on Aug 16, 2024

Conversation

hugodscarvalho
Contributor

Description

This blog post demonstrates the capabilities of MLflow PyFunc and how it can be used in a machine learning project. The `mlflow.pyfunc` module offers creative freedom and flexibility: used well, it lets teams build complex systems encapsulated as a single MLflow model that follows the same lifecycle as traditional models. The post showcases how to create multi-model setups, connect seamlessly to databases, and implement a custom `fit` method using `mlflow.pyfunc`.

Additions

  1. Created a folder for the blog content containing:

    • index.md with the content of the blog.
    • Relevant illustrations.
  2. Added all authors' information and thumbnails to the correct locations.

Additional Information

Related to: #78


github-actions bot commented Jul 26, 2024

Preview for de747ea

  • For faster build, the doc pages are not included in the preview.
  • Redirects are disabled in the preview.

strategy_df (pd.DataFrame): DataFrame containing strategy information.
db_path (str): Path to the DuckDB database where strategies will be stored.
"""
try:
Member

Which part of this is expected to raise an error? Can we isolate the handling to the execute statements and keep the cursor cleanup in the finally block as shown?
(If the pd.concat call throws, we'll get an odd exception to deal with when con.close() is called on a reference that doesn't exist, even after the original exception is swallowed.)

Contributor Author

Thank you for the suggestion @BenWilson2, it makes total sense. I made the following changes:

  1. Error Handling for Concatenation: Added a dedicated try-except block around pd.concat() to handle errors that might occur when concatenating the DataFrames.

  2. Context Manager for Database Connection: Switched the approach to use a context manager (with statement) for the DuckDB connection. This way it automatically handles the connection cleanup, ensuring it's properly closed even if an error occurs during database operations.

  3. Early Exit on Concatenation Failure: If concatenation fails, the method returns early, avoiding database operations on a DataFrame that was never successfully built (see the sketch below).
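A minimal sketch of the revised flow, written here as a standalone function for simplicity (in the blog it is a method of the ensemble model class; the table name, exception types, and return behavior are assumptions):

```python
import duckdb
import pandas as pd


def add_strategy_and_save_to_db(
    existing_strategies: pd.DataFrame, strategy_df: pd.DataFrame, db_path: str
) -> pd.DataFrame:
    # 1. Isolate concatenation errors and exit early if one occurs.
    try:
        combined = pd.concat([existing_strategies, strategy_df], ignore_index=True)
    except (TypeError, ValueError) as exc:
        print(f"Failed to concatenate strategy DataFrames: {exc}")
        return existing_strategies

    # 2. The context manager closes the DuckDB connection even if an execute call fails.
    with duckdb.connect(db_path) as con:
        con.register("combined_strategies", combined)
        con.execute(
            "CREATE OR REPLACE TABLE strategies AS SELECT * FROM combined_strategies"
        )
    return combined
```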

self.models = {
"random_forest": (
RandomForestRegressor(random_state=42),
{"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
Member

It might be nice to leave a message to readers (without showing it all, for complexity's sake) about how ensemble models might go through hyperparameter tuning and how complex that can be. Perhaps mention that brute-force, full-scope tuning can be done in isolation to reduce the time complexity of searching all model permutations within a single ensemble trainer (i.e., tune the RF regressor and xgboost models with a wide search space and 2000 iterations, analyze the best ranges from the optimal validation metric scores to narrow the ensemble search space, repeat for all models, then run a Bayesian optimizer over the constrained search spaces to determine the optimum for each discrete model as a part of the whole).

While grid search is definitely simpler to reason about, it's not optimal, and most users will be wondering "how do I do this with optuna?".

Contributor Author

Added a note clarifying that grid search is used purely for demonstration and can be computationally costly and time-consuming. I've also highlighted the more advanced optimization techniques available in tools like Optuna and Hyperopt, which can search hyperparameters far more efficiently.
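For readers who want a concrete starting point, here is a minimal sketch (not part of the blog itself) of tuning a single sub-model with Optuna on toy data; the search ranges and trial count are illustrative assumptions:

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy regression data stands in for the blog's dataset.
X, y = make_regression(n_samples=200, n_features=10, random_state=42)


def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    model = RandomForestRegressor(random_state=42, **params)
    # Negative MSE, so higher is better and the study maximizes.
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```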

### Adding Strategies and Saving to the Database

The custom-defined `add_strategy_and_save_to_db` method enables the addition of new ensemble strategies to the model and their storage in a DuckDB database. This method accepts a pandas DataFrame containing the strategies and the database path as inputs. It appends the new strategies to the existing ones and saves them in the database specified during the initialization of the ensemble model. This method facilitates the management of various ensemble strategies and ensures their persistent storage for future use.

Member

Could the import block for all of the runnable code in this blog be added here to help readers execute the code examples contained herein?

Contributor Author

I was a bit uncertain about including imports in each block, since the focus was more on the definition and inner workings of each method, with the Bringing It All Together section covering execution. However, it makes total sense to include the necessary dependencies for each block so users can run them (especially since the Bringing It All Together section is changing as well).
My remaining doubt is about the first method: including all imports there would clutter the block with more imports than the method itself. Including specific imports in each section instead introduces some repetition across blocks, but it should make it easier for readers to execute the examples. Let me know what you think about this approach and whether you have any other suggestions.


As already highlighted in the previous method, a key feature of MLflow PyFunc models is the ability to define custom methods, providing significant flexibility and customization for various tasks. In the multi-model PyFunc setup, the `fit` method is essential for customizing and optimizing multiple sub-models. It manages the training and fine-tuning of algorithms such as `RandomForestRegressor`, `XGBRegressor`, `DecisionTreeRegressor`, `GradientBoostingRegressor`, and `AdaBoostRegressor`. In addition, by utilizing techniques like GridSearchCV, the `fit` method ensures that each model variant is optimized for maximum performance.
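As a rough illustration only (this is not the blog's actual `fit` method), looping GridSearchCV over a dictionary of sub-models on toy data might look like this; the reduced model set and parameter grids are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy data and a reduced model dictionary stand in for the blog's full setup.
X, y = make_regression(n_samples=200, n_features=5, random_state=42)
models = {
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 10]},
    ),
    "decision_tree": (
        DecisionTreeRegressor(random_state=42),
        {"max_depth": [None, 5, 10]},
    ),
}

# Fit each sub-model with its own grid search and keep the best estimator.
best_models = {}
for name, (estimator, param_grid) in models.items():
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X, y)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_)
```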

Member

Can we add a note that this is a method of the PythonModel subclass used for serialization and usage, not a general standalone function? Perhaps also show earlier that a skeleton of the class will be needed, with comments in its body marking where these methods go? (Just thinking about the "I want to follow along here and build this so I can try it out" reader.)

Contributor Author

I really liked the idea of including a skeleton for clarity. I've added a skeleton of the Python class to illustrate the structure and show where each method will be placed. This should help readers understand how each part fits together before we dive into the details of each method, and it makes clear that each one is a method of the PythonModel subclass.
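For reference, a skeleton along these lines (method names follow the discussion in this thread; the bodies are placeholders rather than the blog's actual implementation) could look like:

```python
import mlflow.pyfunc


class EnsembleModel(mlflow.pyfunc.PythonModel):
    """Skeleton of the ensemble model; each method below is filled in later."""

    def fit(self, X_train, y_train):
        # Custom method: train and tune each sub-model (RandomForest, XGBoost, ...).
        ...

    def add_strategy_and_save_to_db(self, strategy_df, db_path):
        # Custom method: persist ensemble strategies to the DuckDB database.
        ...

    def load_context(self, context):
        # Standard PyFunc hook: load serialized sub-models and artifacts at serving time.
        ...

    def predict(self, context, model_input):
        # Standard PyFunc hook: apply the requested strategy and return predictions.
        ...
```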

# Extract the strategy and drop it from the input features
print(f"Strategy: {model_input['strategy'].iloc[0]}")
strategy = model_input["strategy"].iloc[0]
model_input = model_input.drop(columns=["strategy"])
Member

Suggested change
-   model_input = model_input.drop(columns=["strategy"])
+   model_input.drop(columns=["strategy"], inplace=True)

Contributor Author

Thank you for pointing that out, updated it.

self, strategy_df: pd.DataFrame, db_path: str
) -> None:
"""
Adds new strategies from a DataFrame to the DuckDB database.
Member

Can an example of the strategy DF be supplied as part of the blog so that readers can run this?

Contributor Author

Makes total sense; added an example of the strategy DataFrame, including an overview of the schema.
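A purely hypothetical example of such a strategy DataFrame (the column names and values are assumptions for illustration, not the blog's exact schema):

```python
import pandas as pd

# Single-row strategy table: which sub-models to combine and with what weights.
strategies_df = pd.DataFrame(
    {
        "strategy": ["weighted_average"],
        "models": ["random_forest,xgboost"],
        "weights": ["0.6,0.4"],
    }
)
print(strategies_df)
```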


Having explored each method in detail, the next step is to integrate them to observe the complete implementation in action. This will offer a comprehensive view of how the components interact to achieve the objectives of the project.

Member

For this part, we could forgo the full example script and instead do a "build your own" with placeholders for each of the methods defined above. Users can then assemble the class in their own environment and run it to see the results.

Contributor Author

That's a really interesting idea! Since we've added the class skeleton, it's now easier for users to integrate the methods and customize them. I've removed the full example script and instead included a note guiding users to assemble the EnsembleModel class from the provided skeleton and add their own logic as needed (they can even extend its functionality).

joblib.dump(ensemble_model.preprocessor, "models/preprocessor.pkl")

# Define strategies for the ensemble model
strategy_data = {
Member

I left a comment above about having a sample of this. It might be helpful to show a single-row strategy example earlier that explains the required schema, to aid the reader's mental model while working through the sections above.

Contributor Author

As mentioned above, added an example of the strategy DataFrame, including an overview of the schema. Thank you.

# Add strategies to the database
ensemble_model.add_strategy_and_save_to_db(strategies_df, "models/strategies.db")

# Define the Conda environment configuration for the MLflow model
Member

Is this being done purely for the requirements? Isn't defining the requirements argument in the log_model call sufficient?

Contributor Author

You're totally right: defining the Conda environment as a dictionary directly in the log_model call is more efficient, and there's no need to use the file option. Updated.
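A hedged sketch of what that might look like (EnsembleModel refers to the skeleton sketched earlier; the pinned dependencies and artifact name are illustrative assumptions):

```python
import mlflow

# Environment passed inline as a dictionary, instead of writing a conda.yaml file.
conda_env = {
    "name": "ensemble_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.10",
        "pip",
        {"pip": ["mlflow", "scikit-learn", "xgboost", "duckdb", "pandas"]},
    ],
}

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="ensemble_model",
        python_model=EnsembleModel(),
        conda_env=conda_env,
    )
```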

@BenWilson2
Member

Added some relatively minor feedback. This looks great, is very well written, and teaches a lot of complex topics in a very understandable way. Excellent work here! Let me know when the responses have been submitted and we'll get this published! :)

@hugodscarvalho
Contributor Author

Thank you for the great feedback @BenWilson2! Really happy to contribute and hope the blog post will be a valuable resource for users. I’ve answered all the comments, and if you have any further suggestions, please let me know :)

@hugodscarvalho
Contributor Author

The latest changes focus on making terminology consistent throughout the code. Additionally, I noticed that some imports were missing and have included them. I also added a note and an "empty", properly documented line of code about configuring MLflow to point to the tracking server when logging the ensemble model, which was missing.
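For anyone following along, that configuration step is a one-liner along these lines (the URI is a placeholder, not a real server):

```python
import mlflow

# Point the client at your MLflow tracking server before logging the ensemble model.
mlflow.set_tracking_uri("http://localhost:5000")
```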

Signed-off-by: Ben Wilson <[email protected]>
@BenWilson2 BenWilson2 left a comment
Member

This is fantastic. Thanks for the contributions! It will go live with our 2.16.0 release (approximately 2.5 weeks from now).

@BenWilson2 BenWilson2 merged commit 24c751c into mlflow:main Aug 16, 2024
3 checks passed