
Pyfunc in Practice: Creative Applications of MLflow Pyfunc in Machine Learning Projects #86

Merged
merged 16 commits into mlflow:main on Aug 16, 2024

Conversation

hugodscarvalho
Contributor

Description

This blog post demonstrates the capabilities of MLflow PyFunc and how it can be used in a machine learning project. The `mlflow.pyfunc` module offers creative freedom and flexibility: used well, it lets teams build complex systems encapsulated as a single MLflow model that follows the same lifecycle as traditional models. The post showcases how to create multi-model setups, connect seamlessly to databases, and implement a custom `fit` method using `mlflow.pyfunc`.

Additions

  1. Created a folder for the blog content containing:

    • index.md with the content of the blog.
    • Relevant illustrations.
  2. Added all authors' information and thumbnails to the correct locations.

Additional Information

Related to: #78


github-actions bot commented Jul 26, 2024

Preview for de747ea

  • For faster build, the doc pages are not included in the preview.
  • Redirects are disabled in the preview.

strategy_df (pd.DataFrame): DataFrame containing strategy information.
db_path (str): Path to the DuckDB database where strategies will be stored.
"""
try:
Member

Which part of this is expected to raise an error? Can we isolate the handling to the execute statements and keep the cursor cleanup in the finally block as shown?
(If the pd.concat call throws, we'll get an odd exception to deal with when con.close() is called on a reference that doesn't exist, even after the original exception is swallowed.)

Contributor Author

Thank you for the suggestion @BenWilson2, it makes total sense. I made the following changes:

  1. Error Handling for Concatenation: Added a dedicated try-except block around pd.concat() to handle errors that might occur when concatenating the DataFrames.

  2. Context Manager for Database Connection: Switched the approach to use a context manager (with statement) for the DuckDB connection. This way it automatically handles the connection cleanup, ensuring it's properly closed even if an error occurs during database operations.

  3. Early Exit on Concatenation Failure: If concatenation fails, the method returns early, avoiding database operations on a DataFrame that was never successfully built (see the sketch below).
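A minimal sketch of the revised flow, written here as a standalone function for simplicity (in the blog it is a method of the ensemble model class; the table name, exception types, and return behavior are assumptions):

```python
import duckdb
import pandas as pd


def add_strategy_and_save_to_db(
    existing_strategies: pd.DataFrame, strategy_df: pd.DataFrame, db_path: str
) -> pd.DataFrame:
    # 1. Isolate concatenation errors and exit early if one occurs.
    try:
        combined = pd.concat([existing_strategies, strategy_df], ignore_index=True)
    except (TypeError, ValueError) as exc:
        print(f"Failed to concatenate strategy DataFrames: {exc}")
        return existing_strategies

    # 2. The context manager closes the DuckDB connection even if an execute call fails.
    with duckdb.connect(db_path) as con:
        con.register("combined_strategies", combined)
        con.execute(
            "CREATE OR REPLACE TABLE strategies AS SELECT * FROM combined_strategies"
        )
    return combined
```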

self.models = {
"random_forest": (
RandomForestRegressor(random_state=42),
{"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
Member

It might be nice to leave a message to readers (without showing it all, for complexity's sake) about how ensemble models might go through hyperparameter tuning and how complex that can be. Perhaps mention that brute-force, full-scope tuning can be done in isolation to reduce the time complexity of searching all model permutations within a single ensemble trainer (i.e., tune the RF regressor and xgboost models with a wide search space and 2000 iterations, analyze the best ranges from the optimal validation metric scores to narrow the ensemble search space, repeat for all models, then run a Bayesian optimizer over the constrained search spaces to determine the optimum for each discrete model as a part of the whole).

While grid search is definitely simpler to reason about, it's not optimal, and most users will be wondering "how do I do this with optuna?".

Contributor Author

Added a note clarifying that grid search is used purely for demonstration and can be computationally costly and time-consuming. I've also highlighted the more advanced optimization techniques available in tools like Optuna and Hyperopt, which can search hyperparameters far more efficiently.
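For readers who want a concrete starting point, here is a minimal sketch (not part of the blog itself) of tuning a single sub-model with Optuna on toy data; the search ranges and trial count are illustrative assumptions:

```python
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy regression data stands in for the blog's dataset.
X, y = make_regression(n_samples=200, n_features=10, random_state=42)


def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    model = RandomForestRegressor(random_state=42, **params)
    # Negative MSE, so higher is better and the study maximizes.
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```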

### Adding Strategies and Saving to the Database

The custom-defined `add_strategy_and_save_to_db` method enables the addition of new ensemble strategies to the model and their storage in a DuckDB database. This method accepts a pandas DataFrame containing the strategies and the database path as inputs. It appends the new strategies to the existing ones and saves them in the database specified during the initialization of the ensemble model. This method facilitates the management of various ensemble strategies and ensures their persistent storage for future use.

Member

Could the import block for all of the runnable code in this blog be added here to help readers execute the code examples contained herein?

Contributor Author

I was a bit uncertain about including imports in each block, since the focus was more on the definition and inner workings of each method, with the Bringing It All Together section covering execution. However, it makes total sense to include the necessary dependencies for each block so users can run them (especially since the Bringing It All Together section is changing as well).
My remaining doubt is about the first method: including all imports there would clutter the block with more imports than the method itself. Including specific imports in each section instead introduces some repetition across blocks, but it should make it easier for readers to execute the examples. Let me know what you think about this approach and whether you have any other suggestions.


As already highlighted in the previous method, a key feature of MLflow PyFunc models is the ability to define custom methods, providing significant flexibility and customization for various tasks. In the multi-model PyFunc setup, the `fit` method is essential for customizing and optimizing multiple sub-models. It manages the training and fine-tuning of algorithms such as `RandomForestRegressor`, `XGBRegressor`, `DecisionTreeRegressor`, `GradientBoostingRegressor`, and `AdaBoostRegressor`. In addition, by utilizing techniques like GridSearchCV, the `fit` method ensures that each model variant is optimized for maximum performance.
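As a rough illustration only (this is not the blog's actual `fit` method), looping GridSearchCV over a dictionary of sub-models on toy data might look like this; the reduced model set and parameter grids are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy data and a reduced model dictionary stand in for the blog's full setup.
X, y = make_regression(n_samples=200, n_features=5, random_state=42)
models = {
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 10]},
    ),
    "decision_tree": (
        DecisionTreeRegressor(random_state=42),
        {"max_depth": [None, 5, 10]},
    ),
}

# Fit each sub-model with its own grid search and keep the best estimator.
best_models = {}
for name, (estimator, param_grid) in models.items():
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X, y)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_)
```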

Member

Can we add a note that this is a method of the PythonModel subclass used for serialization and usage, not a general standalone function? Perhaps also show earlier that a skeleton of the class will be needed, with comments in its body marking where these methods go? (Just thinking about the "I want to follow along here and build this so I can try it out" reader.)

Contributor Author

I really liked the idea of including a skeleton for clarity. I've added a skeleton of the Python class to illustrate the structure and show where each method will be placed. This should help readers understand how each part fits together before we dive into the details of each method, and it makes clear that each one is a method of the PythonModel subclass.
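For reference, a skeleton along these lines (method names follow the discussion in this thread; the bodies are placeholders rather than the blog's actual implementation) could look like:

```python
import mlflow.pyfunc


class EnsembleModel(mlflow.pyfunc.PythonModel):
    """Skeleton of the ensemble model; each method below is filled in later."""

    def fit(self, X_train, y_train):
        # Custom method: train and tune each sub-model (RandomForest, XGBoost, ...).
        ...

    def add_strategy_and_save_to_db(self, strategy_df, db_path):
        # Custom method: persist ensemble strategies to the DuckDB database.
        ...

    def load_context(self, context):
        # Standard PyFunc hook: load serialized sub-models and artifacts at serving time.
        ...

    def predict(self, context, model_input):
        # Standard PyFunc hook: apply the requested strategy and return predictions.
        ...
```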

# Extract the strategy and drop it from the input features
print(f"Strategy: {model_input['strategy'].iloc[0]}")
strategy = model_input["strategy"].iloc[0]
model_input = model_input.drop(columns=["strategy"])
Member

Suggested change
-   model_input = model_input.drop(columns=["strategy"])
+   model_input.drop(columns=["strategy"], inplace=True)

Contributor Author

Thank you for pointing that out, updated it.

self, strategy_df: pd.DataFrame, db_path: str
) -> None:
"""
Adds new strategies from a DataFrame to the DuckDB database.
Member

Can an example of the strategy DF be supplied as part of the blog so that readers can run this?

Contributor Author

Makes total sense; added an example of the strategy DataFrame, including an overview of the schema.
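A purely hypothetical example of such a strategy DataFrame (the column names and values are assumptions for illustration, not the blog's exact schema):

```python
import pandas as pd

# Single-row strategy table: which sub-models to combine and with what weights.
strategies_df = pd.DataFrame(
    {
        "strategy": ["weighted_average"],
        "models": ["random_forest,xgboost"],
        "weights": ["0.6,0.4"],
    }
)
print(strategies_df)
```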


Having explored each method in detail, the next step is to integrate them to observe the complete implementation in action. This will offer a comprehensive view of how the components interact to achieve the objectives of the project.

Member

For this part, we could forgo the full example script and instead do a "build your own" with placeholders for each of the methods defined above. Users can then assemble the class in their own environment and run it to see the results.

Contributor Author

That's a really interesting idea! Since we've added the class skeleton, it's now easier for users to integrate the methods and customize them. I've removed the full example script and instead included a note guiding users to assemble the EnsembleModel class from the provided skeleton and add their own logic as needed (they can even extend its functionality).

joblib.dump(ensemble_model.preprocessor, "models/preprocessor.pkl")

# Define strategies for the ensemble model
strategy_data = {
Member

I left a comment above about having a sample of this. It might be helpful to show a single-row strategy example earlier that explains the required schema, to aid the reader's mental model while working through the sections above.

Contributor Author

As mentioned above, added an example of the strategy DataFrame, including an overview of the schema. Thank you.

# Add strategies to the database
ensemble_model.add_strategy_and_save_to_db(strategies_df, "models/strategies.db")

# Define the Conda environment configuration for the MLflow model
Member

Is this being done purely for the requirements? Isn't defining the requirements argument in the log_model call sufficient?

Contributor Author

You're totally right: defining the Conda environment as a dictionary directly in the log_model call is more efficient, and there's no need to use the file option. Updated.
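A hedged sketch of what that might look like (EnsembleModel refers to the skeleton sketched earlier; the pinned dependencies and artifact name are illustrative assumptions):

```python
import mlflow

# Environment passed inline as a dictionary, instead of writing a conda.yaml file.
conda_env = {
    "name": "ensemble_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.10",
        "pip",
        {"pip": ["mlflow", "scikit-learn", "xgboost", "duckdb", "pandas"]},
    ],
}

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="ensemble_model",
        python_model=EnsembleModel(),
        conda_env=conda_env,
    )
```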

@BenWilson2
Member

Added some relatively minor feedback. This looks great, is very well written, and teaches a lot of complex topics in a very understandable way. Excellent work here! Let me know when the responses have been submitted and we'll get this published! :)

@hugodscarvalho
Contributor Author

Thank you for the great feedback @BenWilson2! Really happy to contribute and hope the blog post will be a valuable resource for users. I’ve answered all the comments, and if you have any further suggestions, please let me know :)

@hugodscarvalho
Contributor Author

The latest changes focus on making terminology consistent throughout the code. Additionally, I noticed that some imports were missing and have included them. I also added a note and an "empty", properly documented line of code about configuring MLflow to point to the tracking server when logging the ensemble model, which was missing.
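For anyone following along, that configuration step is a one-liner along these lines (the URI is a placeholder, not a real server):

```python
import mlflow

# Point the client at your MLflow tracking server before logging the ensemble model.
mlflow.set_tracking_uri("http://localhost:5000")
```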

Signed-off-by: Ben Wilson <[email protected]>
@BenWilson2 BenWilson2 left a comment
Member

This is fantastic. Thanks for the contributions! It will go live with our 2.16.0 release (approximately 2.5 weeks from now).

@BenWilson2 BenWilson2 merged commit 24c751c into mlflow:main Aug 16, 2024
3 checks passed