Pyfunc in Practice: Creative Applications of MLflow Pyfunc in Machine Learning Projects #86
strategy_df (pd.DataFrame): DataFrame containing strategy information.
db_path (str): Path to the DuckDB database where strategies will be stored.
"""
try:
Which part of this is expected to raise an error? Can we isolate the handling to the execute statements and keep the cursor cleanup in the `finally` block as it is shown?
(If the `pd.concat` throws, we'll get an odd Exception to deal with when `con.close()` is called on a reference that doesn't exist, even after the Exception is eaten.)
Thank you for the suggestion @BenWilson2, it makes total sense. I made the following changes:
- Error Handling for Concatenation: Added a dedicated `try-except` block around `pd.concat()` to handle errors that might occur when concatenating the DataFrames.
- Context Manager for Database Connection: Switched the approach to use a context manager (`with` statement) for the DuckDB connection. This way it automatically handles the connection cleanup, ensuring it's properly closed even if an error occurs during database operations.
- Early Exit on Concatenation Failure: If concatenation fails, we exit early from the method using `return`. This avoids potential problems with database operations if the DataFrame isn't concatenated successfully.
self.models = {
"random_forest": (
RandomForestRegressor(random_state=42),
{"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
It might be nice to leave a note for readers (without showing the code, for complexity's sake) on how ensemble models might go through a process of hyperparameter tuning and how complex that can be. Perhaps mention that brute-force, full-scope tuning can be done in isolation in order to reduce the time complexity of searching all model permutations within a single ensemble trainer (i.e., I tune my RF regressor and xgboost models with a wide search space and 2000 iterations, then analyze the best range results from optimal validation metric scores, narrowing my ensemble search space. I do this with all models, then run a Bayesian optimizer over search spaces for all models in a constrained space to determine the optima for each discrete model as they are considered parts of a whole).
While grid search is definitely simpler to reason about, it's not optimal, and most users will be wondering "how do I do this with optuna?".
Added a note to clarify that Grid Search is used just for demonstration purposes and can be computationally costly and time-consuming. Also, I’ve highlighted some advanced optimization techniques that tools like Optuna and Hyperopt use, which can efficiently optimize hyperparameters.
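The "search wide, then narrow" workflow described in this thread can be illustrated with a small dependency-free sketch. It is a toy under stated assumptions: `validation_loss` is a made-up stand-in for a cross-validated metric, the parameter names and ranges are hypothetical, and the second stage uses plain random search where the reviewer describes a Bayesian optimizer such as Optuna.

```python
import random


def validation_loss(max_depth, learning_rate):
    """Toy stand-in for a cross-validated loss (not a real model score)."""
    return (max_depth - 8) ** 2 / 100 + (learning_rate - 0.1) ** 2 * 50


def random_search(space, n_trials, rng):
    """Sample n_trials configurations from `space`; return them sorted by loss."""
    trials = []
    for _ in range(n_trials):
        depth = rng.randint(*space["max_depth"])
        lr = rng.uniform(*space["learning_rate"])
        trials.append((validation_loss(depth, lr), depth, lr))
    return sorted(trials)


def narrow_space(trials, top_k):
    """Shrink each range to the span covered by the top_k best trials."""
    best = trials[:top_k]
    depths = [depth for _, depth, _ in best]
    rates = [lr for _, _, lr in best]
    return {
        "max_depth": (min(depths), max(depths)),
        "learning_rate": (min(rates), max(rates)),
    }


rng = random.Random(0)

# Stage 1: wide, isolated search for a single model.
wide = {"max_depth": (2, 30), "learning_rate": (0.001, 1.0)}
stage_one = random_search(wide, 200, rng)

# Stage 2: search again inside the range where the best trials landed.
narrow = narrow_space(stage_one, 10)
stage_two = random_search(narrow, 50, rng)

best_loss, best_depth, best_lr = min(stage_one[0], stage_two[0])
```

In an ensemble setting, the narrowed ranges from each model would become the constrained joint search space for the final optimization pass.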
### Adding Strategies and Saving to the Database

The custom-defined `add_strategy_and_save_to_db` method enables the addition of new ensemble strategies to the model and their storage in a DuckDB database. This method accepts a pandas DataFrame containing the strategies and the database path as inputs. It appends the new strategies to the existing ones and saves them in the database specified during the initialization of the ensemble model. This method facilitates the management of various ensemble strategies and ensures their persistent storage for future use.
Could the import block for all of the runnable code in this blog be added here to help readers execute the code examples contained herein?
I was a bit uncertain about including imports for each block since the focus was more on the definition and inner workings of each method, with the Bringing It All Together section covering the execution part. However, it makes total sense to include the necessary dependencies for each block to help users run them (especially since the Bringing It All Together section is also undergoing some changes).
And again, I have some doubts about including all imports in the first method because it would clutter the block with more imports than the method itself. On the other hand, including specific imports in each section means some repetition across blocks, but it should make it easier for readers to execute the examples. Let me know what you think about this approach, and if you have any other suggestions.
As already highlighted in the previous method, a key feature of MLflow PyFunc models is the ability to define custom methods, providing significant flexibility and customization for various tasks. In the multi-model PyFunc setup, the `fit` method is essential for customizing and optimizing multiple sub-models. It manages the training and fine-tuning of algorithms such as `RandomForestRegressor`, `XGBRegressor`, `DecisionTreeRegressor`, `GradientBoostingRegressor`, and `AdaBoostRegressor`. In addition, by utilizing techniques like GridSearchCV, the `fit` method ensures that each model variant is optimized for maximum performance.
```python |
Can we add a note that this is a method of the subclass of PythonModel that will be used for serialization and usage, and not a general function? Perhaps mention that to follow along with making a usable ensemble model, show earlier that a skeleton will need to be created with comments within the body of the class for where these methods will go? (Just thinking about a "I want to follow along here and build this so I can try it out" approach)
I really liked the idea of including a skeleton for clarity. I've added a skeleton of the Python class to illustrate the structure and show where each method will be placed. This should help in understanding how each part fits together before we dive into the details of each method and also that each one is a method of the subclass of PythonModel.
# Extract the strategy and drop it from the input features
print(f"Strategy: {model_input['strategy'].iloc[0]}")
strategy = model_input["strategy"].iloc[0]
model_input = model_input.drop(columns=["strategy"])
Suggested change:
- model_input = model_input.drop(columns=["strategy"])
+ model_input.drop(columns=["strategy"], inplace=True)
Thank you for pointing that out, updated it.
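For readers comparing the two forms in this suggestion, the sketch below shows the observable difference: reassigning the result of `drop` leaves the caller's DataFrame untouched, while `inplace=True` modifies the object that was passed in. The function names and sample data are hypothetical.

```python
import pandas as pd


def split_with_copy(model_input):
    """Original form: drop() returns a new frame; the argument is not changed."""
    strategy = model_input["strategy"].iloc[0]
    features = model_input.drop(columns=["strategy"])
    return strategy, features


def split_in_place(model_input):
    """Suggested form: the column is removed from the caller's frame as well."""
    strategy = model_input["strategy"].iloc[0]
    model_input.drop(columns=["strategy"], inplace=True)
    return strategy, model_input


frame = pd.DataFrame({"feature": [1.0, 2.0], "strategy": ["s1", "s1"]})

split_with_copy(frame)
columns_after_copy = list(frame.columns)

split_in_place(frame)
columns_after_in_place = list(frame.columns)
```

Which form is preferable depends on whether the caller reuses the input frame after calling `predict`.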
self, strategy_df: pd.DataFrame, db_path: str
) -> None:
"""
Adds new strategies from a DataFrame to the DuckDB database.
Can an example of the strategy DF be supplied as part of the blog so that readers can run this?
Makes total sense; added an example of the strategy DataFrame, including an overview of the schema.
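The actual schema added to the blog post is not shown in this conversation, so the following is only a guess at what a small strategy DataFrame could look like. The `strategy` column name is taken from the `predict` snippet in this review; `model_name`, `weight`, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical schema: one row per (strategy, sub-model) pair.
strategy_data = {
    "strategy": ["average_1", "average_1"],   # strategy identifier
    "model_name": ["random_forest", "xgboost"],  # assumed key into self.models
    "weight": [0.6, 0.4],                     # assumed contribution of each model
}
strategies_df = pd.DataFrame(strategy_data)

# Sanity check under this assumed schema: weights per strategy sum to 1.
weight_totals = strategies_df.groupby("strategy")["weight"].sum()
```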
Having explored each method in detail, the next step is to integrate them to observe the complete implementation in action. This will offer a comprehensive view of how the components interact to achieve the objectives of the project.
```python |
For this part, we could forego the full example script and instead do a "build your own" with placeholders for each of the methods defined above. Users can then assemble the class in their own environment and run it to see the results.
That’s a really interesting idea! Since we’ve added the class skeleton, it’s now easier for users to integrate the methods and customize them. I’ve removed the full example script and instead included a note guiding users to assemble the `EnsembleModel` class using the provided skeleton and add their own logic as needed (they can even expand the functionalities).
joblib.dump(ensemble_model.preprocessor, "models/preprocessor.pkl")

# Define strategies for the ensemble model
strategy_data = {
I left a comment above about having a sample of this. It might be helpful just to show a single row strategy above that explains the required schema to aid the mental model while reading through things above.
As mentioned above, added an example of the strategy DataFrame, including an overview of the schema. Thank you.
# Add strategies to the database
ensemble_model.add_strategy_and_save_to_db(strategies_df, "models/strategies.db")

# Define the Conda environment configuration for the MLflow model
Is this being done purely for the requirements? Is defining the `requirements` argument in the `log_model` call not sufficient?
You're totally right; defining the Conda environment as a dictionary directly in the `log_model` call is more efficient, and there's no need to use the file option. Updated.
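As a rough sketch of the dictionary approach mentioned here: `mlflow.pyfunc.log_model` accepts a `conda_env` argument that can be a dictionary instead of a path to a YAML file. The environment name, Python version, package list, and artifact path below are assumptions, and the logging call is left as a comment so the snippet runs without a tracking server.

```python
# Hypothetical environment definition; pin versions to match your own setup.
conda_env = {
    "name": "ensemble_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.10",
        "pip",
        {"pip": ["mlflow", "scikit-learn", "xgboost", "duckdb", "pandas", "joblib"]},
    ],
}

# The pip requirements live in the single dict entry of the dependency list.
pip_section = next(dep for dep in conda_env["dependencies"] if isinstance(dep, dict))

# With an EnsembleModel instance available, the dictionary is passed directly:
# mlflow.pyfunc.log_model(
#     artifact_path="ensemble_model",
#     python_model=ensemble_model,
#     conda_env=conda_env,
# )
```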
Added some relatively minor feedback. This looks great, is very well written, and teaches a lot of complex topics in a very understandable way. Excellent work here! Let me know when the responses have been submitted and we'll get this published! :)
Thank you for the great feedback @BenWilson2! Really happy to contribute and hope the blog post will be a valuable resource for users. I’ve answered all the comments, and if you have any further suggestions, please let me know :)
The latest changes focus on enhancing consistency in terms throughout the codebase. Additionally, I noticed that some imports were missing and have included them. I also added a note and an "empty" line of code (properly documented) about configuring MLflow to point to the tracking server when logging the ensemble model, which was missing.
Signed-off-by: Ben Wilson <[email protected]>
This is fantastic. Thanks for the contributions! It will go live with our 2.16.0 release (approximately 2.5 weeks from now).
Description
This blog post demonstrates the capabilities of MLflow PyFunc and how it can be utilized in a Machine Learning project. `mlflow.pyfunc` offers creative freedom and flexibility. When utilized correctly, teams can build complex systems encapsulated as a model in MLflow that follows the same model lifecycle as traditional ones. This blog showcases how to create multimodal setups, seamlessly connect to databases, and implement a custom fit method using `mlflow.pyfunc`.
Additions
- Created a folder for the blog content containing `index.md` with the content of the blog.
- Added all authors' information and thumbnails to the correct locations.
Additional Information
Related to: #78