Add a function to predict a value from a csv file. #147
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
- Coverage   87.69%   80.52%   -7.17%
==========================================
  Files           6        7       +1
  Lines        1706     1920     +214
==========================================
+ Hits         1496     1546      +50
- Misses        210      374     +164
Hi, could you please explain your use case here a little more? What are you using these predictions for?
This needs some unit tests; otherwise we don't know if it's breaking something.
I have an automation in HA that stores some daily data in a CSV file:
alias: "Heating csv"
id: 157b1d57-73d9-4f39-82c6-13ce0cf4288a
trigger:
  - platform: time
    at: "23:59:32"
action:
  - service: notify.prediction
    data:
      message: >
        {% set dd = states('sensor.degree_day_daily') |float %}
        {% set inside = states('sensor.gemiddelde_dagtemperatuur_binnen') |float %}
        {% set outside = states('sensor.gemiddelde_dagtemperatuur_buiten') |float %}
        {% set hour = states('sensor.branduren_warmtepomp_vandaag') |float | round(2) %}
        {% set kwhdd = states('sensor.kwh_per_degree_day_daily') |float %}
        {% set hourdd = states('sensor.uur_per_degree_day_daily') |float | round(2) %}
        {% set solar_total = states('sensor.opbrengst_kwh') |float %}
        {% set solar_total_yesterday = states('sensor.solar_csv_2') |float %}
        {% set solar = (states('sensor.opbrengst_kwh') |float - solar_total_yesterday) | round(3) %}
        {% set verwarming_total = states('sensor.warmtepomp_kwh') |float %}
        {% set verwarming_total_yesterday = states('sensor.verwarming_csv') |float %}
        {% set verwarming = (states('sensor.warmtepomp_kwh') |float - verwarming_total_yesterday) | round(3) %}
        {% set verbruik_total = states('sensor.verbruik_kwh') |float %}
        {% set verbruik_total_yesterday = states('sensor.verbruik_csv') |float %}
        {% set verbruik = (states('sensor.verbruik_kwh') |float - verbruik_total_yesterday) | round(3) %}
        {% set verbruik_zonder_verwarming = (verbruik - verwarming) | round(3) %}
        {% set time = now() %}
        {{time}},{{dd}},{{solar}},{{verbruik_zonder_verwarming}},{{hourdd}},{{inside}},{{outside}},{{hour}},{{kwhdd}},{{solar_total}},{{verwarming_total}},{{verwarming}},{{verbruik_total}},{{verbruik}}
Here I'm trying to collect as much data as I can. I know the solar production for the next day (Solcast), and I can calculate the degree days for the next day (based on temperature predictions).
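For readers unfamiliar with degree days, the calculation mentioned above can be sketched as follows. This is only an illustration of the common heating-degree-day definition; the base temperature of 18 °C, the function name, and the forecast value are assumptions, and the actual `sensor.degree_day_daily` may be computed differently.

```python
def heating_degree_days(mean_outside_temp_c, base_temp_c=18.0):
    """Heating degree days for a single day.

    base_temp_c is an assumed reference temperature, not a value from this PR.
    """
    # No heating demand is counted when the day is warmer than the base.
    return max(0.0, base_temp_c - mean_outside_temp_c)


# Hypothetical forecast of tomorrow's mean outside temperature.
forecast_mean_temp_c = 4.5
dd_tomorrow = heating_degree_days(forecast_mean_temp_c)
print("Degree days for tomorrow:", dd_tomorrow)
```

The resulting value is the kind of input that could be passed, together with the solar forecast, to the prediction.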
OK, I understand better now. You have two regressors to train: a first regressor that outputs your degree days from an available forecast of your local temperature, and a second regressor that uses these degree days, along with your available solar forecast from Solcast, to output the number of heating hours needed for the next day. Is that it?
Please consider adding a unittest in the
Like I said, I would have made this more generic. Your use case is with CSV files (which I personally like), but that may not be the case for most people. I think this should support other types of data input, like directly specifying the names of sensors in Home Assistant and then retrieving the data directly, as we do for the energy optimization.
src/emhass/csv_predictor.py
Outdated
I may have overlooked it, but is there no fit method in this class?
OK, I see it is done inside the predict method -> create a separate fit method.
src/emhass/csv_predictor.py
Outdated
"""
X = data[self.independent_variables].values
y = data[self.dependent_variable].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We could add cross-validation here. You are not cross-validating your model, so it may be prone to overfitting on that training set.
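As a rough illustration of this point, a single train/test split can be replaced with k-fold cross-validation using scikit-learn's `cross_val_score`. The data below is synthetic and all variable names are placeholders, not the PR's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the CSV data: two features and one target (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: every fold is held out once and used for scoring,
# so the estimate does not depend on one particular train/test split.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()

print("R2 per fold:", scores)
print("Mean R2:", mean_r2)
```

A large spread between the per-fold scores would be a hint that the model is sensitive to which rows it was trained on.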
Can you point me to an example for this?
You should use cross validation and model selection. It is very well explained here: https://scikit-learn.org/stable/modules/cross_validation.html
Concretely, use a hyper-parameter tuning method such as GridSearchCV.
The best approach is to use a pipeline.
Example code:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Create a pipeline with a standard scaler and a linear regression model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Define the parameters to tune.
# LinearRegression has no `alpha` parameter (that belongs to Ridge/Lasso),
# so only parameters it actually accepts are listed here.
param_grid = {
    'regressor__fit_intercept': [True, False],
    'regressor__positive': [True, False]
}

# Create a grid search object
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search object to the data (X and y as built from the CSV data)
grid_search.fit(X, y)

# Print the best parameters and the corresponding score
print('Best parameters:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)
The grid_search object also contains the best model: best_model = grid_search.best_estimator_
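A self-contained variant of the grid search, run on synthetic data, may help show how `best_estimator_` can then be used for prediction. Note the assumptions: `Ridge` is swapped in as the regressor because `alpha` is a Ridge/Lasso parameter rather than a `LinearRegression` one, and the data and new rows are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three features, linear target with intercept 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])
param_grid = {"regressor__alpha": [0.1, 0.5, 1.0]}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# With the default refit=True, best_estimator_ is refit on all of X and y.
best_model = grid_search.best_estimator_

# Predict for two hypothetical new rows of feature values.
new_rows = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
predictions = best_model.predict(new_rows)
print(grid_search.best_params_, predictions)
```

Because the scaler is part of the pipeline, new rows are scaled with the statistics learned during fitting, so raw feature values can be passed straight to `predict`.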
src/emhass/csv_predictor.py
Outdated
# Fit and time it
self.logger.info("Predict through a "+self.sklearn_model+" model")
start_time = time.time()
self.forecaster.fit(X, y)
Create a separate method for this, a fit method.
Do you mean I have to create a def for the fit method?
Does that def also need a command_line def?
> Do you mean I have to create a def for the fit method?
> Does that def also need a command_line def?
Yes and yes. See the example code above for a complete fit method.
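To make the requested split concrete, here is a minimal sketch of a class where training and inference live in separate methods. The class name `RegressorSketch` and its attributes are hypothetical and do not reflect the actual code in `csv_predictor.py`; the matching command-line function is not shown.

```python
import logging
import time

import numpy as np
from sklearn.linear_model import LinearRegression


class RegressorSketch:
    """Hypothetical class: fit() trains the model, predict() only uses it."""

    def __init__(self, logger=None):
        self.logger = logger or logging.getLogger(__name__)
        self.forecaster = None

    def fit(self, X, y):
        # Fit and time it, then keep the trained model on the instance.
        start_time = time.time()
        self.forecaster = LinearRegression().fit(X, y)
        self.logger.info("Elapsed time for model fit: %.3f s", time.time() - start_time)
        return self

    def predict(self, new_values):
        # Refuse to predict when no model has been trained yet.
        if self.forecaster is None:
            raise RuntimeError("Call fit() before predict()")
        return self.forecaster.predict(np.atleast_2d(new_values))


# Tiny made-up data set where y = 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = RegressorSketch().fit(X, y)
prediction = model.predict([5.0])
print(prediction)
```

With this layout the model can be trained once and reused for many predictions, instead of being refit on every call.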
I will hopefully work on this again next week...
Of course, keep this up, it is a very nice new feature.
I do not have any experience with unittest; I will try to figure it out.
I'm also having an issue with
How can I solve this?
Follow the same procedure as this: https://github.com/davidusb-geek/emhass/blob/master/tests/test_machine_learning_forecaster.py |
What's your dev environment?
I re-open in the devcontainer (VS Code) and then
You may need to rebase your branch onto the latest code from master.
I have seen it. |
I'm not using codespaces.
Hope it helps.
EDIT: I've actually just tested this same procedure inside Codespaces and it works perfectly.
If you like to try out another alternative, have a look at this: #182 |
@davidusb-geek |
Good job on keeping your pull request up to date. |
These are the new REST commands
If you have a column in your CSV file that contains a timestamp, you can pass that column name.
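As an illustration of what handling a timestamp column can look like, the sketch below reads a small CSV, parses the named column as a datetime, and treats the remaining columns as numeric values. The column names, the sample rows, and the `timestamp_column` variable are assumptions; how the PR itself processes the column is not shown here.

```python
import csv
import io
from datetime import datetime

# Made-up CSV content in the style of the file written by the automation.
csv_text = """timestamp,degreeday,solar,hour
2024-01-01 23:59:32.000000+01:00,12.5,1.2,6.5
2024-01-02 23:59:32.000000+01:00,10.1,3.4,4.25
"""

# Hypothetical name of the column that holds the timestamp.
timestamp_column = "timestamp"

rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    # Take the timestamp out of the record and parse it separately.
    stamp = datetime.fromisoformat(record.pop(timestamp_column))
    # Everything that is left is treated as a numeric column.
    values = {name: float(value) for name, value in record.items()}
    rows.append((stamp, values))

for stamp, values in rows:
    print(stamp.date(), values)
```

Separating the timestamp this way keeps it out of the numeric feature columns while still allowing rows to be ordered or filtered by date.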
@davidusb-geek |
Hi, yes of course, sorry. It is a nice feature. Here are some comments.
This class should work for multiple types of input data, not only CSV files. The main workflow should be retrieving the data directly from HA using the same methods that emhass uses for the optimization (we can do this after merging your code, with some later refactoring). Then the name of the class can be changed to something like
Then there are the models. I only see linear regression, but now that you have put together a pipeline you can go ahead and add a list of different ML models with their parameters and try to find the best one. You can add lasso, random forest, etc.
The docstring of the main class is confusing in its example for the dependent variable: hours? Also, it is typical in data science to call the dependent variable the target and the independent variables the features.
We may make use of the more efficient Bayesian optimization already available within emhass to optimize the hyperparameters. But we can look at this later; GridSearchCV is a very good start.
Here is a code snippet from ChatGPT for multiple models, so it needs testing ;-). Store the results and pick the model with the lowest error:
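The snippet from this comment is not available in this thread. As a stand-in, the idea of trying several models and keeping the one with the lowest error can be sketched as below. The candidate models, parameter grids, and synthetic data are assumptions chosen for illustration and, like the original, would need testing against real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (assumption), standing in for the real data.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=80)

# Each candidate is a model plus the grid of parameters to try for it.
candidates = {
    "LinearRegression": (LinearRegression(), {"regressor__fit_intercept": [True, False]}),
    "Lasso": (Lasso(), {"regressor__alpha": [0.01, 0.1, 1.0]}),
    "RandomForestRegressor": (
        RandomForestRegressor(random_state=0),
        {"regressor__n_estimators": [10, 50]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("regressor", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X, y)
    # Store the best cross-validated score and fitted pipeline per model.
    results[name] = (search.best_score_, search.best_estimator_)

# Scores are negated MSE, so the highest score means the lowest error.
best_name = max(results, key=lambda n: results[n][0])
best_model = results[best_name][1]
print("Best model:", best_name, "score:", results[best_name][0])
```

Keeping every result in a dictionary also makes it easy to log how far apart the candidates were, not just which one won.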
Hi @gieljnssns. |
Hi, I merged #247 yesterday and unit tests are passing correctly, so everything looks good to me. |
How can I fix the CodeQL error? |
Yes these have been hanging for some time now. We need to fix them. They come from using the |
@davidusb-geek |
What is ruff? I propose a solution otherwise, change those
Needs testing. |
https://github.com/astral-sh/ruff This is also the default formatter in Home Assistant |
Hey @gieljnssns, your PR's file changes seem a little weird (showing the file changes from all the commits merged into master). I experienced this myself yesterday (just tested the result with #259). Could you see if this changes anything:
Then a git push. Feel free to do this on a test bed first to make sure you don't mess anything up.
I was about to comment on this myself when I saw the number of files changed = 49! |
Got it, yes open to anything that will make this better. |
I think this already happened. |
Sorry that's because it's GitHub ssh and not https
To
It might be a new GitHub glitch. 🤷♂️ |
I think the best I can do is close this PR and open it again?
Having a remote that points at the origin repository is a good way (in my uneducated opinion) to fetch and pull in the latest commits to merge or make a new branch. Feel free to use something like this in the future 😁. VS Code has GUI ways to do this, I believe.
The story behind this pull request.
I keep a CSV file in which I store data from which I want to predict the number of heating hours.
I first tried to do this via a custom_component for home-assistant, but apparently it is not possible to install scikit-learn.
Since the result of my prediction is to be used in emhass and the necessary dependencies are already installed in emhass, I decided to go this way.
This pull request contains a new method csv-predict with new parameters; here is an example of a rest command in home-assistant
If you are open to accepting this pull request, I will also take the time to write some documentation. And if necessary, I would also like to try writing some tests.
Here is also a used CSV file.
prediction.csv