Finally compare dataset characteristics against each other #28

shankari · 2021-07-23T13:48:44Z

Implement and evaluate a simple trip clustering algorithm (only start and end) based on DBSCAN clustering of the start and end locations
Load data from multiple datasets and save into JSON to make it easier for others to work with it
Load data from JSON and compare dataset characteristics against each other

- Since we concatenate multiple dataframes, the indices are all over the map. Let's reset the index before saving so we have a unified view. - We have JSON objects in our dataframe (`start_loc`, `end_loc`) so serializing as a csv doesn't really work. We read the objects back as strings, and the JSON library can't parse them because it is not using the correct quotes. - We save it directly as a JSON, using the default BSON handler to correctly serialize the object.

We are already using DBSCAN for the start and end location clustering, we can pretty easily find trips by matching on the start and end locations. Here, we explore two alternatives for the trip matching - add the distance matrices and recluster, or group by the (start, end) cluster label pairs and group. We find that the second method is correct. We also spot check both methods and find that the location clustering can also have some minor issues sometimes, but 2/3 spot checks worked well.

We are able to compare multiple datasets against each other by loading the labeled trips, clustering them and comparing the number of trips and how they are clustered against each other. The NREL dataset has the most compact clusters, the mini-pilot has the least, and the staging dataset has the most variability that spans both highly compat and very uncompact clusters.

@corinne-hcr

Primarily to understand why it is worse than the DBSCAN trip clustering code. As you may recall, the results for DBSCAN were pretty respectable. Most trips were in a cluster, and the median cluster : trip-in-cluster ratio was below 40% for all datasets. e-mission#28 (comment) Why is similarity so much worse than the DBSCAN-based clustering? Should we switch to DBSCAN instead (horrifying proposition given the tight deadline)? See the notebook conclusion to find out! @corinne-hcr, this is more on the lines of what I expected you to do back when you were evaluating the first round/common trips, and definitely when we were getting poor results with the clustering. + change the original DBSCAN notebook to have the modified viz code

shankari · 2021-07-23T23:19:32Z

@corinne-hcr @GabrielKS

Initial results with DBSCAN based trip clustering (c7e8205)

Most labeled trips fit into at least one cluster

The cluster:trip-in-cluster ratio (a rough estimate of the tightness of the cluster) is < 50% for most users.

So we can guess that most users should be prompted ~ < 50% of the time

Median ratio is consistently below 40%

Definitely, the minipilot data was harder to work with, but its median is also below 40%. As expected NREL is best, but staging is also respectable. I don't see any reason why we should have been able to build only models for only three users on staging.

shankari · 2021-07-23T23:27:05Z

Comparison between DBSCAN and similarity for one user is complete:
18144ef

Next step is to incorporate this back into the generalization across datasets and see if the results are generally relevant.

The database code has currently been optimized for a 1:1 mapping between the server and the database, where the server needs to make multiple intensive queries to the database, for example, to read trajectory information from the database for multiple users at a time, for the post-processing. This has led to design decisions in which we cache the database connection and re-use it, to avoid overloading the server with multiple queries. We are now getting some requests for federated data. A concrete example is e-mission/e-mission-eval-private-data@952c476 where we wanted to compare the characteristics of multiple datasets (e-mission/e-mission-eval-private-data#28) An upcoming example would be to "roll up" multiple dashboard deployments (e.g. from individual cities) into a program level dashboard. We anticipate that these will be low, volume, intermittent accesses to generate analyses and metrics. The long-term fix is probably to create a FederatedTimeseries (similar to the AggregateTimeseries that merges data across users). But for now, we just implement a hack to reset the connection and reconnect it to a different URL. This means that we cannot access all databases in parallel, we will need to access them serially. But for the current use case, that is sufficient since we can concatenate all the data and work with it later.

shankari · 2021-07-24T14:29:18Z

@corinne-hcr with corinne-hcr/e-mission-server@7a75990, corinne-hcr/e-mission-server@58a14a8 and
corinne-hcr/e-mission-server@d81d25b

all the notebooks in this repo are runnable.

We need to make sure to call `filter_trips` to reproduce the original results now that corinne-hcr/e-mission-server@b46a370 is committed

shankari · 2021-07-25T05:16:50Z

@corinne-hcr Tradeoffs for various combinations of similarity parameters and radii for the mini-pilot are done:
abf4f78

Scatter plot of tradeoffs

Box plot of tradeoffs

IMHO, this shows the differences more clearly than the scatter plot.

Top: request_pct, Bottom: homogeneity score
L-R: 100m, 300m, 500m

I'm going to integrate this into the dataset comparison before poking around with this some more.

Other analyses I would like to do are:

Look at the request % for unlabeled trips. We can't get the homogeneity score since the unlabeled trips are unlabeled, but if I can figure out how to call the prediction code, we should be able to estimate the confidence of the prediction. I will almost certainly do this.
Look at the stats for clusters that have two trips. I will almost certainly not do this right now.
Figure out how to represent the regularity of patterns. I was going to do something with distances between points, maybe of the same purpose, and the number of unique tuples. I still need to figure this out, so I will almost certainly not do this now.

@corinne-hcr

@corinne-hcr more examples of what the initial analysis might have looked like. Build similarity models for: - 100, 300m, 500m - all combinations of filtering (yes/no) and cutoffs (yes/no) Generate labels for all labeled trips Determine ground truth by looking at: unique tuples and unique values for each of the user inputs Use these models to compute the metrics (homogeneity score and request %) for all combinations, along with a few other metrics like the number of unique tuples, cluster_trip_pct, etc. At this point, we are focusing on ground truth from tuples since the homogeneity score is already fairly high. What we really need to do is to bring down the request %, or determine *why* the user % is so high so that we can fix it (e.g. polygon). Some results in: e-mission#28 (comment)

So we can easily change the result creation interactively without rebuilding the models, which takes a lot of time.

shankari · 2021-07-25T07:34:07Z

Generalization results:

Scatter plot (left: DBSCAN, right: our sim)

Box plots for the componets (left: DBSCAN, right: our sim), top: request %, bottom: homogeneity score

However, the cluster_trip_pct that I had defined earlier still shows a significant difference w. Need to understand it better and figure out why it is different and why it doesn't capture the true metric.

I will briefly attempt to figure that out tomorrow morning, but based on these results, we can stick to a single level of "clustering", use a 500m radius, and don't filter or delete bins. I will attempt to make the changes to the server code tomorrow, but if @corinne-hcr has finished her work, maybe that is the next task for her to tackle.

shankari · 2021-07-25T16:20:05Z

Some more multi-dataset results, including an exploration on the number of trips required to be meaningful, and an explanation of the cluster trip ratio v/s request pct discrepancy.

Results from: 31733c1

Visualize various counts: top to bottom, minipilot, nrel-lh, stage

h-score v/s request pct scatter plot (l: DBSCAN, r: oursim, same as yesterday)

log plot of number of trips v/s request pct

Box plots that illustrate the result

@corinne-hcr

Major changes were to integrate the functions from the "unrolled" exploration code into this notebook so that we could generalize the "oursim" results across multiple datasets as well. It also appeared that we were building all the models **twice** for reasons that I don't understand; unified it to build the models once. Also pulled out the summary statistics code into a separate cell, so we can experiment with it without having to re-run the models, which take significant time. Added several visualizations to visualize the findings, which are: - oursim and DBSCAN have similar request pct - oursim has higher homogeniety score than DBSCAN - we have lower noise and tighter models if we have 100+ labeled trips Visualizations at: e-mission#28 (comment) @corinne-hcr, at this point, I'm going to move on to actually implementing this in the pipeline so we can get some results before the next meeting on Tuesday.

shankari · 2021-07-25T17:03:50Z

I was surprised that the homogeneity score of DBSCAN was so low, and then I realized that I was computing it incorrectly.

Basically, I was just passing in the labels from DBSCAN as the predicted labels, but all the noisy trips have the same label (-1), instead separate labels, one for each noisy trip. This is likely the reason why the scores are lower.

For example, consider the case in which we have two clusters of length 2 each, and 4 single trip clusters.

If all the single trip clusters are labeled with -1 for noise, we will end up with

>>> sm.homogeneity_score([1,1,2,2,3,4,5,6], [0,0,1,1,-1,-1,-1,-1])
0.5999999999999999

because it looks like the -1 predicted cluster actually munges entries from 4 different ground truth clusters.
If we replace them with unique cluster labels, we get a perfect score, as expected.

>>> sm.homogeneity_score([1,1,2,2,3,4,5,6], [0,0,1,1,2,3,4,5])
1.0

I am almost certainly not going to use DBSCAN in the integrated pipeline, and that is my current priority, so I do not plan to fix this now. But if @corinne-hcr wants to write a paper, maybe she can fix it here?

shankari · 2021-07-26T14:32:51Z

@corinne-hcr to understand this PR, and to compare it with your prior work, I would use the following order:

deep drive into the similarity code for one user: 18144ef
generalize to all users in on dataset (minipilot), this should be similar to your original exploration of the similarity code: abf4f78
generalize to all users in all datasets: 31733c1

You will not be able to absorb the results just by looking at the code. You need to check out this branch and actually run the notebooks in the commits above. The notebooks have inline explanations (they are "notebooks" after all) of what I'm trying to understand and what the graphs mean. Most of them are "unrolled" so that I first try one option and then another, so that you can see the evolution of the analysis.

For the last, multi-dataset notebook, you will need the combined dataset JSON file. I have shared that with you through OneDrive.

Please let me know if you have any high-level questions.

corinne-hcr · 2021-07-26T15:15:52Z

For the box plots which titled num labeled trip, what do the y-axes represent? Are they request pct?

shankari · 2021-07-26T15:38:29Z

you should be able to see this from the code - e.g. https://github.com/e-mission/e-mission-eval-private-data/pull/28/files#diff-5b27f01eda7481b2844df59e55fffa030ca22df0c58f9796b8d1bb7ad13b1089R679

Again, You need to check out this branch and actually run the notebooks in the commits above. The notebooks have inline explanations (they are "notebooks" after all) of what I'm trying to understand and what the graphs mean. There are additional plots in the notebooks.

The plots are not designed to be published without modification - they were primarily designed for me to understand what was going on so I could figure out how to modify the evaluation pipeline. If you choose to use any of them, you will need to ensure that all the labels are in place and the font sizes are appropriate.

shankari · 2021-07-26T16:56:53Z

The nrel-lh dataset is from 4 NREL employees who voluntarily collected and labeled their data.
The staging dataset is from ~ 30 program staff for the CEO full pilot who helped with testing the platform before deployment.
We have/should have demographic information for all of those datasets as well from the onboarding survey.

shankari · 2021-07-27T00:07:41Z

@corinne-hcr from looking at your code (get_first_label and score), it looks like you calculate the homogeneity score only of the trips that are in the bins. So if I have 10 trips with the following bins/clusters [[0,1,2,3],[4,5,6],[7],[8],[9]] before cutoff, and the following bins/clusters [[0,1,2,3],[4,5,6]] after cutoff, I believe the labels_pred will be [[0,0,0,0], [1,1,1]]; is that correct?

But you compute the request percentage taking the full, non-cutoff list into account; you have a request for each of [7],[8],[9].

Can you discuss the reason for that mismatch further? We should standardize on a really clear definition of the metric calculations because otherwise we don't know what the metrics mean!

corinne-hcr · 2021-07-27T00:22:06Z

I actually raised that question at the first term. But at this point, I don't remember your explanation clearly. Let's see if there is some records.

shankari · 2021-07-27T00:46:08Z

So I actually implemented both alternate metrics (below cutoff as single trip clusters, and drop trips below cutoff) e457545

The first result is pretty much identical to no_cutoff, the second is pretty much identical to the old metric

Old implementation

Treat below cutoff as single label clusters

Drop below cutoff

Our current implementation of the h-score was incorrect because all the noisy trips were in the same `-1` bucket. e-mission#28 (comment) Experiment with two alternate handling methods, need to figure out which one to use e-mission#28 (comment)

shankari · 2021-07-27T00:52:39Z

Also please see my discussion around h-score and request pct in the notebook (e457545). Maybe we should compute the request_pct only on a split and not on the whole dataset.

shankari · 2021-07-27T07:52:55Z

@corinne-hcr @GabrielKS, before making the changes to the evaluation pipeline, I did some super hacky analysis to compare the old models with the old settings, and my new, preferred settings. I had to copy over the tour_model directory into tour_model_first_only and other cringeworthy changes so that I could run the two modules side by side.

Here are the results:

The old model used very few trips.

Due to a combination of the trip filtering, user validity checking, training on a split, and filtering the trips below the cutoff, the old model got few/no trips to train the model. This is particularly true for the later entries which are from the newer datasets. Sorry not sorry, I didn't have time to color code by dataset. This is a big deal for the location based clustering algorithms, which essentially take a nearest neighbor approach. If a particular bin is not included in the model, it will never be matched.

The old model predicted very few trips, the new model is better.

Note that with the old model, because we have so few trips in the model, we are not even infer labels for already labeled trips. In contrast, with the new model, at least all existing trips are labeled.

There are still many users for whom the new model does not predict a lot of trips, but it is much much better.

The only users with < 20% trips prediction rate have very few labeled trips.

	program	valid_trip_count_new	unlabeled_predict_pct_old	unlabeled_predict_pct_new
11	minipilot	0	nan	0
18	nrel-lh	6	nan	0
20	nrel-lh	0	nan	0
21	nrel-lh	17	nan	0.145038
35	stage	3	nan	0.0597015
36	stage	6	nan	0.176471
38	stage	0	nan	0
39	stage	28	0	0.117647
40	stage	0	nan	0
41	stage	9	nan	0.0814815
43	stage	0	nan	0
44	stage	0	nan	0
47	stage	1	nan	0
58	stage	48	nan	0.177778
59	stage	0	nan	0
62	stage	0	nan	0

I'm now going to move this code from the notebook into the server code.
Will clean up the notebook code and submit on Wednesday.

This reverts commit e457545.

shankari · 2021-08-10T17:41:46Z

From @corinne-hcr

the order of values in label_true and label_pred matter.

>>from sklearn.metrics import precision_score
>>>y_true = ['car', 'bike', 'walk', 'car', 'bike', 'walk']
>>>y_pred = ['car', 'walk', 'bike', 'car', 'car', 'bike']
>>>precision_score(y_true, y_pred, average=None)
array([0.        , 0.66666667, 0.        ])

>>from sklearn.metrics import precision_score
>>>y_true = [0, 1, 2, 0, 1, 2]
>>>y_pred = [0, 2, 1, 0, 0, 1]
>>>precision_score(y_true, y_pred, average=None)
array([0.66666667, 0.        , 0.        ])

>>from sklearn.metrics import precision_score
>>>y_true = [0, 1, 2, 0, 1, 2]
>>>y_pred = [1, 3, 2 ,1, 1, 2]
>>>precision_score(y_true, y_pred, average=None)
array([0.        , 0.33333333, 1.        , 0.        ])

So, if we are using precision/recall score, we can only use the true tuple, not the numeric labels that assigned by the indices(we would probably have different numeric labels for the same label tuple)

From @shankari

I explicitly said that we should get the labels for each trip as the predicted labels:
I am not sure what you mean by "numeric labels"
if you are talking about cluster labels, I also explicitly said that, for the prediction metrics, we don't care about clusters since that is an implementation detail of the model

shankari · 2021-08-10T17:43:55Z

From @corinne-hcr

Some more questions.

"Note that trips without predictions will not match" Do you mean trips that cannot find a matching cluster will not have user labels? In that case, how to do with labels_pred?
"Presumably we would take the highest probability option as the true label." If we need to assign labels to be labels_true, why do we need to split the data? Are you saying we should take the highest probability labels as the labels_pred? I originally didn't focus much on the prediction part since I just return possible user labels and their p, but not really predict a specific user labels combination for the trip.
You mention 2 metrics for Evaluating prediction quality, I had a same idea as the 2nd one.
Are we using both of them or just one of them?

From @shankari

in that case, labels_pred will be {}. So it won't match labels_true
that was awkwardly phrased. "Presumably we would take the highest probability option as the final predicted label"
I put this in the writeup "predicted_trip_pct with unlabeled trips: n_predicted / n_trips This is not strictly required, since the standard accuracy metrics should also cover trips without predictions (they will not match). However, this is a way to generate a useful metric on a much larger and arguably more relevant dataset since we don't have to split off a separate test set." What part of this is not clear?

shankari · 2021-08-10T17:45:55Z

From @corinne-hcr

>>from sklearn.metrics import precision_score
>>>y_true = ['car', 'bike', 'walk', 'car', 'bike', 'walk']
>>>y_pred = ['{}', 'walk', 'bike', 'car', 'car', 'bike']
>>>precision_score(y_true, y_pred, average=None)
array([0. , 0.5, 0. , 0. ])

If the labels_pred is {}, the score is not correct. It is not giving result as 1/3

From @shankari

presumably in the real world, y_true would also be a dictionary.
you could also try using None instead of {}.
also, why should the result be 1/3?

From @corinne-hcr

tp =1
tp/(tp+fp) isn't it 1/3? 
oh it looks like 1/2
Oh, that seems the result is correct

corinne-hcr · 2021-08-10T19:00:12Z

There are some more questions:

For the evaluating cluster quality part, are we using h-score or not?
The reason I hesitate to use h-score at this point is that our ground truth is not entirely true. Mislabeling, random mode selections can happen in the user labels. Since the actual design is not directly assign the user labels from the model to the new trip, I am not sure if it is meaningful to use h-score now.
For this part, we are using labeled trips to build the model and unlabeled trips to evaluate the clusters. "we do not need to split the data into training and test sets at this point". So, it looks like h-score is not relevant here. So, which part are thinking of using h-score?
Can you define "characterize the noise from the datasets" one more time?
In your notebooks, you already decided we should keep all the noisy trips and treat them as single cluster. So, I don't see the inappropriate point for including them for evaluation. I think having all trips that in the model for evaluation would be good since we indeed using them to predict new trips.

shankari · 2021-08-10T22:12:24Z

Responses:

The issues with mislabeled ground truth are true across the board. While they apply to the h-score, they also apply to the precision and recall metrics. So our options are:

Either we don't include any accuracy metrics, which seems drastic, or we check that the mislabelings are a fairly small proportion of the total number of labels.
For the single user I looked at, the mislabelings were ~ 10% of the labeled trips, which is bad, but doesn't seem to invalidate the whole idea of evaluating the accuracy. If you have a better characterization of invalid trips, we can use it to make a case for ignoring accuracy. But otherwise we should include accuracy metrics, along with a general characterization of the invalid trips using my method.

For which part? The proposal is to use the h-score for the cluster quality. Not sure why we would want to use unlabeled trips to evaluate the clusters. Can you elaborate?
"characterize the noise": we are saying that we have lots of noisy trips. characterizing the noise would involve showing the amount of noise, potentially using a chart
in the final comparison across datasets, I picked the no_filter, no_cutoff option. But there are other notebooks in which I explored those options as well. So I don't agree with the statement "the inappropriate point for including them for evaluation". We have to first evaluate those (as I did in the other notebooks) to make the argument for the final pick.

Also, the notebooks were an aid for me to convince myself about the settings for the final system. That doesn't preclude having a more extensive evaluation, even across all datasets, to put into a report or a peer reviewed publication.

corinne-hcr · 2021-08-10T23:29:18Z

Just summarize the answer, please let me know if I understand it incorrectly:

We can either ignore the mislabeling problem and take all user labels as ground truth or show the proportion of mislabeled trips and explain that the score should be increased up to 10%. (ignoring the mislabeling problem seems easier to deal with)
Since we are not tuning anything on our model, we just use h-score to evaluate the cluster quality on the whole dataset
To explain the reason for using no_filter no_cutoff, first show the comparisons boxplot, then explain that keeping the noise enable us to have more predicted trips.

shankari · 2021-08-10T23:34:10Z

if we discuss mislabeled trips, we could say that the score could be increased by x% (hopefully x is 10, but I don't know exactly what it is until we have generalized the analysis that I did to other users and other datasets)
on the cluster quality on the whole labeled dataset
which "the comparison boxplot"? if you want to talk about prediction quality ("enable us to have more predicted trips"), you need to have a prediction boxplot, and I don't think there is one in the existing notebooks

corinne-hcr · 2021-08-11T04:56:47Z

The comparison boxplot is something like #28 (comment)
But currently the boxplots show the h-score in different situations. I am not sure if we need to use the boxplot for h-score or for request% or for cluster_to_trip_pct.
I think we can use the boxplots that treats trips below cutoff as single trip clusters instead of putting all three situations in the paper. Treating trips below cutoff as single trip cluster meet our current design. Also, discussing different metrics(filter/no filter, cutoff/no cutoff) in one situation is more clear.

shankari · 2021-08-11T05:05:29Z

@corinne-hcr those boxplots are for the h-score, which are used to evaluate cluster quality. My point was that I don't think you can use them to evaluate prediction quality. The graphs to evaluate the prediction quality are #28 (comment) but I don't think I made boxplots for them.

I think we can use the boxplots that treats trips below cutoff as single trip clusters instead of putting all three situations in the paper. Treating trips below cutoff as single trip cluster meet our current design. Also, discussing different metrics(filter/no filter, cutoff/no cutoff) in one situation is more clear.

First, filter/no filter, etc are not different metrics. They are different configurations, and we want to use metrics such as the h-score and the cluster-to-trip-ratio to understand them. Second, can you clarify what you mean by "discussing metrics ... in one situation"? What would be the rough structure of such a discussion? To me, the boxplots (or similar graphs) are the easiest way of comparing the configurations, but I'm open to hearing other concrete suggestions!

corinne-hcr · 2021-08-11T06:55:55Z

Right. The graphs for prediction quality are only those you put in there. The boxplot for prediction is not made yet. I was saying that we had three situations - old implementation, treat below cutoff as single label clusters, drop below cutoff. I think we can just use the one that treats below cutoff as single label clusters. Under this situation, the boxplot shows the h-score or cluster-to-trip-ratio for filter/no filter, cutoff/ no cutoff. Then we can say we determine to use no filter and no cutoff in order to keep more trips.

I just check the way you compute the h-score, I don't think you have na in labels_true for h-score tuple, but you use dropna before calculating the score. And we need to change the cluster labels for trips with -1

Could you check that again in case I made some mistake?

shankari · 2021-08-11T18:08:07Z

I don't think you have na in labels_true for h-score tuple, but you use dropna before calculating the score

Using dropna doesn't hurt anything if we don't have any N/A. I added that because if we build separate models for the individual labels, we can have N/A. I think I poked at it a little bit in that notebook as well
https://github.com/e-mission/e-mission-eval-private-data/pull/28/files#diff-b53e99b317a902a7e95b56e64c541a6a68dafc348bef0ee5111d325dc0617bf1R677

https://github.com/e-mission/e-mission-eval-private-data/pull/28/files#diff-b53e99b317a902a7e95b56e64c541a6a68dafc348bef0ee5111d325dc0617bf1R689

but gave up exploring in detail due to lack of time.

I think we can just use the one that treats below cutoff as single label clusters.

I am not convinced by this because in that case, as I said in the notebook, there is effectively no difference in this metric between a similarity instance that drops trips below cutoff and one that does not. The metric is not meaningful to show anything important about the cluster quality.

shankari · 2021-08-13T17:15:36Z

Please see the results for re-introducing the same mode #28 (comment) and #28 (comment)

The related commit is corinne-hcr/e-mission-server@97921a9

shankari · 2022-06-07T16:26:09Z

Last few commits before we close out this PR:

Compare user mode mapping effect with outputs.ipynb: plots what happens if we change replaced mode = "same mode" to replaced mode = <actual mode>. The effect is not very high, and we have removed the "same mode" option now, so I am not sure how useful this is over the long term
Explore sim usage (common trips -> labeling) unrolled-outputs.ipynb: The existing "Explore sim usage unrolled" notebook with embedded outputs; to give people a sense of what the expected outputs are. Not sure why I have the outputs only for this notebook; can remove if we can reproduce.
Exploring basic datasets for model building validity*.ipynb: comparing the effect of the old (emission/analysis/modelling/tour_model) and new models (emission/analysis/modelling/tour_model_first_only). We have not used the old models in production for a while, but keeping these around for demonstrating the effect, and as a template for evaluating the next round of model improvements.

@hlu109

So that @hlu109 can take over e-mission#28 (comment)

…private-data into tune_clustering_params

And add a README directing people to the README for my evaluation

shankari · 2022-06-07T16:49:57Z

@hlu109 I have committed all the pending changes on my laptop and moved the obsolete analyses out.
I am now merging this.

I would suggest running through my notebooks here to understand the analysis step by step.

@hlu109

So that @hlu109 can take over e-mission#28 (comment)

@corinne-hcr

Primarily to understand why it is worse than the DBSCAN trip clustering code. As you may recall, the results for DBSCAN were pretty respectable. Most trips were in a cluster, and the median cluster : trip-in-cluster ratio was below 40% for all datasets. e-mission#28 (comment) Why is similarity so much worse than the DBSCAN-based clustering? Should we switch to DBSCAN instead (horrifying proposition given the tight deadline)? See the notebook conclusion to find out! @corinne-hcr, this is more on the lines of what I expected you to do back when you were evaluating the first round/common trips, and definitely when we were getting poor results with the clustering. + change the original DBSCAN notebook to have the modified viz code

@corinne-hcr

@corinne-hcr more examples of what the initial analysis might have looked like. Build similarity models for: - 100, 300m, 500m - all combinations of filtering (yes/no) and cutoffs (yes/no) Generate labels for all labeled trips Determine ground truth by looking at: unique tuples and unique values for each of the user inputs Use these models to compute the metrics (homogeneity score and request %) for all combinations, along with a few other metrics like the number of unique tuples, cluster_trip_pct, etc. At this point, we are focusing on ground truth from tuples since the homogeneity score is already fairly high. What we really need to do is to bring down the request %, or determine *why* the user % is so high so that we can fix it (e.g. polygon). Some results in: e-mission#28 (comment)

@corinne-hcr

Major changes were to integrate the functions from the "unrolled" exploration code into this notebook so that we could generalize the "oursim" results across multiple datasets as well. It also appeared that we were building all the models **twice** for reasons that I don't understand; unified it to build the models once. Also pulled out the summary statistics code into a separate cell, so we can experiment with it without having to re-run the models, which take significant time. Added several visualizations to visualize the findings, which are: - oursim and DBSCAN have similar request pct - oursim has higher homogeniety score than DBSCAN - we have lower noise and tighter models if we have 100+ labeled trips Visualizations at: e-mission#28 (comment) @corinne-hcr, at this point, I'm going to move on to actually implementing this in the pipeline so we can get some results before the next meeting on Tuesday.

Our current implementation of the h-score was incorrect because all the noisy trips were in the same `-1` bucket. e-mission#28 (comment) Experiment with two alternate handling methods, need to figure out which one to use e-mission#28 (comment)

…tion h-score calculation: Our current implementation of the h-score was incorrect because all the noisy trips were in the same `-1` bucket. e-mission#28 (comment) Experiment with two alternate handling methods, need to figure out which one to use e-mission#28 (comment) Improvements to h-score and request count calculation are consistent with the changes required for e-mission#28 (comment)

@hlu109

So that @hlu109 can take over e-mission#28 (comment)

@hlu109

So that @hlu109 can take over e-mission#28 (comment)

Finally compare dataset characteristics against each other

shankari added 5 commits July 23, 2021 00:14

Removing the image attachements so we can publish this successfully

414e792

Update the notebook to match the new similarity definition

1190416

We need to make sure to call `filter_trips` to reproduce the original results now that corinne-hcr/e-mission-server@b46a370 is committed

shankari added 2 commits July 24, 2021 22:19

Split the model generation from the result creation

d81a234

So we can easily change the result creation interactively without rebuilding the models, which takes a lot of time.

shankari mentioned this pull request Jul 26, 2021

check unit test code e-mission/e-mission-server#826

Open

shankari mentioned this pull request Jul 27, 2021

Simplified and reconfigured model building and application corinne-hcr/e-mission-server#3

Merged

Revert "Try alternate methods of handling noisy trips in the h-score"

4f86349

This reverts commit e457545.

This was referenced Aug 14, 2021

Trip confidence can be artificially high for single trip clusters e-mission/e-mission-docs#663

Closed

Common trip system building e-mission/e-mission-docs#647

Closed

shankari mentioned this pull request Jun 7, 2022

change relative imports to absolute path e-mission/e-mission-server#857

Closed

shankari added 3 commits June 7, 2022 09:31

Finish checking in all the pending notebooks for the clustering

318ccd9

So that @hlu109 can take over e-mission#28 (comment)

Merge branch 'master' of https://github.com/e-mission/e-mission-eval-…

c1715e3

…private-data into tune_clustering_params

Move all the obsolete analysis out

93590d8

And add a README directing people to the README for my evaluation

shankari merged commit b8368d9 into e-mission:master Jun 7, 2022

shankari added a commit to shankari/e-mission-eval that referenced this pull request Jul 30, 2022

Finish checking in all the pending notebooks for the clustering

13a9831

So that @hlu109 can take over e-mission#28 (comment)

humbleOldSage pushed a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023

Finish checking in all the pending notebooks for the clustering

350171d

So that @hlu109 can take over e-mission#28 (comment)

humbleOldSage pushed a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023

Finish checking in all the pending notebooks for the clustering

cf100a7

So that @hlu109 can take over e-mission#28 (comment)

humbleOldSage pushed a commit to humbleOldSage/e-mission-eval-private-data that referenced this pull request Dec 1, 2023

Merge pull request e-mission#28 from shankari/tune_clustering_params

b54330d

Finally compare dataset characteristics against each other

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finally compare dataset characteristics against each other #28

Finally compare dataset characteristics against each other #28

shankari commented Jul 23, 2021

shankari commented Jul 23, 2021 •

edited

Loading

shankari commented Jul 23, 2021

shankari commented Jul 24, 2021

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 26, 2021

corinne-hcr commented Jul 26, 2021

shankari commented Jul 26, 2021

shankari commented Jul 26, 2021

shankari commented Jul 27, 2021

corinne-hcr commented Jul 27, 2021

shankari commented Jul 27, 2021 •

edited

Loading

shankari commented Jul 27, 2021

shankari commented Jul 27, 2021

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

corinne-hcr commented Aug 10, 2021

shankari commented Aug 10, 2021

corinne-hcr commented Aug 10, 2021

shankari commented Aug 10, 2021 •

edited

Loading

corinne-hcr commented Aug 11, 2021

shankari commented Aug 11, 2021 •

edited

Loading

corinne-hcr commented Aug 11, 2021

shankari commented Aug 11, 2021 •

edited

Loading

shankari commented Aug 13, 2021

shankari commented Jun 7, 2022

shankari commented Jun 7, 2022

Finally compare dataset characteristics against each other #28

Finally compare dataset characteristics against each other #28

Conversation

shankari commented Jul 23, 2021

shankari commented Jul 23, 2021 • edited Loading

Most labeled trips fit into at least one cluster

The cluster:trip-in-cluster ratio (a rough estimate of the tightness of the cluster) is < 50% for most users.

Median ratio is consistently below 40%

shankari commented Jul 23, 2021

shankari commented Jul 24, 2021

shankari commented Jul 25, 2021 • edited Loading

Scatter plot of tradeoffs

Box plot of tradeoffs

shankari commented Jul 25, 2021 • edited Loading

Scatter plot (left: DBSCAN, right: our sim)

Box plots for the componets (left: DBSCAN, right: our sim), top: request %, bottom: homogeneity score

shankari commented Jul 25, 2021 • edited Loading

Visualize various counts: top to bottom, minipilot, nrel-lh, stage

h-score v/s request pct scatter plot (l: DBSCAN, r: oursim, same as yesterday)

log plot of number of trips v/s request pct

Box plots that illustrate the result

shankari commented Jul 25, 2021 • edited Loading

shankari commented Jul 26, 2021

corinne-hcr commented Jul 26, 2021

shankari commented Jul 26, 2021

shankari commented Jul 26, 2021

shankari commented Jul 27, 2021

corinne-hcr commented Jul 27, 2021

shankari commented Jul 27, 2021 • edited Loading

Old implementation

Treat below cutoff as single label clusters

Drop below cutoff

shankari commented Jul 27, 2021

shankari commented Jul 27, 2021

The old model used very few trips.

The old model predicted very few trips, the new model is better.

There are still many users for whom the new model does not predict a lot of trips, but it is much much better.

The only users with < 20% trips prediction rate have very few labeled trips.

shankari commented Aug 10, 2021 • edited Loading

shankari commented Aug 10, 2021 • edited Loading

shankari commented Aug 10, 2021 • edited Loading

corinne-hcr commented Aug 10, 2021

shankari commented Aug 10, 2021

corinne-hcr commented Aug 10, 2021

shankari commented Aug 10, 2021 • edited Loading

corinne-hcr commented Aug 11, 2021

shankari commented Aug 11, 2021 • edited Loading

corinne-hcr commented Aug 11, 2021

shankari commented Aug 11, 2021 • edited Loading

shankari commented Aug 13, 2021

shankari commented Jun 7, 2022

shankari commented Jun 7, 2022

shankari commented Jul 23, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 25, 2021 •

edited

Loading

shankari commented Jul 27, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 10, 2021 •

edited

Loading

shankari commented Aug 11, 2021 •

edited

Loading

shankari commented Aug 11, 2021 •

edited

Loading