Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrete transform #70

Merged
merged 10 commits into from
Feb 25, 2020
Merged

Discrete transform #70

merged 10 commits into from
Feb 25, 2020

Conversation

LSchueler
Copy link
Member

I added a transform function to the collection, which takes values and optionally thresholds and applies these to an existing field, in order to create a field made up of a number of discrete values.

@LSchueler LSchueler requested a review from MuellerSeb February 19, 2020 14:42
@LSchueler LSchueler added Documentation enhancement New feature or request labels Feb 19, 2020
@LSchueler LSchueler added this to the 1.1.2 milestone Feb 19, 2020
@LSchueler LSchueler self-assigned this Feb 19, 2020
Copy link
Member

@MuellerSeb MuellerSeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that addition! I really think this is a good extension to the binary transformation.
I have some issues with the standard handling of the input and the restrictions to the given values:

  1. why is it necessary for the thresholds to be related to the given values?
  2. I think a discrete transformation could also be called with a single number, that says how many values there should be in the end. Like: divide it in 5 values, following the original field.
  3. When giving values and thresholds, I would just check, if len(values) == len(thresholds)+1 and not that the corresponding thresholds are between the current values, because in the end the given values could be chosen independent of the given field.

What do you think?

srf = gs.SRF(model, seed=20170519)
srf.structured([x, y])
# create 5 equidistanly spaced values
discrete_values = np.linspace(np.min(srf.field), np.max(srf.field), 5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are equidistantly, yes. But they don't need to be equally distributed in the end. I think, this could be misleading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean it's misleading for this specific example? Would " create 5 eq. spaced values for this example" be better?! Or do you prefer an example with non-eq. spaced values?

else:
if thresholds is None:
# just in case, sort the values
values = np.sort(values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I want the values to BE unsorted? Why do the given values need to be in relation to the values of the given field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, at least for the way I implemented the transformation, the values have to be monotonically increasing. I'm also not sure of how to interpret non monotonically increasing values.

values = np.array(values)
thresholds = np.array(thresholds)
for i in range(len(thresholds)):
if not (values[i] <= thresholds[i] < values[i + 1]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this restriction?

Copy link
Member Author

@LSchueler LSchueler Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How else would you define an array of thresholds, which subdivides an array of values? - It's a bit like the nodes and the edges of a graph.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary transformation does the following:

  1. get a divide value to select "lower" and "upper" values
  2. replace the lower and upper values with given values

The divide values is unrelated to the given lower and upper values, which will be set in the field. It is only related to the input-field

if thresholds is None:
# just in case, sort the values
values = np.sort(values)
thresholds = (values[1:] + values[:-1]) / 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the natural way would be to us thresholds, so that the resulting ratios between the given values are even. Maybe I don't get the aim of this transformation...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was also not exactly sure how to handle the kind of input arguments the best.
But at least in my use case I'm just interested in the number of "value classes". The non-arithmetic mean thresholds are just a little generalisation, which was easy to implement.

If the function would just take an array of thresholds, what values would you assign to all the field values lying in between two thresholds?

@MuellerSeb MuellerSeb modified the milestones: 1.1.2, 1.2 Feb 20, 2020
@LSchueler
Copy link
Member Author

Thank's for the input.

I think I've addressed 1) with my other comments.

At first I also wanted to simply take an integer, but by taking an array of values, it becomes so much more flexible and I simply hope, that functions like np.linscape are known to our users, otherwise I've given the example.

But if the values are not related to the field anymore, then this should not be a field transformation, but a new kind of field generation. If you want to extend the value range of the field, it could simply be multiplied by a constant factor before transforming it.
The binary transformation would also only give a globally constant field, if the divide keyword would be larger than max(field) or less than min(field).

Now, arbitrary values can be assigned to the value classes which are
separated by the thresholds.
@LSchueler
Copy link
Member Author

Wow, yesterday I had a pretty big brain fart...
I finally got your point, removed the check and added another example illustrating the new capability.

Ready for merging?

@MuellerSeb
Copy link
Member

Wow, yesterday I had a pretty big brain fart...
I finally got your point, removed the check and added another example illustrating the new capability.

Ready for merging?

No problem :-)

Only thing I would argue, is the standard calculation of the threshold. I would propose the following way:

from scipy.special import erfinv
n = len(values)
p = np.arange(1, n) / n  # n-1 equal subdivisions of [0, 1]
# use quantile of the normal distribution to get equal portions for each value
thresholds = fld.mean + fld.model.sill * np.sqrt(2) * erfinv(2 * p - 1)

Then we also don't need to sort the values.

@MuellerSeb
Copy link
Member

Or we provide both ways with a switch. Then we could both be happy

@LSchueler
Copy link
Member Author

I get your point in wanting to relate the thresholds to a Gaussian process. But do you think there are application cases for discrete distributions where one would want that?

At least in my use case, I'd only want to have the arithmetic mean of the given values as a threshold.

Do you have an idea of how to concisely describe the difference of the two threshold calculation methods?

@MuellerSeb
Copy link
Member

MuellerSeb commented Feb 21, 2020

I get your point in wanting to relate the thresholds to a Gaussian process. But do you think there are application cases for discrete distributions where one would want that?

For example to get a facies distribution, where each facies has a number and they should be equally distributed and you want them to be next to each other in the order you have given them.

At least in my use case, I'd only want to have the arithmetic mean of the given values as a threshold.

Then the switch could state that. like th_mode="arithmetic" or th_mode="equal".

Do you have an idea of how to concisely describe the difference of the two threshold calculation methods?

One connects the given values to the field and does sth. like a nearest neighbor interpolation (dividing areas by the arithmetic mean of the given values). The other chunks the given field in equal portions and equips each area with a corresponding given value.

EDIT

Or we could use the thresholds keyword itself: thresholds="arithmetic"

@MuellerSeb
Copy link
Member

MuellerSeb commented Feb 21, 2020

If we use the thresholds keyword, only a minimal modification is needed:

if thresholds == "arithmetic":
    # just in case, sort the values
    values = np.sort(values)
    thresholds = (values[1:] + values[:-1]) / 2
elif thresholds == "equal":
    n = len(values)
    p = np.arange(1, n) / n  # n-1 equal subdivisions of [0, 1]
    # use quantile of the normal distribution to get equal portions for each value
    thresholds = fld.mean + fld.model.sill * np.sqrt(2) * erfinv(2 * p - 1)
else:
    if len(values) != len(thresholds) + 1:
        raise ValueError(
            "discrete transformation: len(values) != len(thresholds) + 1"
        )
    values = np.array(values)
    thresholds = np.array(thresholds, dtype=float)

EDIT

I made a mistake in the calculation, the term fld.model.sill * np.sqrt(2) should be np.sqrt(fld.model.sill * 2)

@LSchueler
Copy link
Member Author

Okay, I've added the possibility to add more automatic threshold calculations, but your suggestion is still pretty buggy. I'll have a look at it next week.

@MuellerSeb
Copy link
Member

I just found a bug in the Zinn&Harvey transformation, where I had the same fallacy. \sigma is not the variance but the standard deviation. That's why we have to use the root.

@LSchueler
Copy link
Member Author

Damn, such a minor thing I added to GSTools and so much worries with it :-)
Let's finally merge this thing!

@LSchueler LSchueler merged commit 66ea6a8 into develop Feb 25, 2020
@LSchueler LSchueler deleted the discrete_transform branch February 25, 2020 10:30
@MuellerSeb MuellerSeb mentioned this pull request Mar 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants