Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand components/ucd/tests/category_tests.rs #43

Closed
behnam opened this issue Jul 19, 2017 · 9 comments
Closed

Expand components/ucd/tests/category_tests.rs #43

behnam opened this issue Jul 19, 2017 · 9 comments
Labels
A: test Testing C: ucd Unicode Character Database help wanted L: easy Level: Easy
Milestone

Comments

@behnam
Copy link
Member

behnam commented Jul 19, 2017

We have a cross-component test in components/ucd/tests/category_tests.rs that checks values of the Bidi_Class property against General_Category property, based on UAX#9's Table 4. Bidirectional Character Types.

Now that we have component/ucd/category, we can expand the test to also cover General_Category values.

@behnam behnam added A: test Testing C: ucd Unicode Character Database help wanted L: easy Level: Easy labels Jul 19, 2017
@behnam behnam changed the title Expand components/utils/tests/category_tests.rs Expand components/ucd/tests/category_tests.rs Jul 19, 2017
@calum
Copy link

calum commented Jul 26, 2017

Hi, I would like to work on this. I'm new to unic but saw this in the this-week-in-rust update.
Working on it now

@calum
Copy link

calum commented Jul 26, 2017

Hi again,

I've been looking into this issue for a couple of hours (after reading up on unicode). I'm not sure what needs to be done here. Do you want a second test which will check that all the General_Category values are correct for cp in iter_all_chars()?

Or did you just want to replace the unic_ucd::normal::is_combining_mark with the General_Category is_mark() implementation?

Thanks for any help, this project is really interesting to me so I would like to keep contributing in the future.

https://github.com/calum/rust-unic/blob/expand_catrgory_tests/unic/ucd/tests/category_tests.rs

use unic_ucd::category::GeneralCategory as GC;

...

#[test]
fn test_bidi_nsm_against_gen_cat() {
    // Every NSM must be a GC=Mark
    for cp in iter_all_chars() {
        if BidiClass::of(cp) == BidiClass::NonspacingMark {
            assert!(GC::is_mark(&GC::of(cp))); // changed from is_combining_mark(cp);
        }
    }

    // Every GC!=Mark must not be an NSM
    for cp in iter_all_chars() {
        if !GC::is_mark(&GC::of(cp)) { // changed from !is_combining_mark(cp);
            assert_ne!(BidiClass::of(cp), BidiClass::NonspacingMark);
        }
    }
}

// This is just to help me learn what's happening
// but I could change this to test all the GC values are correct
#[test]
fn test_gc_val() {
    let mut list: Vec<GC> = Vec::new();
    for cp in iter_all_chars() {
        let category = GC::of(cp);
        if !list.contains(&category) {
            list.push(category);
        }
    }
    // prints all GC values except from 'Unassigned'
    // which is obvious since we loop over all chars
    println!("{:?}", list);
}

@CAD97
Copy link
Collaborator

CAD97 commented Jul 26, 2017

@calum I believe the idea here is that we want a test to make sure that GC::is_mark agrees with normal::is_combining_mark.

So, leaving the test which exists alone, something along the lines of:

for every character
  if character normal::is_category_mark
    assert character GC::is_mark
  else
    assert character not GC::is_mark

But that actually brings up a point about allowing normal::is_category_mark to be implemented off of GeneralCategory::is_mark if it's being used. If it's not being used, it's better to have the individual table for the information so that someone importing ucd-normal doesn't also have to bring the entire table for ucd-category into their project. I'll open a new issue on that topic.

@behnam
Copy link
Member Author

behnam commented Jul 26, 2017

Thanks, @calum, for working on this, and welcome! :)

First, normal::is_combining_mark and GC::is_mark() are expected to return the same results. Since we have duplication functions, we can have a consistency test for it, too. The reason we have two separate implementation is that 1) the category component didn't exist when I brought in the normal component, then 2) all normal needs from GC property is the is_mark data, and we haven't answered it yet if the additional size of category binary is negligible comparing to the whole normal component. Let me file a separate issue for it so we can discuss and merge, if possible.

Now, about non-mark GC values: if you look at the Bidirectional Character Types table, you can see that some of the table rows contain GC sets listed in their General Scope column. For example, Bidi_Class = BN has "Default ignorables, non-characters, and control characters, other than those explicitly given other types", meaning that all chars with GC=Co will have BC=BN.

Basically, it's up to you to find all possible such tests from the table, and if some of these tests fail, we can see if it's a bug, or just kind of over-generalization in the language.

We can expand this more when we have the Script property implemented. For example, you can have a test to make sure BC=R for all chars with GC.is_letter() and Script=Hebrew.

Does this help?

@calum
Copy link

calum commented Jul 26, 2017

Thank you both for the in depth help! I think I have everything I need to finish this off now. I'll aim to put in a pull request by Saturday (hopefully earlier).

I really appreciate the help, and the opportunity to contribute.

@calum
Copy link

calum commented Jul 27, 2017

I deleted my last question.

I think I'm working it out now. So an example of one of my tests looks like this:

/// `Bidi_Class=B := General_Category in { Cc (Control), Zp (ParagraphSeparator) }`
///
/// <http://www.unicode.org/reports/tr9/#NSM>
#[test]
fn test_bidi_b_against_gen_cat() {
    // Every B must be a GC::Control or GC::ParagraphSeparator
    for cp in iter_all_chars() {
        if BidiClass::of(cp) == BidiClass::ParagraphSeparator {
            assert!(
                GC::of(cp) == GC::Control ||
                GC::of(cp) == GC::ParagraphSeparator
            );
        }
    }
}

@calum
Copy link

calum commented Jul 27, 2017

Pull request #61 includes the typo which I was fixing in PR #60

@CAD97
Copy link
Collaborator

CAD97 commented Aug 1, 2017

#61 was merged. Should this be closed?

@behnam
Copy link
Member Author

behnam commented Aug 4, 2017

Yep, done here. Thanks, @calum!

@behnam behnam closed this as completed Aug 4, 2017
@behnam behnam added this to the UNIC-0.5 milestone Aug 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: test Testing C: ucd Unicode Character Database help wanted L: easy Level: Easy
Projects
None yet
Development

No branches or pull requests

3 participants