Multiple hits on a single index #44

Open
whitesided opened this issue Dec 5, 2013 · 4 comments

Comments

@whitesided

Maybe this isn't a bug but intended behavior? I'm not sure why it would be, unless it's meant to surface ambiguity in section identification and let the end user decide.

In 113hr2642eh we get three hits, all at the same index, on the same string:

[
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "950aaa-2",
      "subsections": [
        "d"
      ],
      "id": "usc/7/950aaa-2/d",
      "section_id": "usc/7/950aaa-2"
    }
  },
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "950aaa",
      "subsections": [],
      "id": "usc/7/950aaa",
      "section_id": "usc/7/950aaa"
    }
  },
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "2",
      "subsections": [
        "d"
      ],
      "id": "usc/7/2/d",
      "section_id": "usc/7/2"
    }
  }
]

I'm just going to skip subsequent hits on the same index and use the first one I find to cope with this for the moment.
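
A minimal sketch of that workaround, assuming the hits arrive as an array shaped like the output above (the helper name `firstHitPerIndex` is made up for illustration and is not part of the library):

```javascript
// Keep only the first hit reported at each index.
// Assumes `hits` is an array of objects with an `index` property,
// as in the JSON output quoted in this issue.
function firstHitPerIndex(hits) {
  const seen = new Set();
  const kept = [];
  for (const hit of hits) {
    if (seen.has(hit.index)) continue; // a later reading of the same text
    seen.add(hit.index);
    kept.push(hit);
  }
  return kept;
}

// Trimmed copy of the three hits from the report.
const hits = [
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/950aaa-2/d" } },
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/950aaa" } },
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/2/d" } },
];

console.log(firstHitPerIndex(hits).map((h) => h.usc.id));
```

This relies on the order in which the hits come back, so it only picks the "right" one if the full hyphenated section is listed first, as it is here.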

@konklone
Member

konklone commented Dec 5, 2013

This doesn't look intended to me; it looks like a very nice bug that deserves a very nice test case. Your workaround sounds right to me in the interim. Thanks for filing this. I'll work out a fix soon; I'm hoping to spend a bunch of time tomorrow or Friday on many of this project's open tickets.

@konklone
Member

konklone commented Dec 7, 2013

Never promise a timeline on a GitHub ticket! I got swamped, mostly with the /licensing repo. I'll get to this, and the other tickets, soon.

konklone added the Bug label Feb 9, 2014
konklone added this to the 1.0 milestone Feb 9, 2014
konklone added the Cites label Feb 9, 2014
tmcw added a commit that referenced this issue May 10, 2014
@konklone
Member

So this is actually expected behavior. When it detects a cite where it can't know whether the hyphen is part of a single section's name or separates two sections, it returns every interpretation, erring on the side of letting the user decide.

Here's the code dealing with establishing ambiguity and parsing of ranges:
https://github.com/unitedstates/citation/blob/master/citations/usc.js#L48-L65

This is because I built it originally to support a search engine, where you'd want to turn up too many results instead of too few. For a markup tool, I can see why you'd want to be stricter about it. But ultimately, the problem is that we can't always be certain whether a hyphen indicates two sections or one.
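
To make the "too many rather than too few" behavior concrete, here is a rough sketch of how a hyphenated section string could be expanded into every reading. It is not the code linked above; `sectionReadings` is a hypothetical helper written only to mirror the three hits in the original report:

```javascript
// Hypothetical helper (not the library's implementation): list every
// reading of a section string that may contain a hyphen.
function sectionReadings(section) {
  // Reading 1: a single section whose name contains the hyphen.
  const readings = [section];
  const dash = section.indexOf("-");
  if (dash > 0 && dash < section.length - 1) {
    // Reading 2: two sections, reported as the two sides of the hyphen.
    readings.push(section.slice(0, dash), section.slice(dash + 1));
  }
  return readings;
}

console.log(sectionReadings("950aaa-2")); // [ '950aaa-2', '950aaa', '2' ]
```

Those three strings correspond to the `section` values of the three hits in the report: `950aaa-2`, `950aaa`, and `2`.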

There are a couple of unambiguous situations. If there's a double section symbol (§§), it assumes a range. If there's a parenthesis before the hyphen, it also assumes a range, because the parenthesis denotes a subsection, so the text has stopped describing the section-level identifier.
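
Those two rules could be sketched roughly as follows. This is an illustration of the rules as described here, not the logic in usc.js; the function name `hyphenKind` and the second and third example strings are made up:

```javascript
// Hypothetical classifier for the hyphen in a matched cite.
// Returns "none", "range", or "ambiguous".
function hyphenKind(citeText) {
  const dash = citeText.indexOf("-");
  if (dash === -1) return "none";
  // A double section symbol signals more than one section.
  if (citeText.includes("§§")) return "range";
  // A "(" before the hyphen means a subsection has already started,
  // so the hyphen cannot be part of the section identifier.
  if (citeText.slice(0, dash).includes("(")) return "range";
  // Otherwise we cannot tell "950aaa-2" (one section) from a range.
  return "ambiguous";
}

console.log(hyphenKind("7 U.S.C. 950aaa-2(d)"));     // ambiguous
console.log(hyphenKind("42 U.S.C. §§ 1396a-1396d")); // range (made-up example)
console.log(hyphenKind("5 U.S.C. 552(a)-552(c)"));   // range (made-up example)
```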

@phearlez, how do you want to handle ambiguous sections? We could add an option that gets passed into the usc citator that instructs the processor whether to be generous or strict with ambiguous ranges. Or, you could make a client-side decision about it. And/or, the library could return an ambiguous: true flag on detected cites where it wasn't certain.
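
As a sketch of the client-side route, something like the following could tag hits that share an index. The `ambiguous` property is the flag floated above and is added here by the caller; the library is not assumed to emit it, and `flagAmbiguous` is a hypothetical name:

```javascript
// Hypothetical client-side pass: mark a hit as ambiguous when at least
// one other hit was reported at the same index.
function flagAmbiguous(hits) {
  const counts = new Map();
  for (const hit of hits) {
    counts.set(hit.index, (counts.get(hit.index) || 0) + 1);
  }
  // Return copies so the caller's original hit objects are untouched.
  return hits.map((hit) => ({ ...hit, ambiguous: counts.get(hit.index) > 1 }));
}

// Two readings at one index plus an unrelated, made-up hit at index 10.
const sampleHits = [
  { index: 811965, usc: { id: "usc/7/950aaa-2/d" } },
  { index: 811965, usc: { id: "usc/7/950aaa" } },
  { index: 10, usc: { id: "usc/5/552" } },
];

console.log(flagAmbiguous(sampleHits).map((h) => h.ambiguous)); // [ true, true, false ]
```

A markup tool could then drop or review the flagged hits, while a search index could keep them all.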

@whitesided
Author

For my purposes it's sufficient to have the more aggressive hit listed first; I've coped with this already by simply always using the first hit, basically assuming that the more "greedy" hit will be ordered first. An option to avoid ambiguity would be fine as well.

Our auto-tagging runs on an assumption that there will still be some human eyes on things eventually; it's a helping hand, not a replacement for involvement. So I'm open to either way.
