Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle overlapping slices #1735

Merged
merged 2 commits into from
Apr 1, 2024
Merged

fix: handle overlapping slices #1735

merged 2 commits into from
Apr 1, 2024

Conversation

tim-evans
Copy link
Collaborator

@tim-evans tim-evans commented Mar 27, 2024

This is a bug that I discovered with @colin-alexa during work on datasets & tables. Our current implementation of extractSlices does not properly handle overlapping slices, and will duplicate text since it does not handle cases where slices are nested or overlap.

To illustrate this, we may have a document that has slices inside of slices, which are
something very possible inside of tables:

Album Art Info
Debut (1993)
Listen on Bandcamp

Debut is the debut studio album by Icelandic recording artist Björk as an international singer, released in July 1993 by One Little Independent Records and Elektra Entertainment. It was produced by Björk and Nellee Hooper.

It was Björk's first recording following the dissolution of her previous band, the Sugarcubes. The album departed from the rock style of her previous work and drew from an eclectic variety of styles, including electronic pop, house music, jazz, and trip-hop.

Post (1995)
Listen on Bandcamp

Post is the second studio album by Icelandic recording artist Björk, released in 1995 in the United Kingdom by One Little Independent Records and in the United States by Elektra Entertainment.

Whereas Björk's previous album Debut (1993) was produced almost entirely by Nellee Hooper, Björk produced Post herself with co-producers including Hooper, 808 State's Graham Massey, and former Massive Attack member Tricky.

Continuing the style developed on Debut, Post is considered an important exponent of art-pop. It features an eclectic mixture of electronic and dance styles such as techno, trip-hop, IDM, and house, but also ambient, jazz, industrial, and experimental music.

Björk wrote most of the songs after moving to London and intended Post to convey the city's pace, urban culture, and underground club culture.

Here we have some captions on images, inside of tables.

When we represent this content for rendering, we will want the records in the dataset to have well structured documents, so that the first cell for album art looks like:

{
  "text": "\uFFFC",
  "blocks": [{
    "id": "B00000000",
    "type": "image",
    "parents": [],
    "selfClosing": false,
    "attributes": {
      "src": "https://f4.bcbits.com/img/a2682265154_10.jpg",
      "caption": "M00000001"
    }
  }]
}

And the caption slice would be:

{
  "text": "\uFFFCDebut (1993)\uFFFCListen on Bandcamp",
  "blocks": [{
    "id": "B00000000",
    "type": "text",
    "parents": [],
    "selfClosing": false,
    "attributes": {}
  }, {
    "id": "B00000001",
    "type": "line-break",
    "parents": [],
    "selfClosing": true,
    "attributes": {}
  }],
  "marks": [{
    "id": "M00000001",
    "type": "slice",
    "range": "[1..32]",
    "attributes": {
      "refs": ["B00000000"],
    }
  }]
}

Currently, the data cell for this would have the caption slice included, which will result in duplicate text in rendering:

{
  "text": "\uFFFCDebut (1993)\uFFFCListen on Bandcamp",
  "blocks": [{
    "id": "B00000000",
    "type": "image",
    "parents": [],
    "selfClosing": false,
    "attributes": {
      "src": "https://f4.bcbits.com/img/a2682265154_10.jpg",
      "caption": "M00000001"
    }
  }, {
    "id": "B00000001",
    "type": "line-break",
    "parents": [],
    "selfClosing": true,
    "attributes": {}
  }],
  "marks": [{
    "id": "M00000001",
    "type": "slice",
    "range": "[1..32]",
    "attributes": {
      "refs": ["B00000000"],
    }
  }]
}

@colin-alexa
Copy link
Contributor

The util module is getting pretty long; this code seems pretty self contained so it might be nice to start breaking this file up a bit

let [, slices] = extractSlices(original);
expect(slices.get("M00000000")?.text).toEqual("A");
expect(slices.get("M00000001")?.text).toEqual("BB");
expect(slices.get("M00000002")?.text).toEqual("C");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice

Copy link
Contributor

@colin-alexa colin-alexa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing! great work @tim-evans

@tim-evans tim-evans merged commit 4a02732 into main Apr 1, 2024
3 checks passed
@tim-evans tim-evans deleted the slices-in-slices branch April 1, 2024 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants