Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EmojiSpider.py Refactor #15

Open
MyriaCore opened this issue Oct 12, 2020 · 2 comments
Open

EmojiSpider.py Refactor #15

MyriaCore opened this issue Oct 12, 2020 · 2 comments

Comments

@MyriaCore
Copy link
Contributor

MyriaCore commented Oct 12, 2020

ATM EmojiSpider.py pulls from 2 sources - a unicode website, and emojipedia.org. The unicode website appears to be built primarily for human viewing, and NOT for parsing. It also has codepoints that don't really match up with the actual emoji representations, since the fe0f sequence (which can be used to control for text vs emoji presentation) is often omitted.

I propose a few changes:

  1. using plane jane requests.py, rather than the weird combo of requests.py and scrapy. Scrapy appears best for when writing a spider that might have to branch out and explore links in a tree-like pattern. Our use case is more linear, since we'd be parsing a single page, and at MOST, cross referencing info
  2. pull from emojipedia EXCLUSIVELY. In my personal experience, emojipedia has been a more consistent source to pull from than the unicode.org site we were using previously.
  3. do a SQLITE database refractor to make the data itself more self-evident, including:
    • foreign key constraints for data in two tables that use the same key
    • split up the tables. Have a table for:
      • Emojis (id and codepoint)
      • Skin tone variant (emoji id's of skin tone variants)
      • search term table (emojis id and individual search term). Potentially many search terms per emoji id.
      • shortcode table (emoji id and shortcode). Potentially many shortcode per emoji id.
      • thumbnail table (emoji id, thumbnail style, path to image).

These changes would make the code base more readable, and it would also enable users to more easily hack the SQLITE DB to add things they want. It might also make it easier to deliver new features.

@MyriaCore
Copy link
Contributor Author

MyriaCore commented Oct 12, 2020

Here's a mock up of what the new SQL schema might look like under 3:

New SQL Schema
-- Emojis table holds all the actual emoji data
CREATE TABLE Emojis (
  name VARCHAR, 
  codepoint VARCHAR PRIMARY KEY
);

-- Thumbnails table associates emojis with the paths to their thumbails, in the
-- various different styles that are supported. 
CREATE TABLE Thumbnails (
  emoji VARCHAR,
  apple VARCHAR,      -- path to apple-style thumbnail
  twemoji VARCHAR,    -- path to twemoji-style thumbnail
  noto VARCHAR,       -- path to noto-style thumbnail
  blobmoji VARCHAR,   -- path to blobmoji-style thumbnail
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- SkinToneVariants table associates emojis that are skin-tone variants of each
-- other. 
CREATE TABLE SkinToneVariants (
  'default' VARCHAR,    -- 👌 default skin tone variant
  light VARCHAR,        -- 👌🏻 light skin tone variant
  medium_light VARCHAR, -- 👌🏼 medium-light skin tone variant
  medium VARCHAR,       -- 👌🏽 medium skin tone variant
  medium_dark VARCHAR,  -- 👌🏾 medium-dark skin tone variant
  dark VARCHAR,         -- 👌🏿 dark skin tone variant
  -- each skintone variant should be a codepoint from the Emojis table. 
  FOREIGN KEY ('default', light, medium_light, medium, medium_dark, dark) 
  REFERENCES Emojis (
    codepoint, codepoint, codepoint, 
    codepoint, codepoint, codepoint
  )
);

-- Keywords table holds keywords associated w/ emojis. Can be multiple keywords 
-- per emoji.
CREATE TABLE KeyWords (
  emoji VARCHAR,
  keyword VARCHAR,
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- Keywords table holds shortcodes associated w/ emojis. Can be multiple 
-- shortcodes per emoji.
CREATE TABLE ShortCodes (
  emoji VARCHAR,
  shortcode VARCHAR,
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- Create Indices for common lookups
CREATE INDEX idx_emoji_thumbnail ON Thumbnail (emoji);
CREATE INDEX idx_default_skintone ON SkinToneVariants ('default');
CREATE INDEX idx_emoji_keyword ON KeyWords (emoji);
CREATE INDEX idx_emoji_shortcode ON ShortCodes (emoji);

Advantages:

  • Insertions and lookups are way cleaner, no need to do a whole-database update
    • New shortcodes can be added (potentially by the user) with:
      INSERT INTO ShortCodes (emoji, shortcode)
      VALUES ("🎉", "tada")
    • New keywords can be added (potentially by the user) with:
      INSERT INTO KeyWords (emoji, keyword)
      VALUES ("🎉", "celebration")
    • To get a table of keywords associated with 🎉:
      SELECT keyword FROM KeyWords
      WHERE emoji = "🎉"
  • The use of foreign keys (e.g. emoji.name in the old schema, Emojis.codepoint in this one) is more self-evident, and is structurally validated by SQLITE
  • Rely less on parsing CSV values to represent compound data (e.g. name_search, shortcodes, etc.), and instead use individual tables, and SELECT / UNION to get the data together

Caveats:

  • Tables with foreign keys that reference Emojis are dependent on Emojis. This means:
    • Insertions into dependent tables require that the associated codepoint does exist in Emojis table.
    • See SQLite foreign key constraint actions section in this page for information about how to manually specify what insertions, deletions, updates, etc. do to foreign keys. For deletions and updates, CASCADE seems like a good choice.

@gornostal
Copy link
Member

I like the ideas you outlined here 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants