EmojiSpider.py Refactor #15

MyriaCore · 2020-10-12T18:10:48Z

ATM EmojiSpider.py pulls from 2 sources - a unicode website, and emojipedia.org. The unicode website appears to be built primarily for human viewing, and NOT for parsing. It also has codepoints that don't really match up with the actual emoji representations, since the fe0f sequence (which can be used to control for text vs emoji presentation) is often omitted.

I propose a few changes:

using plane jane requests.py, rather than the weird combo of requests.py and scrapy. Scrapy appears best for when writing a spider that might have to branch out and explore links in a tree-like pattern. Our use case is more linear, since we'd be parsing a single page, and at MOST, cross referencing info
pull from emojipedia EXCLUSIVELY. In my personal experience, emojipedia has been a more consistent source to pull from than the unicode.org site we were using previously.
do a SQLITE database refractor to make the data itself more self-evident, including:
- foreign key constraints for data in two tables that use the same key
- split up the tables. Have a table for:
  - Emojis (id and codepoint)
  - Skin tone variant (emoji id's of skin tone variants)
  - search term table (emojis id and individual search term). Potentially many search terms per emoji id.
  - shortcode table (emoji id and shortcode). Potentially many shortcode per emoji id.
  - thumbnail table (emoji id, thumbnail style, path to image).

These changes would make the code base more readable, and it would also enable users to more easily hack the SQLITE DB to add things they want. It might also make it easier to deliver new features.

The text was updated successfully, but these errors were encountered:

MyriaCore · 2020-10-12T20:43:18Z

Here's a mock up of what the new SQL schema might look like under 3:

New SQL Schema

-- Emojis table holds all the actual emoji data
CREATE TABLE Emojis (
  name VARCHAR, 
  codepoint VARCHAR PRIMARY KEY
);

-- Thumbnails table associates emojis with the paths to their thumbails, in the
-- various different styles that are supported. 
CREATE TABLE Thumbnails (
  emoji VARCHAR,
  apple VARCHAR,      -- path to apple-style thumbnail
  twemoji VARCHAR,    -- path to twemoji-style thumbnail
  noto VARCHAR,       -- path to noto-style thumbnail
  blobmoji VARCHAR,   -- path to blobmoji-style thumbnail
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- SkinToneVariants table associates emojis that are skin-tone variants of each
-- other. 
CREATE TABLE SkinToneVariants (
  'default' VARCHAR,    -- 👌 default skin tone variant
  light VARCHAR,        -- 👌🏻 light skin tone variant
  medium_light VARCHAR, -- 👌🏼 medium-light skin tone variant
  medium VARCHAR,       -- 👌🏽 medium skin tone variant
  medium_dark VARCHAR,  -- 👌🏾 medium-dark skin tone variant
  dark VARCHAR,         -- 👌🏿 dark skin tone variant
  -- each skintone variant should be a codepoint from the Emojis table. 
  FOREIGN KEY ('default', light, medium_light, medium, medium_dark, dark) 
  REFERENCES Emojis (
    codepoint, codepoint, codepoint, 
    codepoint, codepoint, codepoint
  )
);

-- Keywords table holds keywords associated w/ emojis. Can be multiple keywords 
-- per emoji.
CREATE TABLE KeyWords (
  emoji VARCHAR,
  keyword VARCHAR,
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- Keywords table holds shortcodes associated w/ emojis. Can be multiple 
-- shortcodes per emoji.
CREATE TABLE ShortCodes (
  emoji VARCHAR,
  shortcode VARCHAR,
  FOREIGN KEY (emoji) -- emoji should be a codepoint from the Emojis table
  REFERENCES Emojis (codepoint)
);

-- Create Indices for common lookups
CREATE INDEX idx_emoji_thumbnail ON Thumbnail (emoji);
CREATE INDEX idx_default_skintone ON SkinToneVariants ('default');
CREATE INDEX idx_emoji_keyword ON KeyWords (emoji);
CREATE INDEX idx_emoji_shortcode ON ShortCodes (emoji);

Advantages:

Insertions and lookups are way cleaner, no need to do a whole-database update
- New shortcodes can be added (potentially by the user) with:
```
INSERT INTO ShortCodes (emoji, shortcode)
VALUES ("🎉", "tada")
```
- New keywords can be added (potentially by the user) with:
```
INSERT INTO KeyWords (emoji, keyword)
VALUES ("🎉", "celebration")
```
- To get a table of keywords associated with 🎉:
```
SELECT keyword FROM KeyWords
WHERE emoji = "🎉"
```
The use of foreign keys (e.g. emoji.name in the old schema, Emojis.codepoint in this one) is more self-evident, and is structurally validated by SQLITE
Rely less on parsing CSV values to represent compound data (e.g. name_search, shortcodes, etc.), and instead use individual tables, and SELECT / UNION to get the data together

Caveats:

Tables with foreign keys that reference Emojis are dependent on Emojis. This means:
- Insertions into dependent tables require that the associated codepoint does exist in Emojis table.
- See SQLite foreign key constraint actions section in this page for information about how to manually specify what insertions, deletions, updates, etc. do to foreign keys. For deletions and updates, CASCADE seems like a good choice.

gornostal · 2020-10-13T05:21:12Z

I like the ideas you outlined here 👍

MyriaCore mentioned this issue Oct 12, 2020

Flag icons are broken #14

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EmojiSpider.py Refactor #15

EmojiSpider.py Refactor #15

MyriaCore commented Oct 12, 2020 •

edited

Loading

MyriaCore commented Oct 12, 2020 •

edited

Loading

gornostal commented Oct 13, 2020

EmojiSpider.py Refactor #15

EmojiSpider.py Refactor #15

Comments

MyriaCore commented Oct 12, 2020 • edited Loading

MyriaCore commented Oct 12, 2020 • edited Loading

gornostal commented Oct 13, 2020

MyriaCore commented Oct 12, 2020 •

edited

Loading

MyriaCore commented Oct 12, 2020 •

edited

Loading