perf: do not store redundant feature strings for each candidate #217

lukehsiao · 2019-02-12T20:21:15Z

The feature table is very large because we store the string of every feature for every candidate, even though many of these strings are shared between candidates. This also slows down queries. We should add a table that maps to those strings, so that each string only needs to be stored once.
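As a rough illustration of the idea, here is a minimal sketch in plain Python, assuming a lookup structure that hands out one small integer id per distinct feature string. The class `KeyTable`, the function `encode`, and the feature names are hypothetical and are not Fonduer's actual schema or API.

```python
from typing import Dict, List


class KeyTable:
    """Hypothetical lookup table: one integer id per distinct feature string."""

    def __init__(self) -> None:
        self._id_by_name: Dict[str, int] = {}
        self._name_by_id: List[str] = []

    def intern(self, name: str) -> int:
        """Return the id for `name`, assigning a new one on first sight."""
        key_id = self._id_by_name.get(name)
        if key_id is None:
            key_id = len(self._name_by_id)
            self._id_by_name[name] = key_id
            self._name_by_id.append(name)
        return key_id

    def name(self, key_id: int) -> str:
        """Resolve an id back to its feature string."""
        return self._name_by_id[key_id]


def encode(raw: Dict[int, List[str]], table: KeyTable) -> Dict[int, List[int]]:
    """Replace each candidate's feature strings with integer ids."""
    return {cand: [table.intern(n) for n in names] for cand, names in raw.items()}


# Made-up feature names for two candidates; "SAME_ROW" is shared.
raw = {
    1: ["WORD_foo", "SAME_ROW"],
    2: ["SAME_ROW", "WORD_bar"],
}
table = KeyTable()
encoded = encode(raw, table)
print(encoded)
print(table.name(1))
```

Each candidate row then holds only integers, and the shared string is stored a single time in the lookup structure.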

This will take some consideration, though. As it is right now, it is very easy to inspect a particular candidate's features or labels simply through cand.features, and we would lose that if we were to refactor the code in this way.
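One possible way to keep that convenience, sketched under assumptions: the candidate stores integer ids internally and a read-only property resolves them to strings on access. The `Candidate` class below is a hypothetical stand-in and does not reflect Fonduer's actual classes or how cand.features is implemented today.

```python
from typing import Dict, Sequence


class Candidate:
    """Hypothetical candidate that stores key ids but exposes key strings."""

    def __init__(
        self,
        key_ids: Sequence[int],
        values: Sequence[float],
        key_names: Sequence[str],
    ) -> None:
        self._key_ids = list(key_ids)
        self._values = list(values)
        # Shared lookup of id -> string; referenced, not copied per candidate.
        self._key_names = key_names

    @property
    def features(self) -> Dict[str, float]:
        """Resolve ids to strings only when someone inspects the candidate."""
        return {self._key_names[i]: v for i, v in zip(self._key_ids, self._values)}


# Made-up feature names, shared by all candidates.
key_names = ["SAME_ROW", "SAME_COL"]
cand = Candidate(key_ids=[1, 0], values=[1.0, 1.0], key_names=key_names)
print(cand.features)
```

The trade-off in this sketch is an extra lookup on every inspection, in exchange for not duplicating strings in storage.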

As a rough analysis, with a dataset of 253,524 candidates, we see the following.

Total number of unique feature keys:

# select count(*) from feature_key;
 count 
-------
 17705

Total number of features:

# select count(*) from (select unnest(keys) from feature) as temp;
  count   
----------
 43317015
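To put these counts in proportion, the short calculation below divides the number of stored keys by the number of candidates and by the number of unique keys. It uses only the figures reported in this issue; the variable names are illustrative.

```python
# Figures reported in this issue.
candidates = 253_524
unique_keys = 17_705
stored_keys = 43_317_015

# Average number of feature keys stored per candidate.
per_candidate = stored_keys / candidates

# Average number of times each unique key string is stored.
per_key = stored_keys / unique_keys

print(f"keys per candidate: {per_candidate:.1f}")
print(f"copies of each unique key: {per_key:.0f}")
```

On average each candidate carries roughly 171 keys, and each unique key string is stored well over two thousand times.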

The feature table is by far the largest table.

# \d+
                               List of relations
 Schema |         Name          |   Type   |  Owner  |    Size    | Description
--------+-----------------------+----------+---------+------------+-------------
 public | candidate             | table    | user    | 14 MB      |
 public | candidate_id_seq      | sequence | user    | 8192 bytes |
 public | caption               | table    | user    | 8192 bytes |
 public | caption_mention       | table    | user    | 0 bytes    |
 public | ce_v_max              | table    | user    | 48 kB      |
 public | cell                  | table    | user    | 8128 kB    |
 public | cell_mention          | table    | user    | 0 bytes    |
 public | context               | table    | user    | 30 MB      |
 public | context_id_seq        | sequence | user    | 8192 bytes |
 public | document              | table    | user    | 3208 kB    |
 public | document_mention      | table    | user    | 0 bytes    |
 public | feature               | table    | user    | 598 MB     |
 public | feature_key           | table    | user    | 3256 kB    |
 public | figure                | table    | user    | 600 kB     |
 public | figure_mention        | table    | user    | 0 bytes    |
 public | gold_label            | table    | user    | 23 MB      |
 public | gold_label_key        | table    | user    | 16 kB      |
 public | implicit_span_mention | table    | user    | 1272 kB    |
 public | label                 | table    | user    | 40 MB      |
 public | label_key             | table    | user    | 24 kB      |
 public | marginal              | table    | user    | 0 bytes    |
 public | marginal_id_seq       | sequence | user    | 8192 bytes |
 public | mention               | table    | user    | 560 kB     |
 public | mention_id_seq        | sequence | user    | 8192 bytes |
 public | paragraph             | table    | user    | 4936 kB    |
 public | paragraph_mention     | table    | user    | 0 bytes    |
 public | part                  | table    | user    | 216 kB     |
 public | part_ce_v_max         | table    | user    | 272 kB     |
 public | part_polarity         | table    | user    | 2440 kB    |
 public | part_stg_temp_max     | table    | user    | 4216 kB    |
 public | part_stg_temp_min     | table    | user    | 4224 kB    |
 public | polarity              | table    | user    | 88 kB      |
 public | prediction            | table    | user    | 8192 bytes |
 public | prediction_key        | table    | user    | 8192 bytes |
 public | section               | table    | user    | 48 kB      |
 public | section_mention       | table    | user    | 0 bytes    |
 public | sentence              | table    | user    | 87 MB      |
 public | span_mention          | table    | user    | 264 kB     |
 public | stable_label          | table    | user    | 8192 bytes |
 public | stg_temp_max          | table    | user    | 152 kB     |
 public | stg_temp_min          | table    | user    | 152 kB     |
 public | table                 | table    | user    | 112 kB     |
 public | table_mention         | table    | user    | 0 bytes    |
 public | webpage               | table    | user    | 8192 bytes |
(44 rows)


HiromuHota · 2020-09-09T23:55:54Z

Beyond this issue, scalability in both computation and storage is a challenge for Fonduer.

lukehsiao added the enhancement, help wanted, and discussion labels and removed the enhancement label on Feb 12, 2019.
