perf: do not store redundant feature strings for each candidate #217

Open
lukehsiao opened this issue Feb 12, 2019 · 1 comment
Labels
discussion Further information is requested help wanted Extra attention is required

Comments

@lukehsiao
Contributor

lukehsiao commented Feb 12, 2019

The feature table is very large because we store the string of every feature for every candidate, even though many of these strings are shared between candidates. This also slows down queries. We should add a lookup table that maps to those strings, so that each string is stored only once rather than once per candidate.
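A minimal sketch of the idea, using an in-memory SQLite database rather than the project's actual schema. The table names `feature_key_demo` and `feature_demo`, the helper functions, and the feature strings are all hypothetical, chosen only to show each string being stored once and referenced by an integer id.

```python
import sqlite3

# Hypothetical schema for illustration only, not Fonduer's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature_key_demo (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL      -- each feature string stored once
    );
    CREATE TABLE feature_demo (
        candidate_id INTEGER NOT NULL,
        key_id       INTEGER NOT NULL REFERENCES feature_key_demo(id),
        value        REAL NOT NULL,
        PRIMARY KEY (candidate_id, key_id)
    );
    """
)


def key_id(name):
    """Return the integer id for a feature string, inserting it if new."""
    conn.execute(
        "INSERT OR IGNORE INTO feature_key_demo (name) VALUES (?)", (name,)
    )
    row = conn.execute(
        "SELECT id FROM feature_key_demo WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def add_features(candidate_id, features):
    """Store a candidate's features as (candidate, key id, value) rows."""
    rows = [(candidate_id, key_id(n), v) for n, v in features.items()]
    conn.executemany("INSERT INTO feature_demo VALUES (?, ?, ?)", rows)


# Two candidates sharing one made-up feature string.
add_features(1, {"HYPOTHETICAL_WORD_current": 1.0, "HYPOTHETICAL_SAME_ROW": 1.0})
add_features(2, {"HYPOTHETICAL_WORD_current": 1.0})

n_keys = conn.execute("SELECT count(*) FROM feature_key_demo").fetchone()[0]
n_rows = conn.execute("SELECT count(*) FROM feature_demo").fetchone()[0]
print(n_keys, n_rows)
```

The shared string occupies one row in the key table no matter how many candidates use it; the per-candidate rows carry only integers.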

This will take some consideration, though. Right now it is very easy to inspect a particular candidate's features or labels simply through cand.features, and we would lose that convenience if we refactored the code this way.
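One possible way to keep that convenience, sketched with plain Python classes. `KeyTable` and `Candidate` here are hypothetical stand-ins, not Fonduer's classes: the candidate stores integer ids internally and a `features` property resolves them back to strings on access.

```python
class KeyTable:
    """Hypothetical in-memory mapping between feature strings and integer ids."""

    def __init__(self):
        self._ids = {}     # string -> id
        self._names = []   # id -> string

    def intern(self, name):
        """Return the id for a string, assigning a new one if unseen."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name(self, key_id):
        """Return the string for a previously assigned id."""
        return self._names[key_id]


class Candidate:
    """Hypothetical candidate that stores ids but exposes strings."""

    def __init__(self, key_table, features):
        self._key_table = key_table
        # Only integer ids are kept per candidate.
        self._feature_ids = {key_table.intern(n): v for n, v in features.items()}

    @property
    def features(self):
        # Resolve ids back to strings only when someone inspects them.
        return {self._key_table.name(i): v for i, v in self._feature_ids.items()}


keys = KeyTable()
cand = Candidate(keys, {"HYPOTHETICAL_WORD_current": 1.0})
print(cand.features)
```

The trade-off is an extra lookup on every inspection, which may or may not be acceptable depending on how often `cand.features` is read in bulk.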

As a rough analysis, with a dataset of 253,524 candidates, we see the following.

Total number of unique feature keys:

# select count(*) from feature_key;
 count 
-------
 17705

Total number of features:

# select count(*) from (select unnest(keys) from feature) as temp;
  count   
----------
 43317015
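To put the two counts in proportion, the arithmetic below uses only the numbers reported above. It works out to roughly 171 stored feature strings per candidate, and each unique key being repeated roughly 2,447 times on average.

```python
# Figures copied from the query output above.
num_candidates = 253_524
unique_keys = 17_705
stored_features = 43_317_015

# Average number of feature strings stored per candidate.
per_candidate = stored_features / num_candidates

# Average number of times each unique key string is repeated.
repeats_per_key = stored_features / unique_keys

print(f"{per_candidate:.1f} features per candidate")
print(f"{repeats_per_key:.1f} stored copies per unique key")
```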

The feature table is by far the largest table.

# \d+
                               List of relations
 Schema |         Name          |   Type   |  Owner  |    Size    | Description
--------+-----------------------+----------+---------+------------+-------------
 public | candidate             | table    | user    | 14 MB      |
 public | candidate_id_seq      | sequence | user    | 8192 bytes |
 public | caption               | table    | user    | 8192 bytes |
 public | caption_mention       | table    | user    | 0 bytes    |
 public | ce_v_max              | table    | user    | 48 kB      |
 public | cell                  | table    | user    | 8128 kB    |
 public | cell_mention          | table    | user    | 0 bytes    |
 public | context               | table    | user    | 30 MB      |
 public | context_id_seq        | sequence | user    | 8192 bytes |
 public | document              | table    | user    | 3208 kB    |
 public | document_mention      | table    | user    | 0 bytes    |
 public | feature               | table    | user    | 598 MB     |
 public | feature_key           | table    | user    | 3256 kB    |
 public | figure                | table    | user    | 600 kB     |
 public | figure_mention        | table    | user    | 0 bytes    |
 public | gold_label            | table    | user    | 23 MB      |
 public | gold_label_key        | table    | user    | 16 kB      |
 public | implicit_span_mention | table    | user    | 1272 kB    |
 public | label                 | table    | user    | 40 MB      |
 public | label_key             | table    | user    | 24 kB      |
 public | marginal              | table    | user    | 0 bytes    |
 public | marginal_id_seq       | sequence | user    | 8192 bytes |
 public | mention               | table    | user    | 560 kB     |
 public | mention_id_seq        | sequence | user    | 8192 bytes |
 public | paragraph             | table    | user    | 4936 kB    |
 public | paragraph_mention     | table    | user    | 0 bytes    |
 public | part                  | table    | user    | 216 kB     |
 public | part_ce_v_max         | table    | user    | 272 kB     |
 public | part_polarity         | table    | user    | 2440 kB    |
 public | part_stg_temp_max     | table    | user    | 4216 kB    |
 public | part_stg_temp_min     | table    | user    | 4224 kB    |
 public | polarity              | table    | user    | 88 kB      |
 public | prediction            | table    | user    | 8192 bytes |
 public | prediction_key        | table    | user    | 8192 bytes |
 public | section               | table    | user    | 48 kB      |
 public | section_mention       | table    | user    | 0 bytes    |
 public | sentence              | table    | user    | 87 MB      |
 public | span_mention          | table    | user    | 264 kB     |
 public | stable_label          | table    | user    | 8192 bytes |
 public | stg_temp_max          | table    | user    | 152 kB     |
 public | stg_temp_min          | table    | user    | 152 kB     |
 public | table                 | table    | user    | 112 kB     |
 public | table_mention         | table    | user    | 0 bytes    |
 public | webpage               | table    | user    | 8192 bytes |
(44 rows)
lukehsiao added the enhancement, help wanted, and discussion labels and removed the enhancement label on Feb 12, 2019
@HiromuHota
Contributor

This is not the only issue: scalability, in both computation and storage, is a challenge for Fonduer in general.
