The feature table is very large because we store the strings of each feature for each candidate, even though many of these strings are shared between candidates. This also slows down queries. We should have an additional table that maps to those strings, so that each string does not need to be stored again for every candidate.
This will take some consideration, though. The current design makes it very easy to inspect a particular candidate's features or labels simply through cand.features, and we would lose that if we were to refactor the code in this way.
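One possible direction, sketched here only as an illustration: store each feature string once in a key table, reference it by integer id from the feature rows, and keep the convenient accessor by resolving the ids back to strings in a property. The sketch uses SQLite from the Python standard library and a one-row-per-feature layout rather than the project's actual schema; the column names, the `add_feature` helper, the `Candidate` class, and the feature names are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the real one (assumption: the real
# project uses PostgreSQL and a different row layout).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature_key (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL      -- each feature string stored once
    );
    CREATE TABLE feature (
        candidate_id INTEGER NOT NULL,
        key_id       INTEGER NOT NULL REFERENCES feature_key(id),
        value        REAL NOT NULL,
        PRIMARY KEY (candidate_id, key_id)
    );
    """
)


def add_feature(conn, candidate_id, name, value):
    """Hypothetical helper: intern the string, then store only its id."""
    conn.execute("INSERT OR IGNORE INTO feature_key (name) VALUES (?)", (name,))
    (key_id,) = conn.execute(
        "SELECT id FROM feature_key WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO feature (candidate_id, key_id, value) VALUES (?, ?, ?)",
        (candidate_id, key_id, value),
    )


class Candidate:
    """Hypothetical stand-in for the real candidate class."""

    def __init__(self, conn, candidate_id):
        self._conn = conn
        self.id = candidate_id

    @property
    def features(self):
        # Resolve integer ids back to strings so inspection stays easy.
        rows = self._conn.execute(
            "SELECT k.name, f.value FROM feature f "
            "JOIN feature_key k ON k.id = f.key_id "
            "WHERE f.candidate_id = ?",
            (self.id,),
        )
        return dict(rows.fetchall())


# Two candidates sharing one (made-up) feature string.
add_feature(conn, 1, "FEATURE_A", 1.0)
add_feature(conn, 1, "FEATURE_B", 1.0)
add_feature(conn, 2, "FEATURE_A", 1.0)

key_count = conn.execute("SELECT count(*) FROM feature_key").fetchone()[0]
cand = Candidate(conn, 1)
print(key_count, cand.features)
```

With this shape, cand.features still returns strings, at the cost of a join on each access; whether that join is cheaper than scanning the repeated strings would need to be measured on the real data.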
As a rough analysis, with a dataset of 253,524 candidates, we see the following.
Total number of unique feature keys:
# select count(*) from feature_key;
count
-------
17705
Total number of features:
# select count(*) from (select unnest(keys) from feature) as temp;
count
----------
43317015
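To put the two counts in proportion, here is a small back-of-the-envelope calculation. It assumes that every one of the 253,524 candidates has an entry in the feature table, which the numbers above do not confirm, so the per-candidate figure is only an average under that assumption.

```python
# Counts taken from the queries above.
candidates = 253_524
unique_keys = 17_705
total_features = 43_317_015

# Average number of stored feature strings per candidate
# (assumes every candidate has features).
per_candidate = total_features / candidates

# Average number of times each distinct key string is stored.
per_key = total_features / unique_keys

print(f"{per_candidate:.1f} feature strings per candidate on average")
print(f"each distinct key is stored about {per_key:.0f} times on average")
```

In other words, each distinct string is repeated a few thousand times on average, which is the redundancy a key-to-id mapping would remove.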
The feature table is by far the largest table.
# \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-----------------------+----------+---------+------------+-------------
public | candidate | table | user | 14 MB |
public | candidate_id_seq | sequence | user | 8192 bytes |
public | caption | table | user | 8192 bytes |
public | caption_mention | table | user | 0 bytes |
public | ce_v_max | table | user | 48 kB |
public | cell | table | user | 8128 kB |
public | cell_mention | table | user | 0 bytes |
public | context | table | user | 30 MB |
public | context_id_seq | sequence | user | 8192 bytes |
public | document | table | user | 3208 kB |
public | document_mention | table | user | 0 bytes |
public | feature | table | user | 598 MB |
public | feature_key | table | user | 3256 kB |
public | figure | table | user | 600 kB |
public | figure_mention | table | user | 0 bytes |
public | gold_label | table | user | 23 MB |
public | gold_label_key | table | user | 16 kB |
public | implicit_span_mention | table | user | 1272 kB |
public | label | table | user | 40 MB |
public | label_key | table | user | 24 kB |
public | marginal | table | user | 0 bytes |
public | marginal_id_seq | sequence | user | 8192 bytes |
public | mention | table | user | 560 kB |
public | mention_id_seq | sequence | user | 8192 bytes |
public | paragraph | table | user | 4936 kB |
public | paragraph_mention | table | user | 0 bytes |
public | part | table | user | 216 kB |
public | part_ce_v_max | table | user | 272 kB |
public | part_polarity | table | user | 2440 kB |
public | part_stg_temp_max | table | user | 4216 kB |
public | part_stg_temp_min | table | user | 4224 kB |
public | polarity | table | user | 88 kB |
public | prediction | table | user | 8192 bytes |
public | prediction_key | table | user | 8192 bytes |
public | section | table | user | 48 kB |
public | section_mention | table | user | 0 bytes |
public | sentence | table | user | 87 MB |
public | span_mention | table | user | 264 kB |
public | stable_label | table | user | 8192 bytes |
public | stg_temp_max | table | user | 152 kB |
public | stg_temp_min | table | user | 152 kB |
public | table | table | user | 112 kB |
public | table_mention | table | user | 0 bytes |
public | webpage | table | user | 8192 bytes |
(44 rows)