
Work with documents

Adrian Viehweger edited this page Mar 22, 2017 · 4 revisions

Documents are nested hash maps. In zoo, they appear in three forms:

  • DotMap object for object-like navigation and quick changes
  • dict, where we make changes to a schema more explicit
  • JSON formatted string for data movement (backup, sharing etc.)
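As a rough illustration of the dict and JSON forms, the sketch below round-trips a small nested dict through a JSON string using only the standard library. The field names are borrowed from the examples on this page, but the document itself is made up for illustration and is not zoo's actual schema.

```python
import json

# A made-up document; the field names mirror the examples on this page.
doc = {
    '_id': 'example-id',
    'metadata': {'host': 'human', 'location': 'Hong Kong'},
    'relative': {'taxonomy': {'subtype': 'H3N2'}},
}

# dict -> JSON formatted string, e.g. for backup or sharing
as_json = json.dumps(doc, sort_keys=True)

# JSON formatted string -> dict, restoring the nested structure
restored = json.loads(as_json)

print(restored['relative']['taxonomy']['subtype'])  # H3N2
```

Because the document only holds strings and nested maps, the round trip is lossless. Values such as dates would need to be serialized explicitly.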

We'll create the following document twice, once using a dict and once using a DotMap object. Note that DotMap is not ideal for enforcing a schema, because it allows assignment to non-existent keys as well as replacement of existing values. If schema enforcement is desired, we use zoo.utils.deep_set() and zoo.utils.deep_get(), which take explicit arguments for key creation (force) and replacement (replace). For further details on how to use these two functions, see the zoo.utils test file.
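To make the idea of explicit key creation and replacement concrete, here is a minimal sketch of how such helpers could behave. It is not zoo's implementation: the names deep_get_sketch and deep_set_sketch are hypothetical stand-ins, and the exact rules are assumptions (a missing key raises KeyError unless force=True, and a key that already holds a non-empty value raises ValueError unless replace=True). The real zoo.utils functions may differ, so check the zoo.utils test file for their actual behaviour.

```python
from copy import deepcopy

EMPTY = (None, '', [], {})  # assumption: these count as unfilled placeholders


def deep_get_sketch(d, key):
    """Return the value under a dotted key such as 'metadata.host'."""
    for part in key.split('.'):
        d = d[part]  # raises KeyError if any part of the path is missing
    return d


def deep_set_sketch(d, key, value, force=False, replace=False):
    """Set a dotted key; create or overwrite only when explicitly allowed."""
    *parents, last = key.split('.')
    for part in parents:
        if part not in d:
            if not force:
                raise KeyError(key)
            d[part] = {}  # force=True: create the missing level
        d = d[part]
    if last not in d:
        if not force:
            raise KeyError(key)
    elif d[last] not in EMPTY and not replace:
        raise ValueError(f'{key} already holds a value; pass replace=True')
    d[last] = value


# Made-up schema for illustration only.
schema = {'metadata': {'host': '', 'alt_id': []}}
d = deepcopy(schema)  # work on a copy so the schema stays untouched

deep_set_sketch(d, 'metadata.host', 'human')                # fills a placeholder
deep_set_sketch(d, 'metadata.host', 'swine', replace=True)  # explicit overwrite
deep_set_sketch(d, 'derivative.seqlen', 1701, force=True)   # explicit new key
deep_get_sketch(d, 'metadata.alt_id').append({'genbank': 'made-up-accession'})

try:
    deep_set_sketch(d, 'metadata.hots', 'human')  # typo in the key name
    typo_caught = False
except KeyError:
    typo_caught = True  # the misspelled key is rejected instead of created
```

The point of the two flags is that every departure from the schema, whether a new key or an overwritten value, has to be spelled out at the call site.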

Enforcing a schema:

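import re
from copy import deepcopy
from itertools import islice
from uuid import uuid4

from zoo.utils import deep_get, deep_set

# df (a pandas DataFrame), schema (a nested dict), parse_date and
# parse_nomenclature_iav are assumed to be defined or imported beforehand.
for i in islice(df.iterrows(), 50):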
    j = i[1]
    if 'influenza b' not in j.isolate.lower():

        d = deepcopy(schema)  # important, otherwise schema is modified
        entries = {
            '_id': str(uuid4()),
            'metadata.host': j.host.lower(),
            'metadata.location': j.country,
            'metadata.date': parse_date(j.date),
            }

        for k, v in entries.items():
            # print(k)
            deep_set(d, k, v, replace=True)

        deep_get(d, 'metadata.alt_id').append({'genbank': j.genbank})
        deep_set(d, 'relative.taxonomy.subtype', j.subtype)
        deep_set(d, 'derivative.segment_number', j.segment_number, force=True)
        deep_set(
            d, 'relative.taxonomy.nomenclature',
            re.search(r'\((.*)\)', j.isolate).group(1))
        # format: 'Influenza A virus (A/Hong Kong/1/1968(H3N2))'
        # returns: 'A/Hong Kong/1/1968(H3N2)'
        # stackoverflow, 15864800

        deep_set(
            d,
            'metadata.host_detail',
            parse_nomenclature_iav(
                deep_get(
                    d, 'relative.taxonomy.nomenclature'
                    ))['host'].lower(),
            force=True)
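The nomenclature extraction in the block above depends on `.*` being greedy, so that the match runs to the last closing parenthesis rather than the first. The sketch below shows the difference on the example string from the comments. The helper name extract_nomenclature is hypothetical, and returning None for names without parentheses is an added assumption (the original calls .group(1) directly, which raises AttributeError when nothing matches).

```python
import re

# Greedy: '.*' runs to the LAST closing parenthesis in the string.
GREEDY = re.compile(r'\((.*)\)')
# Lazy: '.*?' stops at the FIRST closing parenthesis, truncating the name.
LAZY = re.compile(r'\((.*?)\)')


def extract_nomenclature(isolate):
    """Hypothetical helper: text inside the outermost parentheses, or None."""
    match = GREEDY.search(isolate)
    return match.group(1) if match else None


isolate = 'Influenza A virus (A/Hong Kong/1/1968(H3N2))'

print(extract_nomenclature(isolate))          # A/Hong Kong/1/1968(H3N2)
print(LAZY.search(isolate).group(1))          # A/Hong Kong/1/1968(H3N2
print(extract_nomenclature('no parentheses')) # None
```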

Being more relaxed about the schema (and prone to accidental key assignments):

from dotmap import DotMap

# j is a row of df, as in the loop above
d = DotMap(schema)

d._id = str(uuid4())
d.metadata.ids.append({'genbank': j.genbank})
d.metadata.host = j.host.lower()
d.metadata.location = j.country
# Parsers for common tasks
d.metadata.date = parse_date(j.date)
# Create attributes that are not present in schema? No problem.
d.metadata.segment_number = j.segment_number
d.relative.taxonomy.subtype = j.subtype
d.derivative.update({'seqlen': j.seqlen})

d.relative.taxonomy.nomenclature = re.search(
    r'\((.*)\)', j.isolate).group(1)
# format: 'Influenza A virus (A/Hong Kong/1/1968(H3N2))'
# returns: 'A/Hong Kong/1/1968(H3N2)'
# stackoverflow, 15864800

d.metadata.host_detail = parse_nomenclature_iav(
    d.relative.taxonomy.nomenclature)['host'].lower()


# easy transformation between DotMap and dict
dm = DotMap(d)
d = dm.toDict()  # DotMap's conversion method is toDict(), not to_dict()