Processing data puts it into a state more readily available for analysis and makes the data legible. For instance, it could be rendered as structured data, which can itself take many forms (a table, for example). Here are a few formats you're likely to come across, all representing the same data:
XML, or eXtensible Markup Language, uses a nested structure in which "tags" like `<Cat>` contain other tags, like `<firstName>`. This format is good for organizing a document in a tree-like structure, just like HTML, where we want to nest elements such as a sentence within a paragraph. XML does not carry any information about how it should be displayed, so it can be used in a variety of presentation scenarios.
<Cats>
<Cat>
<firstName>Smally</firstName>
<lastName>McTiny</lastName>
</Cat>
<Cat>
<firstName>Kitty</firstName>
<lastName>Kitty</lastName>
</Cat>
<Cat>
<firstName>Foots</firstName>
<lastName>Smith</lastName>
</Cat>
<Cat>
<firstName>Tiger</firstName>
<lastName>Jaws</lastName>
</Cat>
</Cats>
This file is viewed in an online XML Viewer. If you would like to, you can copy the code chunk above to try it out in XML Viewer, or download the XML file to try it out in other viewers. To save the file to your local computer, right-click the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
For example, after downloading the file, can you try to open it in your browser? (Psst! Try right-clicking `cats.xml` in your local directory and choosing `Open with Other Application` in the drop-down menu to select the browser of your choice.)
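To make the nesting concrete, here is a minimal sketch of reading the same XML with Python's standard `xml.etree.ElementTree` module. The XML is embedded as a string so the example stands alone, and the function name `cats_from_xml` is an illustrative assumption, not part of the lesson's materials.

```python
import xml.etree.ElementTree as ET

# The cats data from above, embedded as a string so the sketch is self-contained.
xml_text = """
<Cats>
  <Cat><firstName>Smally</firstName><lastName>McTiny</lastName></Cat>
  <Cat><firstName>Kitty</firstName><lastName>Kitty</lastName></Cat>
  <Cat><firstName>Foots</firstName><lastName>Smith</lastName></Cat>
  <Cat><firstName>Tiger</firstName><lastName>Jaws</lastName></Cat>
</Cats>
"""


def cats_from_xml(text):
    """Return a list of (first name, last name) tuples, one per <Cat> tag."""
    root = ET.fromstring(text.strip())  # root is the outer <Cats> element
    cats = []
    for cat in root.findall("Cat"):  # each <Cat> nested directly inside <Cats>
        # findtext returns the text inside the named child tag
        cats.append((cat.findtext("firstName"), cat.findtext("lastName")))
    return cats


for first, last in cats_from_xml(xml_text):
    print(first, last)
```

Walking from the root to each `<Cat>` and then to its children mirrors the tree-like structure described above.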
JSON, or JavaScript Object Notation, also uses a nested structure, but with the addition of key/value pairs, like the `"firstName"` key, which is tied to the `Smally` value (at least for the first cat!). JSON is popular with web applications that save and send data between your browser and web servers, because it builds on JavaScript, the main language of web browsers, to work with data.
{
"Cats": [
{
"firstName": "Smally",
"lastName": "McTiny"
},
{
"firstName": "Kitty",
"lastName": "Kitty"
},
{
"firstName": "Foots",
"lastName":"Smith"
},
{
"firstName": "Tiger",
"lastName":"Jaws"
}
]
}
This file is viewed in my Firefox browser from my local directory. To view it in your browser, you can drag and drop the local file onto an open tab or window. You can also download the JSON file and try opening it in other viewers (e.g. RStudio, or web viewers like Code Beautify's JSON Viewer). To save the file to your local computer, right-click the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
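As a small illustration of key/value pairs, the sketch below loads the same JSON with Python's standard `json` module and looks values up by key. The data is embedded as a string so the example stands alone, and the function name `cats_from_json` is an assumption made for illustration.

```python
import json

# The cats data from above, embedded as a string so the sketch is self-contained.
json_text = """
{
  "Cats": [
    {"firstName": "Smally", "lastName": "McTiny"},
    {"firstName": "Kitty", "lastName": "Kitty"},
    {"firstName": "Foots", "lastName": "Smith"},
    {"firstName": "Tiger", "lastName": "Jaws"}
  ]
}
"""


def cats_from_json(text):
    """Return a list of (first name, last name) tuples from the "Cats" array."""
    data = json.loads(text)  # the outer object becomes a Python dict
    # data["Cats"] is a list; each item is a dict of key/value pairs
    return [(cat["firstName"], cat["lastName"]) for cat in data["Cats"]]


first_cat = cats_from_json(json_text)[0]
print(first_cat)
```

Here the `"firstName"` key of the first item in the list gives back `Smally`, as described above.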
CSV, or Comma-Separated Values, uses (you guessed it!) commas to separate values. The first line (First Name, Last Name) names the columns; each line after it is a new "record," and each column (separated by a comma) is a new "field." This data format stores tabular data in a clean way that facilitates transfer between different data architectures. As data formats go, it is very rudimentary (even predating personal computers!) and is easy to type, without needing special characters beyond a comma.
First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
This file is viewed in my VSCode with the `Excel Viewer` extension. To view it in VSCode, install the extension, open the `.csv` file, then right-click the file and click `Open Preview`. You can also download the CSV file to open it in other viewers (e.g. Microsoft Excel, Notepad). To save the file to your local computer, right-click the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
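Records and fields can be seen directly by reading the same CSV with Python's standard `csv` module. This is a minimal sketch: the data is embedded as a string so the example stands alone, and the function name `cats_from_csv` is an illustrative assumption.

```python
import csv
import io

# The cats data from above, embedded as a string so the sketch is self-contained.
csv_text = """First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
"""


def cats_from_csv(text):
    """Return a list of (first name, last name) tuples, one per record."""
    # DictReader treats the first line as the header and yields one dict per record
    reader = csv.DictReader(io.StringIO(text))
    return [(row["First Name"], row["Last Name"]) for row in reader]


for first, last in cats_from_csv(csv_text):
    print(first, last)
```

Each line after the header becomes one record, and each comma-separated value is addressed by its field name.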
A small detour to discuss data formats. Open data formats are usually available to anyone free of charge and allow for easy reusability. Proprietary formats often hold copyrights or patents, or have other restrictions placed on them, and depend on (expensive) licensed software. If the licensed software ceases to support its proprietary format, or the format becomes obsolete, you may be stuck with a file that cannot be easily opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:
- Open this file in a text editor (e.g. Visual Studio Code, TextEdit (macOS), Notepad (Windows)), and then in an app like Excel. This is a CSV, an open, text-only file format. To save the file to your local computer, right-click `cats.csv` and click `Save Link As` (it's the same cats.csv from above!).
- Now do the same with this Excel file. Unlike the previous one, this is a proprietary format!
Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.
| Types of multimedia | Examples | Common file extensions |
|---|---|---|
| Images | TIFF (Tagged Image File Format) | `.tiff`, `.tif` |
| | JPEG2000 | `.jp2`, `.jpf`, `.jpx` |
| | PNG (Portable Network Graphics) | `.png` |
| Text | ASCII (American Standard Code for Information Interchange) | `.ascii`, `.dat`, `.txt` |
| | PDF (Portable Document Format) | `.pdf` |
| | CSV (Comma-Separated Values) | `.csv` |
| Audio | FLAC (Free Lossless Audio Codec) | `.flac` |
| | Ogg | `.ogg` |
| Video | MPEG-4 | `.mp4` |
| Others | XML (Extensible Markup Language) | `.xml` |
| | JSON (JavaScript Object Notation) | `.json` |
| | STL (STereoLithography file format, used in 3D modeling) | `.stl` |

For a list of file formats, consider the Library of Congress' list of Sustainability of Digital Formats.
Structured data can be:
- an XML list.*
- an Excel table.*
- an email chain.
- a collection of text files.
We may choose to store our data in open data formats because they:
- are sustainable.
- allow for easy reusability.
- are free-of-charge to use.
- All of the above.*
- How do you decide which formats to store your data in when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?
- Explore the `moSmall.csv` dataset. What questions might you ask with this dataset? What columns (variables) will you keep?
- If you are saving the file `moSmall.csv` in a proprietary spreadsheet application like Microsoft Excel (Windows/macOS) or Numbers (macOS), you may be prompted to save the file as `.xlsx` or `.numbers`. What format would you choose to save it in? Why would you choose to do so?
- I usually go with the conventions of the field, as this allows me to share my "in progress" work easily with my research lab and collaborators. The file conventions can range from `.csv` to `.json`.
- I will keep columns (variables) relevant to my question, such as the `Artist Gender`, `Is Public Domain`, and `Rights and Reproduction` columns. I will also keep some of the descriptive columns, such as `Object ID` and `Artist Role`, to help contextualize the results (e.g. what kinds of roles do female artists tend to take on?).
- I will choose to keep it in a `.csv` file, as it can be opened by more programs, and if Microsoft stops supporting `.xlsx` files I may no longer be able to open the dataset. Or, I will choose to switch to the `.xlsx` format, as it is easier to use in a graphical user interface like Microsoft Excel. Any stylistic changes I've made to the file will remain as well, such as highlighting alternating rows for readability or bolding column headings.
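Keeping only the columns relevant to a question, as in the second answer above, can be sketched with Python's standard `csv` module. The column names `Object ID`, `Artist Role`, `Artist Gender`, and `Is Public Domain` come from the answer; the sample rows, the extra `Medium` column, and the function name `keep_columns` are invented for illustration and are not taken from `moSmall.csv`.

```python
import csv
import io

# Hypothetical sample: the values and the "Medium" column are invented
# for illustration and do not come from the real dataset.
sample = """Object ID,Artist Role,Artist Gender,Is Public Domain,Medium
1,Maker,Female,True,Silver
2,Designer,,False,Glass
"""


def keep_columns(text, columns):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    # extrasaction="ignore" silently drops any column not listed in fieldnames
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


print(keep_columns(sample, ["Object ID", "Artist Gender"]))
```

Because the result is still plain CSV text, it stays in an open format and can be written to a new file without touching the original data.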
Do you remember the glossary terms from this section?