In this workshop we will be discussing the basics of research data in terms of material, transformation, and presentation. We will also be discussing the ethical issues that arise in data collection, cleaning, and representation. Because everyone has a different approach and understanding to data and ethics, this workshop will also include multiple sites for discussions to help us think through what data literacies mean within our projects and broader applications.
The quotes below offer a variety of perspectives on research data from different stakeholders. We include these different approaches to suggest that there is no singular, definitive definition of research data; what counts depends on multiple factors, including your project considerations.
Material or information on which an argument, theory, test or hypothesis, or another research output is based.
— Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3.
What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.
Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.
The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions
Broadly, research data can be understood as the materials or information necessary to come to your conclusion, but what these materials and information are depends on your project.
There are many ways to represent data, just as there are many sources of data. What can you/do you count as data? Here's a small list of possibilities:
- Non-digital text (lab books, field notebooks)
- Digital texts or digital copies of text
- Statistical analysis (SPSS, SAS, R)
- Scientific sample collections
- Data visualizations
- Computer code
- Standard operating procedures and protocols
- Protein or genetic sequences
- Artistic products
- Curriculum materials (e.g. course syllabi)
- Spreadsheets (e.g. `.xlsx`, `.numbers`, `.csv`)
- Audio (e.g. `.mp3`, `.wav`, `.aac`)
- Video (e.g. `.mov`, `.mp4`)
- Computer Aided Design/CAD (`.cad`)
- Databases (e.g. `.sql`)
- Geographic Information Systems (GIS) and spatial data (e.g. `.shp`, `.dbf`, `.shx`)
- Digital copies of images (e.g. `.png`, `.jpeg`, `.tiff`)
- Web files (e.g. `.html`, `.asp`, `.php`)
- Matlab files & 3D Models (e.g. `.stl`, `.dae`, `.3ds`)
- Metadata & Paradata (e.g. `.xml`, `.json`)
- Collection of digital objects acquired and generated during research
Adapted from: Georgia Tech
Research data can be defined as:
- materials or information necessary to come to my conclusion.*
- the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.*
- method of collection and analysis.
- objective and error-free.
These are some (most!) of the shapes your research data might transform into.
- What are some forms of data you use in your work?
- What about forms of data that you produce as your output? Perhaps there are some forms that are typical of your field.
- Where do you usually get your data from?
- As I am currently exploring discourses on various social media ecosystems, I tend to extract/scrape data that comes through as JSON files, a text-file type that is often used to structure large data sets. Sometimes the data also comes in other forms, such as CSV or XLS files.
- Oftentimes my outputs are statistical analyses and various data visualizations. This is also pretty common in my field of psychology.
- I can get them from large databases like pushshift.io or scrape certain social media outlets directly, such as Twitter.
We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations. Stages of data typically consist of a) collection of "raw" data, b) processing and/or transforming data, c) cleaning, d) analysis, and e) visualization. For example, we can consider the stages in the following way:
- We start with formulating a research question(s) or hypotheses and set up a project to answer our question(s).
- E.g. What proportion of the artwork collected and/or hosted in the Met are by non cis-gender men artists and also in public domain?
- In the process of setting up the project, we make decisions on what kind of data we think can help us to answer the question
- E.g. I think I can get the data from the Met's open access data set. I will need to look at what variables exist in the dataset to find out if I can filter by gender and the variables that will correspond to copyrights.
- After collecting our data we then consider and make decisions in the processes of cleaning.
- E.g. I have to transform some of the gender values and decide what to do with the missing fields.
- We then run our preliminary analysis of the data.
- E.g. I can run an analysis of the subset of non cis-gender men and public domain media objects against the total number of media objects to find out the proportion.
- At the end of our analysis, a decision is then made about how we would present the data and its analysis.
- E.g. I can present the result in a pie chart.
This is one cycle in which data goes from collection to transformation to visualization. It is also not the only way to go through the stages. For example, we could do a preliminary analysis first, such as running a correlation of variables, to explore what is missing before we begin the process of cleaning. Often, we also end up doing multiple iterations of cleaning and analysis, making decisions and choices to collapse particular variables or remove them entirely at each iteration. Keeping clear documentation of our process makes us accountable to the data we have collected or are using, and helps ensure that our results can be replicated and reproduced if others choose to work on our "raw" data.
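As a rough illustration of one pass through these stages, here is a minimal Python sketch. The rows are made up for this example (they are not real museum records), the `YES`/`NO` coding is an assumption, and counting only `Female` entries is a simplification of the research question above.

```python
import csv
import io

# Hypothetical stand-in for a few rows of a museum dataset (not real records).
RAW = """Object ID,Artist Gender,Is Public Domain
1,Female,YES
2,,YES
3,Female,NO
4,,NO
5,Female,YES
"""

def load_rows(text):
    """Collection: read the "raw" CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: turn the text fields into simple True/False flags."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "id": row["Object ID"],
            "is_female": row["Artist Gender"].strip() == "Female",
            "public_domain": row["Is Public Domain"].strip().upper() == "YES",
        })
    return cleaned

def proportion(rows):
    """Analysis: share of objects by a female artist that are also public domain."""
    matches = [r for r in rows if r["is_female"] and r["public_domain"]]
    return len(matches) / len(rows)

rows = clean(load_rows(RAW))
share = proportion(rows)
print(f"{share:.0%} of the sample objects match")
```

Each function corresponds to one stage, so a decision made during cleaning (for example, treating an empty gender field as "not female") stays visible and documented in the code.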
Before beginning your data collection, manipulation, and transformation, a good practice is to determine your file naming conventions. How many times have you named something `XXX_FinalFINALFINAL.pdf`, or had difficulty finding the version of a file that contained all the good ideas that were edited out of the `XXX_FinalFINALFINALFINAL.pdf` version? While tools like version control with git can be helpful, we can also begin by setting up file naming conventions that help us succeed! Here's an example from Stanford that demonstrates the problems of badly named files in our projects.
For example, The Graduate Center's Data Management guide suggests that top level folders (such as your main project folder) should include your project title, a unique identifier, and the date (year) of your project (e.g. `dataliteracies_XYZ_2020`). Your sub-folders and individual files should follow a similar system, with an identifiable activity or project in the file name (e.g. a sub-folder of the project: `sections_xyz_2020`, a file in the project: `lessons_XYZ_2020.doc`).
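If you create many folders and files, a small helper can keep the names consistent. The sketch below is only one possible approach; the function name `project_name` and its cleaning rule (lowercasing the title and dropping spaces and punctuation) are assumptions for illustration, not part of the guide.

```python
import re

def project_name(title, identifier, year):
    """Join title, identifier, and year with underscores."""
    # Drop spaces and punctuation from the title so the name works on any system.
    safe_title = re.sub(r"[^A-Za-z0-9]+", "", title).lower()
    return f"{safe_title}_{identifier}_{year}"

folder = project_name("Data Literacies", "XYZ", 2020)
lesson_file = project_name("Lessons", "XYZ", 2020) + ".doc"
print(folder)
print(lesson_file)
```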
Do you remember the glossary terms from this section?
"Raw" data is yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed in the previous section.
But "raw" data is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is "raw" data. For example, I may consider the General Social Survey data to be "raw" as it will require me to filter out missing entries and collapse variables or fields before I can run my analysis. A researcher who participated in the creation of this survey may not consider the version on the site as "raw" because the "raw" version is the physical paper copies of the file. As you can see, this consideration of what is "raw" is non-definitive and is dependent on the project you are working on and the narrative you want to tell with the results.
If you are interested in further exploration and discussion of the ethics of "raw" data, please consider reading Drucker's article which has made useful distinctions between "data" (understood as given) and "capta" (taken or "captured") that also troubles the distinction between "raw" and "processed" data.
As we think about data collection, we should also consider the labor involved in the process. Many researchers rely on Amazon Mechanical Turk (sometimes also referred to as MTurk) for data collection, often paying less than minimum wage for the task. These workers are often assumed to be retired, bored, and participating in online gig work for fun or to kill time. While this may be true for some, more than half of those surveyed in a Pew Research study say that the income from this work is essential or important. Those who do view the income from this work as essential or important are also mostly from underserved communities.
In addition to being mindful of paying a fair wage to the workers on such platforms, this kind of working environment also brings some further considerations to the data that is collected. For instance, to get close to minimum wage, workers cannot afford to spend much time on each task. Thinking through these circumstances, how do you think they affect the data we collect?
For a deeper discussion on data and labor, consider Catherine D'Ignazio and Lauren Klein's chapter Show Your Work in Data Feminism.
The stages of data are a single-iteration process, i.e. there is a fixed stage progression from data collection to visualization.
- True
- False*
Which of the following statements are true for "raw" data:
- is data that is yet to be processed.*
- is data that is received and/or collected.*
- is the same to every researcher/research team.
- can only be collected from participants.
- Do you think "big data" is "raw data"? Why or why not? Does the quantity of data play into our assumptions of "rawness"?
- How should we approach data that we have "scraped"?
- How do you collect "raw" data? What are some of your practices? What are your field's practices?
- If you have not done so, open up `moSmall.csv` from your local computer/laptop. As the original file has about 500,000 entries, we've taken a random sample of 1% of the original dataset. In this case, would you consider this file to be a "raw" dataset?
- I think big data can be raw data depending on how the data is obtained and the processes I need to take before I can apply an analysis. I think that with large datasets, I always assume "rawness" because I won't need all of the variables or there will be decisions that need to be made about missing entries.
- I think my approach to scraped data is similar to big data.
- Currently I collect through pushshift.io or scrape permissible social media sites on my own or with my collaborator (who will have appropriate authorship). I know that my field of psychology is guilty of the practices raised in the discussion on Mechanical Turk, and that it also often relies on undergraduates for experimental data collection, who have to sign up for experiments for class credit or do the labour of working in the lab for the promise of bettering their resumes for grad school applications.
- The dataset is "raw" to me as I will likely be working on removing certain variables/entries to work towards my question.
Do you remember the glossary terms from this section?
Processing data puts it into a state more readily available for analysis and makes the data legible. For instance, it could be rendered as structured data. This can also take many forms, e.g., a table. Here are a few you're likely to come across, all representing the same data:
XML, or eXtensible Markup Language, uses a nested structure, where "tags" like `<Cat>` contain other tags inside them, like `<firstName>`. This format is good for organizing the layout of a document in a tree-like format, just like HTML, where we want to nest elements like a sentence within a paragraph, for example. XML does not carry any information about how it should be displayed and can be used in a variety of presentation scenarios.
<Cats>
<Cat>
<firstName>Smally</firstName>
<lastName>McTiny</lastName>
</Cat>
<Cat>
<firstName>Kitty</firstName>
<lastName>Kitty</lastName>
</Cat>
<Cat>
<firstName>Foots</firstName>
<lastName>Smith</lastName>
</Cat>
<Cat>
<firstName>Tiger</firstName>
<lastName>Jaws</lastName>
</Cat>
</Cats>
This file is viewed on an online XML Viewer. If you would like to, you can either copy the code chunk above to try it out on XML Viewer or download the XML file to try it out in other viewers. To save the file onto your local computer, right click on the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
For example, after downloading the file, can you try to open this file in your browser? (Psst! Try right clicking on `cats.xml` in your local directory and choosing `Open with Other Application` in the drop down menu to select the browser of your choice.)
JSON, or JavaScript Object Notation, also uses a nesting structure, but with the addition of key/value pairs, like the `"firstName"` key, which is tied to the `Smally` value (at least for the first cat!). JSON is popular with web applications that save and send data from your browser to web servers, because it is based on the syntax of JavaScript, the main language of web browsers.
{
"Cats": [
{
"firstName": "Smally",
"lastName": "McTiny"
},
{
"firstName": "Kitty",
"lastName": "Kitty"
},
{
"firstName": "Foots",
"lastName":"Smith"
},
{
"firstName": "Tiger",
"lastName":"Jaws"
}
]
}
This file is viewed on my Firefox browser from my local directory. To view it in your browser, you can drag and drop the local file onto an open tab or window. You can also download the JSON file and try opening it in other viewers (e.g. R Studio, web viewers like Code Beautify's JSON Viewer). To save the file onto your local computer, right click on the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
CSV or Comma Separated Values uses—you guessed it!—commas to separate values. Each line (First Name, Last Name) is a new "record" and each column (separated by a comma) is a new "field." This data format stores tabular data in a clean way that facilitates the transfer between different data architectures. As data types go, it is very rudimentary (even predating computers!) and is easy to type, without needing special characters beyond a comma.
First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
This file is viewed on my VSCode with the extension `Excel Viewer`. To view it in VSCode, install the extension, open the `.csv`, and then right click on the file and click `Open Preview`. You can also download the CSV file to open it in other viewers (e.g. Microsoft Excel, Notepad). To save the file onto your local computer, right click on the `Raw` button (top right-hand corner of the data set) and click `Save Link As...`.
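To see that the three formats really do carry the same information, the following sketch reads each one with Python's standard library and compares the results. For brevity it uses only the first two cats, and the helper names (`from_xml`, `from_json`, `from_csv`) are made up for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

XML_TEXT = (
    "<Cats>"
    "<Cat><firstName>Smally</firstName><lastName>McTiny</lastName></Cat>"
    "<Cat><firstName>Kitty</firstName><lastName>Kitty</lastName></Cat>"
    "</Cats>"
)
JSON_TEXT = (
    '{"Cats": [{"firstName": "Smally", "lastName": "McTiny"},'
    ' {"firstName": "Kitty", "lastName": "Kitty"}]}'
)
CSV_TEXT = "First Name,Last Name\nSmally,McTiny\nKitty,Kitty\n"

def from_xml(text):
    """Walk the <Cat> tags and pull out the text of each nested tag."""
    root = ET.fromstring(text)
    return [(cat.findtext("firstName"), cat.findtext("lastName"))
            for cat in root.findall("Cat")]

def from_json(text):
    """Look up each value by its key."""
    return [(cat["firstName"], cat["lastName"])
            for cat in json.loads(text)["Cats"]]

def from_csv(text):
    """Read each line as a record, using the first line as field names."""
    return [(row["First Name"], row["Last Name"])
            for row in csv.DictReader(io.StringIO(text))]

same = from_xml(XML_TEXT) == from_json(JSON_TEXT) == from_csv(CSV_TEXT)
print(same)
```

All three readers return the same list of (first name, last name) pairs; only the way the structure is written down differs.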
A small detour to discuss data formats. Open data formats are usually available to anyone free of charge and allow for easy reusability. Proprietary formats often hold copyrights, patents, or have other restrictions placed on them, and are dependent on (expensive) licensed software. If the licensed software ceases to support its proprietary format or becomes obsolete, you may be stuck with a file format that cannot be easily opened or (re)used (e.g. .mac). For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:
- Open this file in a text editor (e.g. Visual Studio Code, TextEdit (macOS), NotePad (Windows)), and then in an app like Excel. This is a CSV, an open, text-only file format. To save the file onto your local computer, right click on `cats.csv` and click `Save Link As` to download the file (it's the same `cats.csv` from above!)
- Now do the same with this Excel file. Unlike the previous, this is a proprietary format!
Sustainable formats are generally unencrypted, uncompressed, and follow an open standard.
Types of multimedia | Examples | Common file extensions |
---|---|---|
Images | TIFF (Tagged Image File Format) | `.tiff`, `.tif` |
Images | JPEG2000 | `.jp2`, `.jpf`, `.jpx` |
Images | PNG (Portable Network Graphics) | `.png` |
Text | ASCII (American Standard Code for Information Interchange) | `.ascii`, `.dat`, `.txt` |
Text | PDF (Portable Document Format) | `.pdf` |
Text | CSV (Comma-Separated Values) | `.csv` |
Audio | FLAC (Free Lossless Audio Codec) | `.flac` |
Audio | Ogg | `.ogg` |
Video | MPEG-4 | `.mp4` |
Others | XML (Extensible Markup Language) | `.xml` |
Others | JSON (JavaScript Object Notation) | `.json` |
Others | STL (STereoLithography file format, used in 3D modeling) | `.stl` |

For a list of file formats, consider the Library of Congress' list of Sustainability of Digital Formats.
Structured data can be:
- an XML list.*
- an Excel table.*
- an email chain.
- a collection of text files.
We may choose to store our data in open data formats because they:
- are sustainable.
- allow for easy reusability.
- are free-of-charge to use.
- All of the above.*
- How do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?
- Explore the `moSmall.csv` dataset. What questions might you ask with this dataset? What columns (variables) will you keep?
- If you are saving the file `moSmall.csv` in a proprietary spreadsheet application like Microsoft Excel (Windows/macOS) or Numbers (macOS), you may be prompted to save the file as `.xlsx` or `.numbers`. What format would you choose to save it in? Why would you choose to do so?
- I usually go with the conventions of the field as it allows me to share my "in progress" work easily with my research lab and collaborators. The file conventions can range from `.csv` to `.json`.
- I will keep columns (variables) relevant to my question, such as the `Artist Gender`, `Is Public Domain`, and `Rights and Reproduction` columns. I will also keep some of the descriptive columns such as `Object ID` and `Artist Role` to help contextualize the results (e.g. what kind of roles do female artists tend to take on?)
- I will choose to keep it in a `.csv` file type as it can be opened by more programs, and if Microsoft stops supporting `.xlsx` file types I may no longer be able to open the dataset. Or I will choose to switch to an `.xlsx` format as it is easier to use in a graphical user interface like Microsoft Excel. Any stylistic changes I've made to the file will remain as well, such as alternating row highlighting for readability or bolded column headings.
Do you remember the glossary terms from this section?
There are different guidelines to the processing of data, one of which is the Tidy Data format, which follows these rules in structuring data:
- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.
Look back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data (like the `JSON` structure below). How might that complicate the concept of Tidy Data?
{
"Cats": [
{
"Calico": [
{
"firstName": "Smally",
"lastName":"McTiny"
},
{
"firstName": "Kitty",
"lastName": "Kitty"
}
],
"Tortoiseshell": [
{
"firstName": "Foots",
"lastName":"Smith"
},
{
"firstName": "Tiger",
"lastName":"Jaws"
}
]
}
]
}
While tidy data is a really popular method of structuring and organizing data, it is not the only way to do so. Depending on the type of data you have, it is also not always the best way to structure data.
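One way to bring the nested cats into a tidy shape is to turn the extra dimension into its own column, so that each cat becomes one row. The sketch below is an illustration only; the column name `coat` is an assumption, since the nested structure never names that dimension.

```python
import json

NESTED = """
{"Cats": [{"Calico": [{"firstName": "Smally", "lastName": "McTiny"},
                      {"firstName": "Kitty", "lastName": "Kitty"}],
           "Tortoiseshell": [{"firstName": "Foots", "lastName": "Smith"},
                             {"firstName": "Tiger", "lastName": "Jaws"}]}]}
"""

def flatten(data):
    """Return one row per cat, with the group name stored as a column."""
    rows = []
    for group in data["Cats"]:
        for coat, cats in group.items():
            for cat in cats:
                rows.append({
                    "coat": coat,  # hypothetical column name for the nesting level
                    "firstName": cat["firstName"],
                    "lastName": cat["lastName"],
                })
    return rows

tidy_rows = flatten(json.loads(NESTED))
for row in tidy_rows:
    print(row)
```

Every variable is now a column, every cat is a row, and every cell holds a single value, at the cost of repeating the group name on each row.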
Tidy data format only allows one value per cell.
- True*
- False
Do you think you can explain the rules of tidy data structuring?
- Looking at the `moSmall.csv` dataset, there are a couple of columns with nested information that don't follow the rules of tidy data. Can you identify at least two of the columns that demonstrate this?
- Would you convert `moSmall.csv` to follow the tidy data format? Can you demonstrate how you would do so?
- `Artist Role`, `Artist Display Name`, `Artist Display Bio`, `Artist Alpha Sort`, `Artist Nationality`, `Artist Begin Date`, `Artist End Date`, or `Classification`.
- I will choose to convert to the tidy data format if I am interested in any of the variables listed above, so that it will be easier to analyse the entries. I will have to unnest the entries by separating the data into different columns. For example, if I am interested in understanding the type of roles that are predominantly held by non-cisgender men, I will unnest the column `Artist Role` as two columns (e.g. `Artist 1 Role`, `Artist 2 Role`) as illustrated in this example:
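A minimal Python sketch of this kind of unnesting is shown here. It assumes that multiple values in a cell are separated by a `|` character, and the record is made up for illustration; check how your copy of the dataset separates values before relying on it.

```python
def unnest(cell, label, separator="|"):
    """Split one multi-valued cell into numbered columns."""
    values = [part.strip() for part in cell.split(separator)]
    return {f"Artist {i} {label}": value
            for i, value in enumerate(values, start=1)}

# Hypothetical record, not taken from the dataset.
row = {"Object ID": "1", "Artist Role": "Designer|Maker"}
row.update(unnest(row.pop("Artist Role"), "Role"))
print(row)
```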
Do you remember the glossary terms from this section?
High quality data is measured in its validity, accuracy, completeness, consistency, and uniformity.
Processed data, even in a table, is going to be full of errors:
- Empty fields
- Multiple formats, such as "yes" or "y" or "1" for a positive response.
- Suspect answers, like a date of birth of 00/11/1234
- Impossible negative numbers, like an age of "-37"
- Dubious outliers
- Duplicated rows
- And many more!
Cleaning data is the work of correcting the errors listed above, and moving towards high quality. This work can be done manually or programmatically.
Measurements must be valid, in that they must conform to set constraints:
- The aforementioned "yes" or "y" or "1" should all be changed to one response.
- Certain fields cannot be empty, or the whole observation must be thrown out.
- Uniqueness, for instance no two people should have the same social security number.
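These constraints can also be checked programmatically. The sketch below is one possible approach with made-up survey records: it maps the different spellings of an answer onto one value, drops observations whose required field is empty or unrecognised, and keeps only the first record for each identifier. The field names `id` and `answer` are assumptions.

```python
POSITIVE = {"yes", "y", "1"}
NEGATIVE = {"no", "n", "0"}

def normalize_response(value):
    """Map the many spellings of an answer onto one value."""
    text = value.strip().lower()
    if text in POSITIVE:
        return "yes"
    if text in NEGATIVE:
        return "no"
    return None  # empty or unrecognised

def validate(records):
    """Keep records with a usable answer and a unique id."""
    seen_ids = set()
    valid = []
    for record in records:
        answer = normalize_response(record["answer"])
        if answer is None:
            continue  # required field is empty: drop the observation
        if record["id"] in seen_ids:
            continue  # id already used: drop the duplicate
        seen_ids.add(record["id"])
        valid.append({"id": record["id"], "answer": answer})
    return valid

# Hypothetical survey records.
records = [
    {"id": "a1", "answer": "Yes"},
    {"id": "a2", "answer": "y"},
    {"id": "a2", "answer": "1"},
    {"id": "a3", "answer": ""},
    {"id": "a4", "answer": "N"},
]
valid = validate(records)
print(valid)
```

Note that silently dropping rows is itself a decision; in practice you may prefer to write the rejected rows to a separate file so the choice can be reviewed.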
Measurements must be accurate, in that they must represent the correct values. While an observation may be valid, it might at the same time be inaccurate. 123 Fake street is a valid, inaccurate street address.
Unfortunately, accuracy is mostly achieved in the observation process. To be achieved in the cleaning process, an outside trusted source would have to be cross-referenced.
Measurements must be complete, in that they must represent everything that might be known. This also is nearly impossible to achieve in the cleaning process! For instance in a survey, it would be necessary to re-interview someone whose previous answer to a question was left blank.
Measurements must be consistent, in that different observations must not contradict each other. For instance, one person cannot be represented as both dead and still alive in different observations.
Measurements must be uniform, in that the same unit of measure must be used in all relevant measurements. If one person's height is listed in meters and another in feet, one measurement must be converted.
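For the height example, a conversion step might look like the following sketch. The unit labels `m` and `ft` and the sample values are assumptions for illustration.

```python
FEET_TO_METERS = 0.3048  # one international foot is defined as exactly 0.3048 m

def to_meters(value, unit):
    """Convert a height to meters so every measurement uses the same unit."""
    if unit == "m":
        return value
    if unit == "ft":
        return value * FEET_TO_METERS
    raise ValueError(f"unknown unit: {unit}")

# Hypothetical measurements recorded in mixed units.
heights = [(1.70, "m"), (5.5, "ft")]
uniform = [round(to_meters(value, unit), 2) for value, unit in heights]
print(uniform)
```

Raising an error for an unknown unit, rather than guessing, keeps an unexpected label from slipping into the cleaned data unnoticed.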
Measurements are accurate when
- observations do not contradict each other.
- they represent the correct values.*
- when they are unique responses (e.g. no duplication).
- when the same unit of measure is used in all relevant measurements.
- How do we know when our data is cleaned enough?
- What happens to the data that is removed?
- Explore the `moSmall.csv` dataset.
  - Are all the measurements valid? Try checking the `Object ID` column for duplicates.
  - How might you check if the `Is Public Domain` column accurately represents the copyrights of the media objects?
  - Is the data collected complete? How might you deal with the NA or empty fields?
  - What assumptions do you have to make when you clean NA or empty fields?
  - Is the collected data consistent? Does the column `Is Public Domain` correspond with the data in `Rights and Reproduction`? If it does not, which would you follow? Why?
  - As the dataset is not one that we personally collected, how do we make sense of the fact that only `Female` or `|` is collected as responses in the `Artist Gender` column (with the exception of NA and empty fields)? What do we have to do to the data to make sure it is uniform? What decisions do we make in this process?
- I think this is often decided before the cleaning process begins, perhaps after some quick visualization or analysis of the "raw" data. I generally remove empty entries from my data sets. Working with social media data, I also usually remove URLs as these influence the topic modelling algorithms (e.g. "http" may end up being the most prominent topic of the corpus). This is usually where I stop cleaning. Some might suggest the removal of stop words like "the," "a," and "an," but I have always felt very uncertain about the removal of these words. This is especially because the dictionary of stop words was generated from canonical Western texts that are not representative of the many variations of English. For example, if I were looking at the tweets of Singaporean youths, the stop word dictionary may not be appropriate.
- For me, the data is often destroyed (usually because IRB desires it) or it remains in the original "raw" file. The file that I clean will always be a duplicate file to allow for recovery in case I made a poor decision in the process of cleaning.
- Exploring the dataset, here are my responses to the questions:
  - Checking `Object ID` indicates that there are no duplicates in the dataset. Every entry is unique.
  - I will have to compare it to another trusted source like a database from The Getty Research Institute.
  - The data collected is not complete. There are missing fields. Depending on where the missing field is, I may choose to code it as `0` for the ease of analysis. For example, the column `Dynasty` only contains 1 meaningful entry within this sample data set; as such, I will not run any analysis that may rely on this column and will choose to drop it. The column `Accession Year` only has 1 NA and I will choose to drop that row if this becomes a useful variable for my analysis.
  - While the `Rights and Reproduction` column contains a lot of NA and inappropriate responses (e.g. "Ceramics"), for the most part, for the items labeled as `YES` in the column `Is Public Domain`, the corresponding entry in `Rights and Reproduction` does not record a copyright holder. I am assuming that the NA can stand in for the object being in the public domain.
  - Taking only `Female` as a valid gender response, everything else will be converted to a `0` for ease of analysis. I am assuming `|` is equivalent to an NA or an empty field rather than an alternative gender. Hence in my analysis, the proportion will only record female artists' objects against the rest of the collected items. I cannot necessarily answer the larger question of all non-cisgender men against the total in this case.
Analysis can take many forms (just like the rest of this stuff!), but many techniques fall within a couple of categories:
Techniques geared towards summarizing a data set, such as:
- Mean
- Median
- Mode
- Average
- Standard deviation
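Python's built-in `statistics` module covers these summaries. The ages below are made up for illustration; note that "average" in everyday use usually refers to the mean.

```python
import statistics

ages = [23, 25, 25, 31, 46]  # hypothetical sample

mean = statistics.mean(ages)      # sum of the values divided by how many there are
median = statistics.median(ages)  # middle value once the data is sorted
mode = statistics.mode(ages)      # most frequent value
stdev = statistics.stdev(ages)    # sample standard deviation (spread around the mean)

print(mean, median, mode, round(stdev, 2))
```

Here the mean (30) sits above the median (25) because the single large value, 46, pulls it upward, which is one reason to report more than one summary.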
Techniques geared towards testing a hypothesis about a population, based on your data set, such as:
- Extrapolation
- P-Value calculation
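As one small, hedged illustration of a p-value calculation, the sketch below uses an exact one-sided binomial test: if an outcome had a 50% chance of occurring each time, how likely is it to see 8 or more occurrences in 10 trials? The numbers and the function name `binomial_p_value` are made up for this example, and this is only one of many possible tests.

```python
from math import comb

def binomial_p_value(successes, trials, chance=0.5):
    """Probability of at least `successes` hits in `trials` tries by chance alone."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

p = binomial_p_value(8, 10)
print(round(p, 4))
```

A small p-value suggests the result would be unusual under the chance-only assumption; it does not by itself tell us why the result occurred.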
Techniques geared towards understanding a phenomenon, rather than predicting and testing hypotheses, such as:
- Grounded Theory/Computational Grounded Theory
- Content Analysis
- Text Analysis
As we have discussed thus far, data are not neutral or objective. They are guided by and produced through our interests and assumptions, often shaped by our socio-political contexts. Hence, we must also understand that the forms of analysis we apply to our data further shape how we are choosing to tell the story. We are crafting a narrative through each of the stages of data that helps us communicate our projects to a wider audience. This is not to say that our analyses are not "empirical" or "scientific" but a suggestion to make transparent the theoretical foundations and perspectives that are guiding our interpretations. For a more nuanced perspective, consider The Numbers Don't Speak for Themselves in Data Feminism.
Descriptive analysis helps us summarize a data set.
- True*
- False
- As we consider the types of analysis that we choose to apply onto our data set, what are we representing and leaving out?
- How do we guide our decisions of interpretation with our choices of analyses?
- Are we comfortable with the intended use of our research? Are we comfortable with the unintended use of our research? What are potential misuses of our outputs?
- What can happen when we are trying to just go for the next big thing (tool/methods/algorithms) or just ran out of time and/or budget for our project?
- I may choose to leave out data that are perceived to be outliers, especially if they differ too much from the "normal" curve. I end up representing only those who fall within the "normal" curve, which may not actually be an equitable representation.
- The interpretation of the results should align itself with the type of analyses that I ran. In addition, it should be guided in some capacity by previous work in this area to inform my understanding.
- A potential misuse that I am always concerned with is the weaponization of marginalized participants' words and thoughts. I think I remain somewhat uncomfortable with the unintended use of my research because I don't think I can ever consider every circumstance in which the analysis can be misused or misquoted. When I was working on an oral history project, I set up some layers of boundaries to prevent too easy an access to audio files as an attempt at negotiating access and protection of my narrators.
- In chasing the next big thing, the original intentions for beginning the project might be lost. For me, making sure that my work is meaningful to my communities is important and the excitement of exploring a new tool can sometimes distract me from this intention. Running out of time and/or budget can also mean that the project may end abruptly, and relationships built could be strained in a haphazard wrap up. This brings me back to spending a significant amount of time on project planning before the project begins, to reduce the chances of this happening.
Do you remember the glossary terms from this section?
Visualizing your data helps you tell a story and construct a narrative that guides your audience in understanding your interpretation of a collected, cleaned, and analyzed dataset. Depending on the type of analysis you ran, different kinds of visualization can be more effective than others. In the table below are some examples of data visualization that can help you convey the message of your data.
Types of Analysis | Types of Visualization | When to Use | Example of Visualization |
---|---|---|---|
Comparisons | Bar charts | Comparison across distinct categories | From The Data for Public Good at the Graduate Center. |
Comparisons | Histograms | Comparison across continuous variable | From Policy Viz. |
Comparisons | Scatter plots | Useful to check for correlation (not causation!) | From FiveThirtyEight. |
Time | Stacked area charts | Evolution of value across different groups | From From Data to Viz. |
Time | Sankey Diagrams | Displaying flows of changes | From From Data to Viz. |
Time | Line graphs | Tracking changes over time | From The Data for Public Good at the Graduate Center. |
Small numbers/percentages | Pie charts | Demonstrate proportions between categories | From The Library of Congress. |
Small numbers/percentages | Tree maps | Demonstrate hierarchy and proportion | From The Data Visualization Catalogue. |
Survey responses | Stacked bar charts | Compares total amount across each group (e.g. plotting Likert scale) | From The Library of Congress. |
Survey responses | Nested area graphs | Visualize branching/nested questions | From Evergreen Data. |
Place | Choropleth maps | Visualize values over a geographic area to demonstrate pattern | From The Library of Congress. |
Place | Hex(bin) or Tile maps | Similar to Choropleth with the hexbin/tile representing regions equally rather than by geographic size | From R Graph Gallery. |

Adapted from Stephanie D. Evergreen (2019) Effective data visualization: The right chart for the right data, The Data Visualization Catalogue, and From Data to Viz.
This table is a teaser for the many possibilities of what data visualization can be. Creating a visual for your data is an art form and you can sometimes find yourself spending a significant amount of time looking for the best ways to visualize your data.
An example of effective data visualization can be seen in W.E.B. Du Bois's data portraits at the Paris Exposition in 1900, shown as part of the Exhibit of American Negroes. Using engaging hand-drawn visualizations, he tells the narrative of what it meant to be Black in post-Emancipation America, translating sociological research and census data to reach beyond the academy. Head here to read more about Du Bois's project.
As we transform our results into visuals, we are also trying to tell a narrative about the data we collected. Data visualization can help us decode information and share it quickly and simply.
- What are we assuming when we choose to visually represent data in particular ways?
- As you may have realized, many of the visualization examples work with quantitative data. How do you think we can visualize qualitative data? (e.g. Word Clouds, Heat Maps)
- How can data visualization mislead us? (e.g. Nathan Yau discusses how data visualization can lie)
- How can data visualization help us tell a story? (e.g. Data Feminism's "On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints")
- Can you try to plot the `moSmall.csv` dataset based on the `Artist Gender` variable? What would you have to do before you can plot this graph? How might you explain what your visualization represents?
- An underlying assumption we make is that the conventions of reading top-down and left-to-right are universal, or at least universal enough for most folx to understand. This neglects potential right-to-left readers. Conventions that use color to represent good and bad (e.g. green as good and red as bad) also assume that the distinction works for everyone, which excludes those with visual impairments who cannot decipher the data in a similar fashion.
- Exploring Voyant-Tools can be a good place to start to see what visualization of qualitative data can look like.
- Exaggerated differences through the choice of scales on the x- and y-axes can mislead a casual viewer into thinking that the data represents a larger difference than it actually reports.
- Data visualization can help us convey dense information quickly. The casual viewer can glance at the visualization and understand what we are trying to communicate with our data. Data visualization can also be an affective device, as in Du Bois's examples, which help convey the urgency of the narrative/story.
- The difficulty of representing this dataset is that, at first glance, there's an assumption that gender is binary, given that only 2 bars are representing the dataset. Even though the other bar is labeled `Unknown` to suggest that this is not a comprehensive breakdown, I'm not sure how effective it is.
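For the plotting challenge above, one possible preparation step is to tally how many records fall under each `Artist Gender` value before drawing any bars. The sketch below is only an illustration under stated assumptions: the rows in `SAMPLE_CSV` are invented stand-ins rather than records from `moSmall.csv`, the helper names `count_by_gender` and `text_bar_chart` are hypothetical, and the choice to group blank cells under `Unknown` and to normalize capitalization is an assumption about how the column might need cleaning. It uses only the Python standard library and prints a text bar chart instead of a graphical one.

```python
import csv
import io
from collections import Counter

# Invented stand-in rows (an assumption for illustration only);
# the real moSmall.csv has its own columns and records.
SAMPLE_CSV = """Title,Artist Gender
Untitled A,Male
Untitled B,Female
Untitled C,
Untitled D,Male
Untitled E,male
Untitled F,Female
Untitled G,Male
"""


def count_by_gender(csv_file, column="Artist Gender"):
    """Tally the values in `column`, grouping blank cells under 'Unknown'."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        # Normalize capitalization so "male" and "Male" land in one bar.
        counts[value.title() if value else "Unknown"] += 1
    return counts


def text_bar_chart(counts):
    """Render one line per category, with one '#' per record."""
    width = max(len(label) for label in counts)
    return "\n".join(
        f"{label.ljust(width)} | {'#' * n} ({n})"
        for label, n in counts.most_common()
    )


counts = count_by_gender(io.StringIO(SAMPLE_CSV))
print(text_bar_chart(counts))
```

To try this on the workshop file, you could pass `open("moSmall.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object, assuming the file sits in your working directory. Note that the cleaning choices here are themselves interpretive: deciding which values are merged, and what a blank cell is called, shapes the story the chart tells.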
Throughout the workshop we have been thinking together through some of the potential ethical concerns that might crop up as we proceed with our own projects. As we have discussed thus far, we hope you see that working through data and ethics is an ongoing process throughout the lifespan of your project(s) and doesn't often come with easy answers.
In this final activity, we would like for you to think about some of the potential concerns that might come up in the scenario below and think about how you might approach them.
You are interested in looking at the reactions to the Democratic Party presidential debates across time. You decided that you would use data from Twitter to analyze the responses. After collecting your data, you learned that it includes information from users who were later banned, as well as some tweets that were removed/deleted from the site.
As you work through this activity, you can definitely choose to do so with your partner! And we highly encourage you to do so! Different perspectives can offer us insights into our own gaps and help us think through our decisions. Be prepared to discuss your thoughts and ideas when we "meet" for our sessions.
- What are some reasons you might have for anonymizing (or not) your data?
- Would your approach differ if the responses were anonymized v. not?
- Would you remove the data in your initially downloaded corpus?
- How might you be aware of the differences in the corpus you downloaded v. the most current information?
- Would the number of tweets generated impact your decisions?
- How might the stage your data is at (e.g. "raw" data v. "cleaned" data v. analyzed data) affect your choices?
- If you were collecting and/or analyzing data on folx in power, such as the data from the Tweets of Congress project, would that change the way you consider your answers to the previous questions?
- Under its current ethical guidelines, SAFE Lab at Columbia University alters the text of social media posts to render them unsearchable. Why and when would you consider (or not) altering the collected tweets for publication?
Data and ethics are contextually driven. As such, there isn't always a risk-free approach. We often have to work through ethical dilemmas while thinking through information that we may not have (what are the risks of doing/not doing this work?). We have approached a moment where the question is no longer what we could do but what we should do. Given the data-saturated world we currently live in, there is value in pausing and considering why and what we are collecting, researching, analyzing, and understanding. Starting a new project, especially one dealing with "big" data, can be exciting, but we now also have to first consider whom the collected data benefits and why it is important. The IRB (Institutional Review Board)'s regulations may form the starting point of our considerations but should not be the ending point of how we consider contextually-driven ethics and data projects.
In addition, open access is not always the answer to concerns of reproducibility and/or ethical considerations. There are moments where the decision to not have a dataset or analysis openly accessible is valid. For example, when you are working with marginalized or vulnerable populations, concerns about causing more harm justify restricting access. We may choose to control who has access to decrease the chances of misrepresentation (intentional or otherwise) or of having results taken out of context.
For a set of great questions to help you think through your data exploration and project planning, please check out Kristen Hackett's Tagging the Tower post, What to Consider when Planning a Digital Project.