
small formatting fixes
fmichonneau committed Jul 20, 2020
1 parent 42a7cee commit 7332391
Showing 4 changed files with 62 additions and 49 deletions.
10 changes: 8 additions & 2 deletions _episodes/01-format-data.md
@@ -7,7 +7,7 @@
questions:
objectives:
- "Describe best practices for data entry and formatting in spreadsheets."
- "Apply best practices to arrange variables and observations in a spreadsheet."

keypoints:
- Use one column for one variable
- Use one row for one observation
@@ -51,10 +51,13 @@
Unorganized data can make it harder to work with your data,
so you should be mindful of your data organization when doing your data entry.
You'll want to organize your data in a way that allows other programs and people to easily understand and use the data.

> ## Callout
>
> **Note:** the best layouts/formats (as well as software and
> interfaces) for **data entry** and **data analysis** might be
> different. It is important to take this into account, and ideally
> automate the conversion from one to another.
{: .callout}
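The conversion between an entry layout and an analysis layout can often be automated with a short script. Below is a minimal sketch in Python (standard library only) that reshapes a hypothetical wide entry table, with one column per year, into a tidy one-row-per-observation layout; the table and its column names are invented for illustration.

```python
import csv
import io

# Hypothetical wide-format entry sheet: one column per year.
entry_csv = """site,2014,2015
A,10,12
B,7,9
"""

# Reshape to a tidy layout: one row per (site, year) observation.
tidy = [
    {"site": row["site"], "year": year, "count": row[year]}
    for row in csv.DictReader(io.StringIO(entry_csv))
    for year in ("2014", "2015")
]

for obs in tidy:
    print(obs["site"], obs["year"], obs["count"])
```

The tidy version is what most analysis tools expect, while the wide version is often easier to type into during data entry.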

### Keeping track of your analyses

@@ -140,7 +143,7 @@
with this data and how you fixed it.
{: .challenge}


> ## Important
>
> Do not forget our first piece of advice:
> **create a new file** for the cleaned data, and **never
@@ -150,7 +153,10 @@

An excellent reference, in particular with regard to R scripting, is

> ## Resource
>
> Hadley Wickham, *Tidy Data*, Journal of Statistical Software, Vol. 59,
> Issue 10, Sep 2014. [http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10).
{: .callout}

<!-- *Instructors see notes in 'instructors_notes.md' on this exercise.* -->
30 changes: 16 additions & 14 deletions _episodes/03-dates-as-data.md
@@ -38,11 +38,11 @@
In particular, please remember that functions that are valid for a given
spreadsheet program (be it LibreOffice, Microsoft Excel, OpenOffice.org,
Gnumeric, etc.) are usually guaranteed to be compatible only within the same
family of products. If you will later need to export the data and need to
conserve the timestamps, you are better off handling them using one of the solutions discussed below.


> ## Exercise
>
> Challenge: pulling month, day and year out of dates
>
> - In the `Dates` tab of your Excel file we summarized training data from 2015. There's a `date` column.
@@ -56,12 +56,10 @@
>
> (Make sure the new column is formatted as a number and not as a date. Change the function to correspond to each row, i.e., `=MONTH(A3)`, `=DAY(A3)`, `=YEAR(A3)` for the next row.)
>
> > ## Solution
> > You can see that even though you wanted the year to be 2015 for all entries, your spreadsheet program interpreted two entries as 2017, the year the data was entered, not the year of the workshop.
> > ![dates, exercise 1](../fig/3_Dates_as_Columns.png)
> > {: .output}
> {: .solution}
{: .challenge}
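The same year cross-check can be done outside the spreadsheet. As a sketch (assuming ISO-formatted date strings exported from the sheet), Python's standard `datetime` module makes stray years easy to spot:

```python
from datetime import datetime

# Hypothetical exported date strings; one was mis-entered as 2017.
dates = ["2015-07-01", "2015-07-08", "2017-07-15"]

parsed = [datetime.strptime(d, "%Y-%m-%d") for d in dates]
years = [d.year for d in parsed]
print(years)  # an unexpected year stands out immediately
```

Pulling the components out programmatically avoids retyping `=YEAR(...)` formulas row by row.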

@@ -76,7 +74,7 @@
If you’re working with historic data, be extremely careful with your dates!

Excel also supports a second date system, the 1904 date system, which is the default in Excel for Macintosh. This system will assign a
different serial number than the [1900 date system](https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). Because of this,
[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
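The two epochs sit a fixed 1,462 days (roughly four years) apart, which is why a date-system mismatch shifts every date by about four years. A minimal sketch of the offset, assuming the 1899-12-30 effective epoch that matches 1900-system serials of 61 and above:

```python
from datetime import datetime, timedelta

serial = 41822  # an arbitrary Excel serial number

# 1900 system: effective epoch 1899-12-30 for serials >= 61, because of
# the leap-year bug inherited from Lotus 1-2-3. 1904 system: epoch 1904-01-01.
d1900 = datetime(1899, 12, 30) + timedelta(days=serial)
d1904 = datetime(1904, 1, 1) + timedelta(days=serial)

print(d1900.date())          # 2014-07-02
print(d1904.date())          # 2018-07-03
print((d1904 - d1900).days)  # 1462 days, i.e. ~4 years
```

The same serial number therefore names two calendar dates about four years apart, depending on which system the workbook uses.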


## Data formats in spreadsheets
@@ -98,11 +96,15 @@
the above functions we can easily add days, months or years to a given date.
Say you had a sampling plan where you needed to sample every thirty seven days.
In another cell, you could type:

~~~
=B2+37
~~~

And it would return

~~~
8-Aug
~~~

because it understands the date as a number `41822`, and `41822 + 37 = 41859`
which Excel interprets as August 8, 2014. It retains the format (for the most
@@ -124,15 +126,15 @@
the quantities to the correct entities.
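The serial-number arithmetic Excel performs can be reproduced outside the spreadsheet. A minimal sketch, assuming the Windows 1900 date system:

```python
from datetime import datetime, timedelta

# For serials >= 61 the 1900 system behaves as if day 0 were 1899-12-30
# (a side effect of the leap-year bug inherited from Lotus 1-2-3).
EPOCH = datetime(1899, 12, 30)

def serial_to_date(serial):
    """Convert an Excel 1900-system serial number to a calendar date."""
    return (EPOCH + timedelta(days=serial)).date()

print(serial_to_date(41822))       # 2014-07-02
print(serial_to_date(41822 + 37))  # 2014-08-08, the =B2+37 example
```

This is the same `41822 + 37 = 41859` computation described above, made explicit.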

This brings us to the many different ways Excel can display dates. If you refer to the figure above, you’ll see that there are many, MANY ways that ambiguity creeps into your data depending on the format you choose when you enter your data. If you’re not fully aware of which format you’re using, you can end up entering your data in a way that Excel will badly misinterpret.

> ## Exercise
> What happens to the dates in the `dates` tab of our workbook if we save this sheet in Excel (in `csv` format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the `csv` file in Excel?
> > ## Solution
> > - Click on the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017.
> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`.
> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`.
> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015.
> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again!
> {: .solution}
{: .challenge}

24 changes: 13 additions & 11 deletions _episodes/05-exporting-data.md
@@ -65,11 +65,13 @@
An important note for backwards compatibility: you can open CSVs in Excel!

## A Note on Cross-platform Operability

By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.

As such, when exporting to CSV using Excel, your data in text format will look like this:

~~~
data1,data2<CR><LF>1,2<CR><LF>4,5<CR><LF>
~~~
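You can inspect (and fix) line endings directly from a script. The sketch below writes the CRLF-terminated bytes Excel would produce to a throwaway file (the filename is arbitrary), then normalizes them to UNIX-style LF:

```python
path = "line-endings-demo.csv"

# Write the file with explicit Windows-style CRLF endings; newline=""
# stops Python from translating the line endings for us.
with open(path, "w", newline="") as f:
    f.write("data1,data2\r\n1,2\r\n4,5\r\n")

with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'data1,data2\r\n1,2\r\n4,5\r\n'

# Normalize to UNIX-style LF endings.
with open(path, "wb") as f:
    f.write(raw.replace(b"\r\n", b"\n"))

with open(path, "rb") as f:
    print(f.read())  # b'data1,data2\n1,2\n4,5\n'
```

Reading and writing in binary mode (`"rb"`/`"wb"`) is what makes the `CR` bytes visible and editable.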

When opening your CSV file in Excel again, it will parse it as follows:

@@ -79,11 +81,11 @@
However, if you open your CSV file on a different system that does not parse the

Your data in text format then look like this:

~~~
data1,data2<CR>
1,2<CR>
~~~

You will then see a weird character or possibly the string `CR` or `\r`:

@@ -100,15 +102,15 @@
There are a handful of solutions for enforcing uniform UNIX-style line endings o
```
[filter "cr"]
clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
smudge = tr '\\n' '\\r'
```

and then create a file `.gitattributes` that contains the line:

```
*.csv filter=cr
```

3. Use [dos2unix](http://dos2unix.sourceforge.net/) (available on OSX, *nix, and Cygwin) on local files to standardize line endings.

#### A note on Python and `xls`
47 changes: 25 additions & 22 deletions _episodes/06-data-formats-caveats.md
@@ -1,5 +1,5 @@
---
title: Caveats of popular data and file formats
teaching: 5
exercises: 0
questions:
@@ -15,38 +15,41 @@
keypoints:

## Dealing with commas as part of data values in `*.csv` files

In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Comma Separated Value files are indeed very useful, allowing data to be easily exchanged and shared.

However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software you use (including Excel) will most likely display the data incorrectly in columns, because commas that are part of the data values will be interpreted as delimiters.

Data could look like this:


~~~
date,type,len_hours,num_registered,num_attended,trainer,cancelled
29 Apr,OA,1.5,1.5,15,JM,N
3 Mar,OA,60,19,25,PG,N
3 Jul,OA,1,25,20,PG, JM ,N
4 Jan,OA,1,26,17,JM,N
29 Mar,RDM,1,27,24,JM,N
~~~

In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`).
If we try to read the above into Excel (or another spreadsheet program), we will get something like this:

![Issue with importing csv format](../fig/csv-mistake.png)

The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!).

If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this:


~~~
date,type,len_hours,num_registered,num_attended,trainer,cancelled
29 Apr,OA,1.5,1.5,15,JM,N
3 Mar,OA,60,19,25,PG,N
3 Jul,OA,1,25,20,"PG, JM",N
4 Jan,OA,1,26,17,JM,N
29 Mar,RDM,1,27,24,JM,N
~~~

Now opening this file as a `csv` in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters. However, if you are working with an already existing dataset in which the data values are not included in "" but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning.
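When writing CSV from a script you rarely have to add the quotes yourself. As a sketch, Python's standard `csv` module quotes any field containing the delimiter automatically, and honours those quotes when reading the data back:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# The multi-trainer value contains a comma, so csv.writer quotes it.
writer.writerow(["3 Jul", "OA", "1", "25", "20", "PG, JM", "N"])
print(buf.getvalue().strip())  # 3 Jul,OA,1,25,20,"PG, JM",N

# Reading it back keeps "PG, JM" in a single column.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row[5])  # PG, JM
```

Equivalent quoting-aware readers and writers exist in R (`read.csv`/`write.csv`) and most other data tools.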

If the dataset you're dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - "") is not only going to take hours and hours but may also lead you to accidentally introduce many errors.

Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
