From 73323913c38e1110fe5c055c87d8a4c6cd961d33 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fran=C3=A7ois=20Michonneau?= Date: Mon, 20 Jul 2020 15:53:32 +0200 Subject: [PATCH] small formatting fixes --- _episodes/01-format-data.md | 10 ++++-- _episodes/03-dates-as-data.md | 30 +++++++++--------- _episodes/05-exporting-data.md | 24 +++++++------- _episodes/06-data-formats-caveats.md | 47 +++++++++++++++------------- 4 files changed, 62 insertions(+), 49 deletions(-) diff --git a/_episodes/01-format-data.md b/_episodes/01-format-data.md index 9077b03e..617c3c98 100644 --- a/_episodes/01-format-data.md +++ b/_episodes/01-format-data.md @@ -7,7 +7,7 @@ questions: objectives: - "Describe best practices for data entry and formatting in spreadsheets." - "Apply best practices to arrange variables and observations in a spreadsheet." - + keypoints: - Use one column for one variable - Use one row for one observation @@ -51,10 +51,13 @@ Unorganized data can make it harder to work with your data, so you should be mindful of your data organization when doing your data entry. You'll want to organize your data in a way that allows other programs and people to easily understand and use the data. +> ## Callout +> > **Note:** the best layouts/formats (as well as software and > interfaces) for **data entry** and **data analysis** might be > different. It is important to take this into account, and ideally > automate the conversion from one to another. +{: .callout} ### Keeping track of your analyses @@ -140,7 +143,7 @@ with this data and how you fixed it. {: .challenge} -> ## Important ## +> ## Important > > Do not forget of our first piece of advice: > **create a new file** for the cleaned data, and **never @@ -150,7 +153,10 @@ with this data and how you fixed it. An excellent reference, in particular with regard to R scripting is +> ## Resource +> > Hadley Wickham, *Tidy Data*, Vol. 59, Issue 10, Sep 2014, Journal of > Statistical Software. 
[http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10). +{: .callout} diff --git a/_episodes/03-dates-as-data.md b/_episodes/03-dates-as-data.md index 1d249b08..ab2ae75a 100644 --- a/_episodes/03-dates-as-data.md +++ b/_episodes/03-dates-as-data.md @@ -38,11 +38,11 @@ In particular, please remember that functions that are valid for a given spreadsheet program (be it LibreOffice, Microsoft Excel, OpenOffice.org, Gnumeric, etc.) are usually guaranteed to be compatible only within the same family of products. If you will later need to export the data and need to -conserve the timestamps you are better off handling them using one of the solutions discussed below. +conserve the timestamps you are better off handling them using one of the solutions discussed below. > ## Exercise -> +> > Challenge: pulling month, day and year out of dates > > - In the `Dates` tab of your Excel file we summarized training data from 2015. There's a `date` column. @@ -56,12 +56,10 @@ conserve the timestamps you are better off handling them using one of the soluti > > (Make sure the new column is formatted as a number and not as a date. Change the function to correspond to each row - i.e., =MONTH(A3), =DAY(A3), =YEAR(A3) for the next row. > - > > > ## Solution > > You can see that even though you wanted the year to be 2015 for all entries, your spreadsheet program interpreted two entries as 2017, the year the data was entered, not the year of the workshop. > > ![dates, exersize 1](../fig/3_Dates_as_Columns.png) -> > {: .output} > {: .solution} {: .challenge} @@ -76,7 +74,7 @@ If you’re working with historic data, be extremely careful with your dates! Excel also entertains a second date system, the 1904 date system, as the default in Excel for Macintosh. This system will assign a different serial number than the [1900 date system](https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). 
Because of this, -[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off). +[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off). ## Data formats in spreadsheets @@ -98,11 +96,15 @@ the above functions we can easily add days, months or years to a given date. Say you had a sampling plan where you needed to sample every thirty seven days. In another cell, you could type: - =B2+37 +~~~ +=B2+37 +~~~ And it would return - 8-Aug +~~~ +8-Aug +~~~ because it understands the date as a number `41822`, and `41822 + 37 = 41859` which Excel interprets as August 8, 2014. It retains the format (for the most @@ -124,15 +126,15 @@ the quantities to the correct entities. Which brings us to the many different ways Excel provides in how it displays dates. If you refer to the figure above, you’ll see that there are many, MANY ways that ambiguity creeps into your data depending on the format you chose when you enter your data, and if you’re not fully cognizant of which format you’re using, you can end up actually entering your data in a way that Excel will badly misinterpret. -> ## Exercise +> ## Exercise > What happens to the dates in the `dates` tab of our workbook if we save this sheet in Excel (in `csv` format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the `csv` file in Excel? > > ## Solution -> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017. -> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`. 
-> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`. -> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information. -> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015. -> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again! +> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017. +> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`. +> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`. +> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information. +> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015. +> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again! > {: .solution} {: .challenge} diff --git a/_episodes/05-exporting-data.md b/_episodes/05-exporting-data.md index 03ae324f..e20c1197 100644 --- a/_episodes/05-exporting-data.md +++ b/_episodes/05-exporting-data.md @@ -65,11 +65,13 @@ An important note for backwards compatibility: you can open CSVs in Excel! 
## A Note on Cross-platform Operability -By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.. +By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.. As such, when exporting to CSV using Excel, your data in text format will look like this: ->data1,data21,24,5 +~~~ +data1,data21,24,5 +~~~ When opening your CSV file in Excel again, it will parse it as follows: @@ -79,11 +81,11 @@ However, if you open your CSV file on a different system that does not parse the Your data in text format then look like this: ->data1
->data2
->1
->2
->… +~~~ +data1,data2 +1,2 +… +~~~ You will then see a weird character or possibly the string `CR` or `\r`: @@ -100,15 +102,15 @@ There are a handful of solutions for enforcing uniform UNIX-style line endings o ``` [filter "cr"] clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n' - smudge = tr '\\n' '\\r'` + smudge = tr '\\n' '\\r'` ``` - + and then create a file `.gitattributes` that contains the line: - + ``` *.csv filter=cr ``` - + 3. Use [dos2unix](http://dos2unix.sourceforge.net/) (available on OSX, *nix, and Cygwin) on local files to standardize line endings. #### A note on Python and `xls` diff --git a/_episodes/06-data-formats-caveats.md b/_episodes/06-data-formats-caveats.md index 58df58af..eb101467 100644 --- a/_episodes/06-data-formats-caveats.md +++ b/_episodes/06-data-formats-caveats.md @@ -1,5 +1,5 @@ --- -title: Caveats of popular data and file formats +title: Caveats of popular data and file formats teaching: 5 exercises: 0 questions: @@ -15,38 +15,41 @@ keypoints: ## Dealing with commas as part of data values in `*.csv` files ## -In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Whilst Comma Separated Value files are indeed very useful allowing for easily exchanging and sharing data. +In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Whilst Comma Separated Value files are indeed very useful allowing for easily exchanging and sharing data. However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. It is because the commas which are a part of the data values will be interpreted as a delimiter. 
Data could look like this: - - date,type,len_hours,num_registered,num_attended,trainer,cancelled - 29 Apr,OA,1.5,1.5,15,JM,N - 3 Mar,OA,60,19,25,PG,N - 3 Jul,OA,1,25,20,PG, JM ,N - 4 Jan,OA,1,26,17,JM,N - 29 Mar,RDM,1,27,24,JM,N - -In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`). + +~~~ +date,type,len_hours,num_registered,num_attended,trainer,cancelled +29 Apr,OA,1.5,1.5,15,JM,N +3 Mar,OA,60,19,25,PG,N +3 Jul,OA,1,25,20,PG, JM ,N +4 Jan,OA,1,26,17,JM,N +29 Mar,RDM,1,27,24,JM,N +~~~ + +In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`). If we try to read the above into Excel (or other spreadsheet programme), we will get something like this: ![Issue with importing csv format](../fig/csv-mistake.png) -The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!). - -If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this: +The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!). + +If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). 
Applying this rule, the data might look like this: - date,type,len_hours,num_registered,num_attended,trainer,cancelled - 29 Apr,OA,1.5,1.5,15,JM,N - 3 Mar,OA,60,19,25,PG,N - 3 Jul,OA,1,25,20,"PG, JM",N - 4 Jan,OA,1,26,17,JM,N - 29 Mar,RDM,1,27,24,JM,N - +~~~ +date,type,len_hours,num_registered,num_attended,trainer,cancelled +29 Apr,OA,1.5,1.5,15,JM,N +3 Mar,OA,60,19,25,PG,N +3 Jul,OA,1,25,20,"PG, JM",N +4 Jan,OA,1,26,17,JM,N +29 Mar,RDM,1,27,24,JM,N +~~~ Now opening this file as a `csv` in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters. However, if you are working with an already existing dataset in which the data values are not included in "" but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning. If the dataset you're dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - "") is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors. -Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts. +Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
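As an illustration of the quoting rule discussed in `06-data-formats-caveats.md`, the following is a minimal sketch (not part of the patch) using Python's standard `csv` module. It assumes the sample records from the episode are held in in-memory strings; the variable names (`unquoted`, `quoted`, `field_counts`) are made up for this example. It shows that an unquoted comma inside a value produces an extra field, while the quoted version parses into the expected number of columns.

```python
import csv
import io

# Sample data taken from the episode text: one record has a comma inside
# the 'trainer' value. The variable names here are illustrative only.
unquoted = (
    "date,type,len_hours,num_registered,num_attended,trainer,cancelled\n"
    "3 Jul,OA,1,25,20,PG, JM ,N\n"
    "4 Jan,OA,1,26,17,JM,N\n"
)
quoted = (
    "date,type,len_hours,num_registered,num_attended,trainer,cancelled\n"
    '3 Jul,OA,1,25,20,"PG, JM",N\n'
    "4 Jan,OA,1,26,17,JM,N\n"
)


def field_counts(text):
    """Return the number of fields found in each row of a CSV string."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


# Without quotes, the comma in "PG, JM" is read as a delimiter,
# so that record has one more field than the header.
unquoted_counts = field_counts(unquoted)

# With quotes, every record has the same number of fields as the header.
quoted_counts = field_counts(quoted)
quoted_rows = list(csv.reader(io.StringIO(quoted)))

# Writing the rows back out: csv.writer adds quotes around any value
# that contains the delimiter, so the comma survives a round trip.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(quoted_rows)
round_tripped = buffer.getvalue()

print(unquoted_counts)
print(quoted_counts)
print(round_tripped)
```

Comparing per-row field counts against the header is one simple way to detect records with stray delimiters before attempting any automated cleanup. Deciding how to repair such records still depends on the dataset.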