From a626d21d1e2dbd388feadf9dd97efa5e47969c37 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:08:20 +0900 Subject: [PATCH 001/334] Add files via upload --- crowdin.yml | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) create mode 100644 crowdin.yml diff --git a/crowdin.yml b/crowdin.yml new file mode 100644 index 000000000..0ba000bdd --- /dev/null +++ b/crowdin.yml @@ -0,0 +1,29 @@ +preserve_hierarchy: true + +files: + - source: /episodes/*.Rmd + dest: /episodes/%file_name%.md + type: md + translation: /locale/%two_letters_code%/episodes/%original_file_name% + - source: /episodes/*.md + translation: /locale/%two_letters_code%/episodes/%original_file_name% + - source: /instructors/*.md + translation: /locale/%two_letters_code%/instructors/%original_file_name% + - source: /learners/*.md + translation: /locale/%two_letters_code%/learners/%original_file_name% + - source: /profiles/*.md + translation: /locale/%two_letters_code%/profiles/%original_file_name% + - source: /CODE_OF_CONDUCT.md + translation: /locale/%two_letters_code%/%original_file_name% + - source: /config.yaml + translation: /locale/%two_letters_code%/%original_file_name% + - source: /CONTRIBUTING.md + translation: /locale/%two_letters_code%/%original_file_name% + - source: /LICENSE.md + translation: /locale/%two_letters_code%/%original_file_name% + - source: /README.md + translation: /locale/%two_letters_code%/%original_file_name% + - source: /index.md + translation: /locale/%two_letters_code%/%original_file_name% + - source: /links.md + translation: /locale/%two_letters_code%/%original_file_name% \ No newline at end of file From 151598769c8ac301ee4d573d18fcc85b7f4cd134 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:20 +0900 Subject: [PATCH 002/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 829 ++++++++++++++++++++ 1 file changed, 
829 insertions(+) create mode 100644 locale/fr/episodes/10-data-organisation.Rmd diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd new file mode 100644 index 000000000..d52686828 --- /dev/null +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -0,0 +1,829 @@ +--- +source: Rmd +title: Data organisation with spreadsheets +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Learn about spreadsheets, their strengths and weaknesses. +- How do we format data in spreadsheets for effective data use? +- Learn about common spreadsheet errors and how to correct them. +- Organise your data according to tidy data principles. +- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to organise tabular data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Spreadsheet programs + +**Question** + +- What are basic principles for using spreadsheets for good data + organization? + +**Objective** + +- Describe best practices for organizing data so computers can make + the best use of datasets. + +**Keypoint** + +- Good data organization is the foundation of any research project. + +Good data organization is the foundation of your research +project. Most researchers have data or do data entry in +spreadsheets. Spreadsheet programs are very useful graphical +interfaces for designing data tables and handling very basic data +quality control functions. See also @Broman:2018. + +### Spreadsheet outline + +Spreadsheets are good for data entry. Therefore we have a lot of data +in spreadsheets. Much of your time as a researcher will be spent in +this 'data wrangling' stage. It's not the most fun, but it's +necessary. 
We'll teach you how to think about data organization and +some practices for more effective data wrangling. + +### What this lesson will not teach you + +- How to do _statistics_ in a spreadsheet +- How to do _plotting_ in a spreadsheet +- How to _write code_ in spreadsheet programs + +If you're looking to do this, a good reference is Head First +Excel, +published by O'Reilly. + +### Why aren't we teaching data analysis in spreadsheets + +- Data analysis in spreadsheets usually requires a lot of manual + work. If you want to change a parameter or run an analysis with a + new dataset, you usually have to redo everything by hand. (We do + know that you can create macros, but see the next point.) + +- It is also difficult to track or reproduce statistical or plotting + analyses done in spreadsheet programs when you want to go back to + your work or someone asks for details of your analysis. + +Many spreadsheet programs are available. Since most participants +utilise Excel as their primary spreadsheet program, this lesson will +make use of Excel examples. A free spreadsheet program that can also +be used is LibreOffice. Commands may differ a bit between programs, +but the general idea is the same. + +Spreadsheet programs encompass a lot of the things we need to be able +to do as researchers. We can use them for: + +- Data entry +- Organizing data +- Subsetting and sorting data +- Statistics +- Plotting + +Spreadsheet programs use tables to represent and display data. Data +formatted as tables is also the main theme of this chapter, and we +will see how to organise data into tables in a standardised way to +ensure efficient downstream analysis. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Discuss the following points with your neighbour + +- Have you used spreadsheets, in your research, courses, + or at home? +- What kind of operations do you do in spreadsheets? +- Which ones do you think spreadsheets are good for? 
+- Have you accidentally done something in a spreadsheet program that made you + frustrated or sad? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Problems with spreadsheets + +Spreadsheets are good for data entry, but in reality we tend to +use spreadsheet programs for much more than data entry. We use them +to create data tables for publications, to generate summary +statistics, and make figures. + +Generating tables for publications in a spreadsheet is not +optimal - often, when formatting a data table for publication, we're +reporting key summary statistics in a way that is not really meant to +be read as data, and often involves special formatting +(merging cells, creating borders, making it pretty). We advise you to +do this sort of operation within your document editing software. + +The latter two applications, generating statistics and figures, should +be used with caution: because of the graphical, drag and drop nature of +spreadsheet programs, it can be very difficult, if not impossible, to +replicate your steps (much less retrace anyone else's), particularly if your +stats or figures require you to do more complex calculations. Furthermore, +in doing calculations in a spreadsheet, it's easy to accidentally apply a +slightly different formula to multiple adjacent cells. When using a +command-line based statistics program like R or SAS, it's practically +impossible to apply a calculation to one observation in your +dataset but not another unless you're doing it on purpose. + +### Using spreadsheets for data entry and cleaning + +In this lesson, we will assume that you are most likely using Excel as +your primary spreadsheet program - there are others (gnumeric, Calc +from OpenOffice), and their functionality is similar, but Excel seems +to be the program most used by biologists and biomedical researchers. + +In this lesson we're going to talk about: + +1. Formatting data tables in spreadsheets +2. Formatting problems +3. 
Exporting data + +## Formatting data tables in spreadsheets + +**Questions** + +- How do we format data in spreadsheets for effective data use? + +**Objectives** + +- Describe best practices for data entry and formatting in + spreadsheets. + +- Apply best practices to arrange variables and observations in a + spreadsheet. + +**Keypoints** + +- Never modify your raw data. Always make a copy before making any + changes. + +- Keep track of all of the steps you take to clean your data in a + plain text file. + +- Organise your data according to tidy data principles. + +The most common mistake made is treating spreadsheet programs like lab +notebooks, that is, relying on context, notes in the margin, spatial +layout of data and fields to convey information. As humans, we can +(usually) interpret these things, but computers don't view information +the same way, and unless we explain to the computer what every single +thing means (and that can be hard!), it will not be able to see how +our data fits together. + +Using the power of computers, we can manage and analyse data in much +more effective and faster ways, but to use that power, we have to set +up our data for the computer to be able to understand it (and +computers are very literal). + +This is why it's extremely important to set up well-formatted tables +from the outset - before you even start entering data from your very +first preliminary experiment. Data organization is the foundation of +your research project. It can make it easier or harder to work with +your data throughout your analysis, so it's worth thinking about when +you're doing your data entry or setting up your experiment. You can +set things up in different ways in spreadsheets, but some of these +choices can limit your ability to work with the data in other programs +or have the you-of-6-months-from-now or your collaborator work with +the data. 
+ +**Note:** the best layouts/formats (as well as software and +interfaces) for data entry and data analysis might be different. It is +important to take this into account, and ideally automate the +conversion from one to another. + +### Keeping track of your analyses + +When you're working with spreadsheets, during data clean up or +analyses, it's very easy to end up with a spreadsheet that looks very +different from the one you started with. In order to be able to +reproduce your analyses or figure out what you did when a reviewer or +instructor asks for a different analysis, you should: + +- create a new file with your cleaned or analysed data. Don't modify + the original dataset, or you will never know where you started! + +- keep track of the steps you took in your clean up or analysis. You + should track these steps as you would any step in an experiment. We + recommend that you do this in a plain text file stored in the same + folder as the data file. + +This might be an example of a spreadsheet setup: + +![](fig/spreadsheet-setup-updated.png) + +Put these principles into practice today during your exercises. + +While versioning is out of scope for this course, you can look at the +Carpentries lesson on +['Git'](https://swcarpentry.github.io/git-novice/) to learn how to +maintain **version control** over your data. See also this blog +post for a quick tutorial or +@Perez-Riverol:2016 for a more research-oriented use-case. + +### Structuring data in spreadsheets + +The cardinal rules of using spreadsheet programs for data: + +1. Put all your variables in columns - the thing you're measuring, + like 'weight' or 'temperature'. +2. Put each observation in its own row. +3. Don't combine multiple pieces of information in one cell. Sometimes + it just seems like one thing, but think if that's the only way + you'll want to be able to use or sort that data. +4. Leave the raw data raw - don't change it! +5. 
Export the cleaned data to a text-based format like CSV + (comma-separated values) format. This ensures that anyone can use + the data, and is required by most data repositories. + +For instance, we have data from patients that visited several +hospitals in Brussels, Belgium. They recorded the date of the visit, +the hospital, the patients' gender, weight and blood group. + +If we were to keep track of the data like this: + +![](fig/multiple-info.png) + +the problem is that the ABO and Rhesus groups are in the same `Blood` +type column. So, if they wanted to look at all observations of the A +group or look at weight distributions by ABO group, it would be tricky +to do this using this data setup. If instead we put the ABO and Rhesus +groups in different columns, you can see that it would be much easier. + +![](fig/single-info.png) + +An important rule when setting up a datasheet, is that **columns are +used for variables** and **rows are used for observations**: + +- columns are variables +- rows are observations +- cells are individual values + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: We're going to take a messy dataset and describe how we would clean it up. + +1. Download a messy dataset by clicking + [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). + +2. Open up the data in a spreadsheet program. + +3. You can see that there are two tabs. The data contains various + clinical variables recorded in various hospitals in Brussels during + the first and second COVID-19 waves in 2020. As you can see, the + data have been recorded differently during the March and November + waves. Now you're the person in charge of this project and you want + to be able to start analyzing the data. + +4. With the person next to you, identify what is wrong with this + spreadsheet. Also discuss the steps you would need to take to clean + up first and second wave tabs, and to put them all together in one + spreadsheet. 
+ +**Important:** Do not forget our first piece of advice: to create a +new file (or tab) for the cleaned data, never modify your original +(raw) data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +After you go through this exercise, we'll discuss as a group what was +wrong with this data and how you would fix it. + +<!-- - Take about 10 minutes to work on this exercise. --> + +<!-- - All the mistakes in the *common mistakes* section below are present --> + +<!-- in the messy dataset. If the exercise is done during a workshop, ask --> + +<!-- people what they saw as wrong with the data. As they bring up --> + +<!-- different points, you can refer to the common mistakes or expand a --> + +<!-- bit on the point they brought up. --> + +<!-- - If you get a response where they've fixed the date, you can pause --> + +<!-- and go to the dates lesson. Or you can say you'll come back to dates --> + +<!-- at the end. --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Once you have tidied up the data, answer the following questions: + +- How many men and women took part in the study? +- How many A, AB, and B types have been tested? +- As above, but disregarding the contaminated samples? +- How many Rhesus + and - have been tested? +- How many universal donors (O-) have been tested? +- What is the average weight of AB men? +- How many samples have been tested in the different hospitals? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +An **excellent reference**, in particular with regard to R scripting +is the _Tidy Data_ paper @Wickham:2014. + +## Common spreadsheet errors + +**Questions** + +- What are some common challenges with formatting data in spreadsheets + and how can we avoid them? + +**Objectives** + +- Recognise and resolve common spreadsheet formatting problems. + +**Keypoints** + +- Avoid using multiple tables within one spreadsheet. +- Avoid spreading data across multiple tabs. +- Record zeros as zeros. 
+- Use an appropriate null value to record missing data. +- Don't use formatting to convey information or to make your spreadsheet look pretty. +- Place comments in a separate column. +- Record units in column headers. +- Include only one piece of information in a cell. +- Avoid spaces, numbers and special characters in column headers. +- Avoid special characters in your data. +- Record metadata in a separate plain text file. + +<!-- This lesson is meant to be used as a reference for discussion as --> + +<!-- learners identify issues with the messy dataset discussed in the --> + +<!-- previous lesson. Instructors: don't go through this lesson except to --> + +<!-- refer to responses to the exercise in the previous lesson. --> + +There are a few potential errors to be on the lookout for in your own +data as well as data from collaborators or the Internet. If you are +aware of the errors and the possible negative effect on downstream +data analysis and result interpretation, it might motivate yourself +and your project members to try and avoid them. Making small changes +to the way you format your data in spreadsheets, can have a great +impact on efficiency and reliability when it comes to data cleaning +and analysis. + +- [Using multiple tables](#tables) +- [Using multiple tabs](#tabs) +- [Not filling in zeros](#zeros) +- [Using problematic null values](#null) +- [Using formatting to convey information](#formatting) +- [Using formatting to make the data sheet look pretty](#formatting_pretty) +- [Placing comments or units in cells](#units) +- [Entering more than one piece of information in a cell](#info) +- [Using problematic field names](#field_name) +- [Using special characters in data](#special) +- [Inclusion of metadata in data table](#metadata) + +### Using multiple tables {#tables} + +A common strategy is creating multiple data tables within one +spreadsheet. This confuses the computer, so don't do this! 
When you +create multiple tables within one spreadsheet, you're drawing false +associations between things for the computer, which sees each row as +an observation. You're also potentially using the same field name in +multiple places, which will make it harder to clean your data up into +a usable form. The example below depicts the problem: + +![](fig/2_datasheet_example.jpg) + +In the example above, the computer will see (for example) row 4 and +assume that all columns A-AF refer to the same sample. This row +actually represents four distinct samples (sample 1 for each of four +different collection dates - May 29th, June 12th, June 19th, and June +26th), as well as some calculated summary statistics (an average (avr) +and standard error of measurement (SEM)) for two of those +samples. Other rows are similarly problematic. + +### Using multiple tabs {#tabs} + +But what about workbook tabs? That seems like an easy way to organise +data, right? Well, yes and no. When you create extra tabs, you fail to +allow the computer to see connections in the data that are there (you +have to introduce spreadsheet application-specific functions or +scripting to ensure this connection). Say, for instance, you make a +separate tab for each day you take a measurement. + +This isn't good practice for two reasons: + +1. you are more likely to accidentally add inconsistencies to your + data if each time you take a measurement, you start recording data + in a new tab, and + +2. even if you manage to prevent all inconsistencies from creeping in, + you will add an extra step for yourself before you analyse the data + because you will have to combine these data into a single + datatable. You will have to explicitly tell the computer how to + combine tabs - and if the tabs are inconsistently formatted, you + might even have to do it manually. 
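The idea of explicitly telling the computer how to combine tabs can be sketched in Python. This is only an illustration under stated assumptions: each tab is assumed to have been exported as CSV text with identical headers, and the tab names, the column names (`patient_id`, `weight_kg`, `visit_date`) and all values are made up for the example.

```python
import csv
import io

# Hypothetical CSV exports of two tabs, one per measurement day (made-up values).
TAB_EXPORTS = {
    "2020-03-01": "patient_id,weight_kg\nP1,70\nP2,82\n",
    "2020-03-02": "patient_id,weight_kg\nP3,65\n",
}


def combine_tabs(tab_exports, new_column="visit_date"):
    """Stack the rows of every tab and record the tab name in a new column."""
    combined = []
    for tab_name, csv_text in tab_exports.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            # What used to be encoded by the tab becomes an ordinary variable.
            row[new_column] = tab_name
            combined.append(row)
    return combined


combined_rows = combine_tabs(TAB_EXPORTS)
for row in combined_rows:
    print(row)
```

The information that was carried by the tab name ends up in its own column, so the result is a single table with one observation per row.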
+ +The next time you're entering data, and you go to create another tab +or table, ask yourself if you could avoid adding this tab by adding +another column to your original spreadsheet. We used multiple tabs in +our example of a messy data file, but now you've seen how you can +reorganise your data to consolidate across tabs. + +Your data sheet might get very long over the course of the +experiment. This makes it harder to enter data if you can't see your +headers at the top of the spreadsheet. But don't repeat your header +row. These can easily get mixed into the data, leading to problems +down the road. Instead you can freeze the column +headers +so that they remain visible even when you have a spreadsheet with many +rows. + +### Not filling in zeros {#zeros} + +It might be that when you're measuring something, it's usually a zero, +say the number of times a rabbit is observed in the survey. Why bother +writing in the number zero in that column, when it's mostly zeros? + +However, there's a difference between a zero and a blank cell in a +spreadsheet. To the computer, a zero is actually data. You measured or +counted it. A blank cell means that it wasn't measured and the +computer will interpret it as an unknown value (also known as a null +or missing value). + +The spreadsheets or statistical programs will likely misinterpret +blank cells that you intend to be zeros. By not entering the value of +your observation, you are telling your computer to represent that data +as unknown or missing (null). This can cause problems with subsequent +calculations or analyses. For example, the average of a set of numbers +which includes a single null value is always null (because the +computer can't guess the value of the missing observations). Because +of this, it's very important to record zeros as zeros and truly +missing data as nulls. + +### Using problematic null values {#null} + +**Example**: using -999 or other numerical values (or zero) to +represent missing data. 
+ +**Solutions**: + +There are a few reasons why null values get represented differently +within a dataset. Sometimes confusing null values are automatically +recorded from the measuring device. If that's the case, there's not +much you can do, but it can be addressed in data cleaning with a tool +like +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +before analysis. Other times different null values are used to convey +different reasons why the data isn't there. This is important +information to capture, but is in effect using one column to capture +two pieces of information. As with using formatting to convey +information, it is better to create a new +column like 'data_missing' and use that column to capture the +different reasons. + +Whatever the reason, it's a problem if unknown or missing data is +recorded as -999, 999, or 0. + +Many statistical programs will not recognise that these are intended +to represent missing (null) values. How these values are interpreted +will depend on the software you use to analyse your data. It is +essential to use a clearly defined and consistent null indicator. + +Blanks (most applications) and NA (for R) are good +choices. @White:2013 explain good choices for indicating null values +for different software applications in their article: + +![](fig/3_white_table_1.jpg) + +### Using formatting to convey information {#formatting} + +**Example**: highlighting cells, rows or columns that should be +excluded from an analysis, leaving blank rows to indicate +separations in data. + +![](fig/formatting.png) + +**Solution**: create a new field to encode which data should be +excluded. + +![](fig/good_formatting.png) + +### Using formatting to make the data sheet look pretty {#formatting_pretty} + +**Example**: merging cells. + +**Solution**: If you're not careful, formatting a worksheet to be more +aesthetically pleasing can compromise your computer's ability to see +associations in the data. 
Merged cells will make your data unreadable +by statistics software. Consider restructuring your data in such a way +that you will not need to merge cells to organise your data. + +### Placing comments or units in cells {#units} + +Most analysis software can't see Excel or LibreOffice comments, and +would be confused by comments placed within your data cells. As +described above for formatting, create another field if you need to +add notes to cells. Similarly, don't include units in cells: ideally, +all the measurements you place in one column should be in the same +unit, but if for some reason they aren't, create another field and +specify the units the cell is in. + +### Entering more than one piece of information in a cell {#info} + +**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +B+, A-, ... + +**Solution**: Don't include more than one piece of information in a +cell. This will limit the ways in which you can analyse your data. If +you need both these measurements, design your data sheet to include +this information. For example, include one column for the ABO group and +one for the Rhesus group. + +### Using problematic field names {#field_name} + +Choose descriptive field names, but be careful not to include spaces, +numbers, or special characters of any kind. Spaces can be +misinterpreted by parsers that use whitespace as delimiters and some +programs don't like field names that are text strings that start with +numbers. + +Underscores (`_`) are a good alternative to spaces. Consider writing +names in camel case (like this: ExampleFileName) to improve +readability. Remember that abbreviations that make sense at the moment +may not be so obvious in 6 months, but don't overdo it with names that +are excessively long. Including the units in the field names avoids +confusion and enables others to readily interpret your fields. 
+ +**Examples** + +| Good Name | Good Alternative | Avoid | +| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | +| Max_temp_C | MaxTemp | Maximum Temp (°C) | +| Precipitation_mm | Precipitation | precmm | +| Mean_year_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. | +| cell_type | CellType | Cell Type | +| Observation_01 | first_observation | 1st Obs | + +### Using special characters in data {#special} + +**Example**: You treat your spreadsheet program as a word processor +when writing notes, for example copying data directly from Word or +other applications. + +**Solution**: This is a common strategy. For example, when writing +longer text in a cell, people often include line breaks, em-dashes, +etc. in their spreadsheet. Also, when copying data in from +applications such as Word, formatting and fancy non-standard +characters (such as left- and right-aligned quotation marks) are +included. When exporting this data into a coding/statistical +environment or into a relational database, dangerous things may occur, +such as lines being cut in half and encoding errors being thrown. + +General best practice is to avoid adding characters such as newlines, +tabs, and vertical tabs. In other words, treat a text cell as if it +were a simple web form that can only contain text and spaces. + +### Inclusion of metadata in data table {#metadata} + +**Example**: You add a legend at the top or bottom of your data table +explaining column meaning, units, exceptions, etc. + +**Solution**: Recording data about your data ("metadata") is +essential. 
You may be on intimate terms with your dataset while you +are collecting and analysing it, but the chances that you will still +remember that the variable "sglmemgp" means single member of group, +for example, or the exact algorithm you used to transform a variable +or create a derived one, after a few months, a year, or more are slim. + +There are also many reasons other people may want to examine or +use your data - to understand your findings, to verify your findings, +to review your submitted publication, to replicate your results, to +design a similar study, or even to archive your data for access and +re-use by others. While digital data by definition are +machine-readable, understanding their meaning is a job for human +beings. The importance of documenting your data during the collection +and analysis phase of your research cannot be overestimated, +especially if your research is going to be part of the scholarly +record. + +However, metadata should not be contained in the data file +itself. Unlike a table in a paper or a supplemental file, metadata (in +the form of legends) should not be included in a data file since this +information is not data, and including it can disrupt how computer +programs interpret your data file. Rather, metadata should be stored +as a separate file in the same directory as your data file, preferably +in plain text format with a name that clearly associates it with your +data file. Because metadata files are in free-text format, they also +allow you to encode comments, units, information about how null values +are encoded, etc. that are important to document but can disrupt the +formatting of your data file. + +Additionally, file or database level metadata describes how files that +make up the dataset relate to each other; what format they are in; and +whether they supersede or are superseded by previous files. A +folder-level readme.txt file is the classic way of accounting for all +the files and folders in a project. 
+ +(Text on metadata adapted from the online course Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, +University of Edinburgh. MANTRA is licensed under a Creative Commons +Attribution 4.0 International +License.) + +## Exporting data + +**Question** + +- How can we export data from spreadsheets in a way that is useful for + downstream applications? + +**Objectives** + +- Store spreadsheet data in universal file formats. +- Export data from a spreadsheet to a CSV file. + +**Keypoints** + +- Data stored in common spreadsheet formats will often not be read + correctly into data analysis software, introducing errors into your + data. + +- Exporting data from spreadsheets to formats like CSV or TSV puts it + in a format that can be used consistently by most programs. + +Storing the data you're going to work with for your analyses in Excel +default file format (`*.xls` or `*.xlsx` - depending on the Excel +version) isn't a good idea. Why? + +- Because it is a proprietary format, and it is possible that in the + future, technology won't exist (or will become sufficiently rare) to + make it inconvenient, if not impossible, to open the file. + +- Other spreadsheet software may not be able to open files saved in a + proprietary Excel format. + +- Different versions of Excel may handle data differently, leading to + inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + is a well-documented example of inconsistencies in data storage. + +- Finally, more journals and grant agencies are requiring you to + deposit your data in a data repository, and most of them don't + accept Excel format. It needs to be in one of the formats discussed + below. + +- The above points also apply to other formats such as open data + formats used by LibreOffice / Open Office. These formats are not + static and do not get parsed the same way by different software + packages. 
+ +Storing data in a universal, open, and static format will help deal +with this problem. Try tab-delimited (tab separated values or TSV) or +comma-delimited (comma separated values or CSV). CSV files are plain +text files where the columns are separated by commas, hence 'comma +separated values' or CSV. The advantage of a CSV file over an +Excel/SPSS/etc. file is that we can open and read a CSV file using +just about any software, including plain text editors like TextEdit or +Notepad. Data in a CSV file can also be easily imported into other +formats and environments, such as SQLite and R. We're not tied to a +certain version of a certain expensive program when we work with CSV +files, so it's a good format to work with for maximum portability and +endurance. Most spreadsheet programs can save to delimited text +formats like CSV easily, although they may give you a warning during +the file export. + +To save a file you have opened in Excel in CSV format: + +1. From the top menu select 'File' and 'Save as'. +2. In the 'Format' field, from the list, select 'Comma Separated + Values' (`*.csv`). +3. Double check the file name and the location where you want to save + it and hit 'Save'. + +An important note for backwards compatibility: you can open CSV files +in Excel! + +```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/excel-to-csv.png") +``` + +**A note on R and `xls`**: There are R packages that can read `xls` +files (as well as Google spreadsheets). It is even possible to access +different worksheets in the `xls` documents. + +**But** + +- some of these only work on Windows. +- this equates to replacing a (simple but manual) export to `csv` with + additional complexity/dependencies in the data analysis R code. +- data formatting best practices still apply. +- Is there really a good reason why `csv` (or similar) is not + adequate? 
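To illustrate how little tooling delimited text formats need, here is a minimal Python sketch that writes the same small table as CSV and as TSV and reads it back. The table contents and the helper name `to_delimited` are hypothetical and only serve the example. It uses nothing beyond Python's standard `csv` module.

```python
import csv
import io

# Made-up rows standing in for a cleaned table (header first).
rows = [
    ["hospital", "weight_kg", "abo_group"],
    ["Hospital_A", "70", "A"],
    ["Hospital_B", "82", "O"],
]


def to_delimited(table, delimiter):
    """Serialise a list of rows as delimited plain text (CSV or TSV)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")
tsv_text = to_delimited(rows, "\t")

# Reading the TSV text back recovers the same table.
round_trip = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_text)
print(round_trip == rows)
```

Because the output is plain text, the same content can be inspected in any text editor, which is the portability argument made above.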
+ +### Caveats on commas + +In some datasets, the data values themselves may include commas +(,). In that case, the software which you use (including Excel) will +most likely incorrectly display the data in columns. This is because +the commas which are a part of the data values will be interpreted as +delimiters. + +For example, our data might look like this: + +``` +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent +``` + +In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the +value for `taxa` includes a comma (`Rodent, not censused`). If we try +to read the above into Excel (or other spreadsheet program), we will +get something like this: + +```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} +knitr::include_graphics("fig/csv-mistake.png") +``` + +The value for `taxa` was split into two columns (instead of being put +in one column `D`). This can propagate to a number of further +errors. For example, the extra column will be interpreted as a column +with many missing values (and without a proper header). In addition to +that, the value in column `D` for the record in row 3 (so the one +where the value for 'taxa' contained the comma) is now incorrect. + +If you want to store your data in `csv` format and expect that your +data values may contain commas, you can avoid the problem discussed +above by putting the values in quotes (""). 
Applying this rule, our +data might look like this: + +``` +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" +``` + +Now opening this file as a `csv` in Excel will not lead to an extra +column, because Excel will only use commas that fall outside of +quotation marks as delimiting characters. + +Alternatively, if you are working with data that contains commas, you +likely will need to use another delimiter when working in a +spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and +working with TSV files. TSV files can be exported from spreadsheet +programs in the same way as CSV files. + +[^decsep]: This is particularly relevant in European + countries where the comma is used as a decimal + separator. In such cases, the default value separator in a + csv file will be the semi-colon (;), or values will be + systematically quoted. + +If you are working with an already existing dataset in which the data +values are not included in "" but which have commas as both delimiters +and parts of data values, you are potentially facing a major problem +with data cleaning. If the dataset you're dealing with contains +hundreds or thousands of records, cleaning them up manually (by either +removing commas from the data values or putting the values into +quotes - "") is not only going to take hours and hours but may +potentially end up with you accidentally introducing many errors. + +Cleaning up datasets is one of the major problems in many scientific +disciplines. The approach almost always depends on the particular +context. However, it is a good practice to clean the data in an +automated fashion, for example by writing and running a script. The +Python and R lessons will give you the basis for developing skills to +build relevant scripts. 
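
As a rough illustration of the quoting rule discussed above, here is a minimal sketch using Python's standard `csv` module. The names `rows`, `to_csv_text` and `from_csv_text` are hypothetical helpers introduced only for this example, and unlike the lesson's sample file this sketch quotes the header row as well. It writes the species records with every field quoted, then compares a quote-aware parser with a naive split on commas.

```python
import csv
import io

# Records taken from the lesson's example; the taxa value of the
# second record contains a comma.
rows = [
    ["species_id", "genus", "species", "taxa"],
    ["AB", "Amphispiza", "bilineata", "Bird"],
    ["AH", "Ammospermophilus", "harrisi", "Rodent, not censused"],
]


def to_csv_text(records):
    """Serialise records as CSV text, quoting every field (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text into a list of rows, honouring quotes (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text)))


text = to_csv_text(rows)

# A quote-aware parser keeps "Rodent, not censused" as a single value.
parsed = from_csv_text(text)

# A naive split on every comma breaks that value into two pieces.
naive = [line.split(",") for line in text.splitlines()]

print(len(parsed[2]), len(naive[2]))
```

The same idea applies when cleaning an existing file in a script: parse it with a CSV-aware reader rather than splitting lines on commas by hand.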
+ +## Summary + +```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} +knitr::include_graphics("fig/analysis.png") +``` + +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This +iteration is repeated multiple times until the data is understood. In +many real-life cases, however, most time is spent cleaning up and +preparing the data, rather than actually analysing and understanding +it. + +An agile data analysis workflow, with several fast iterations of the +transform/visualise/model cycle is only feasible if the data is +formatted in a predictable way and one can reason about the data +without having to look at it and/or fix it. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Good data organization is the foundation of any research project. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 99598b2326d4fd295425b970978bcf23da8a30ab Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:21 +0900 Subject: [PATCH 003/334] New translations 10-data-organisation.md (Spanish) --- locale/es/episodes/10-data-organisation.Rmd | 829 ++++++++++++++++++++ 1 file changed, 829 insertions(+) create mode 100644 locale/es/episodes/10-data-organisation.Rmd diff --git a/locale/es/episodes/10-data-organisation.Rmd b/locale/es/episodes/10-data-organisation.Rmd new file mode 100644 index 000000000..77b5925cd --- /dev/null +++ b/locale/es/episodes/10-data-organisation.Rmd @@ -0,0 +1,829 @@ +--- +source: Rmd +title: Data organisation with spreadsheets +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Learn about spreadsheets, their strengths and weaknesses. +- How do we format data in spreadsheets for effective data use? 
+- Learn about common spreadsheet errors and how to correct them. +- Organise your data according to tidy data principles. +- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to organise tabular data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Spreadsheet programs + +**Question** + +- What are basic principles for using spreadsheets for good data + organization? + +**Objective** + +- Describe best practices for organizing data so computers can make + the best use of datasets. + +**Keypoint** + +- Good data organization is the foundation of any research project. + +Good data organization is the foundation of your research +project. Most researchers have data or do data entry in +spreadsheets. Spreadsheet programs are very useful graphical +interfaces for designing data tables and handling very basic data +quality control functions. See also @Broman:2018. + +### Spreadsheet outline + +Spreadsheets are good for data entry. Therefore we have a lot of data +in spreadsheets. Much of your time as a researcher will be spent in +this 'data wrangling' stage. It's not the most fun, but it's +necessary. We'll teach you how to think about data organization and +some practices for more effective data wrangling. + +### What this lesson will not teach you + +- How to do _statistics_ in a spreadsheet +- How to do _plotting_ in a spreadsheet +- How to _write code_ in spreadsheet programs + +If you're looking to do this, a good reference is Head First +Excel, +published by O'Reilly. + +### Why aren't we teaching data analysis in spreadsheets + +- Data analysis in spreadsheets usually requires a lot of manual + work. If you want to change a parameter or run an analysis with a + new dataset, you usually have to redo everything by hand. 
(We do + know that you can create macros, but see the next point.) + +- It is also difficult to track or reproduce statistical or plotting + analyses done in spreadsheet programs when you want to go back to + your work or someone asks for details of your analysis. + +Many spreadsheet programs are available. Since most participants +utilise Excel as their primary spreadsheet program, this lesson will +make use of Excel examples. A free spreadsheet program that can also +be used is LibreOffice. Commands may differ a bit between programs, +but the general idea is the same. + +Spreadsheet programs encompass a lot of the things we need to be able +to do as researchers. We can use them for: + +- Data entry +- Organizing data +- Subsetting and sorting data +- Statistics +- Plotting + +Spreadsheet programs use tables to represent and display data. Data +formatted as tables is also the main theme of this chapter, and we +will see how to organise data into tables in a standardised way to +ensure efficient downstream analysis. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Discuss the following points with your neighbour + +- Have you used spreadsheets, in your research, courses, + or at home? +- What kind of operations do you do in spreadsheets? +- Which ones do you think spreadsheets are good for? +- Have you accidentally done something in a spreadsheet program that made you + frustrated or sad? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Problems with spreadsheets + +Spreadsheets are good for data entry, but in reality we tend to +use spreadsheet programs for much more than data entry. We use them +to create data tables for publications, to generate summary +statistics, and make figures. 
+ +Generating tables for publications in a spreadsheet is not +optimal - often, when formatting a data table for publication, we're +reporting key summary statistics in a way that is not really meant to +be read as data, and often involves special formatting +(merging cells, creating borders, making it pretty). We advise you to +do this sort of operation within your document editing software. + +The latter two applications, generating statistics and figures, should +be used with caution: because of the graphical, drag and drop nature of +spreadsheet programs, it can be very difficult, if not impossible, to +replicate your steps (much less retrace anyone else's), particularly if your +stats or figures require you to do more complex calculations. Furthermore, +in doing calculations in a spreadsheet, it's easy to accidentally apply a +slightly different formula to multiple adjacent cells. When using a +command-line based statistics program like R or SAS, it's practically +impossible to apply a calculation to one observation in your +dataset but not another unless you're doing it on purpose. + +### Using spreadsheets for data entry and cleaning + +In this lesson, we will assume that you are most likely using Excel as +your primary spreadsheet program - there are others (gnumeric, Calc +from OpenOffice), and their functionality is similar, but Excel seems +to be the program most used by biologists and biomedical researchers. + +In this lesson we're going to talk about: + +1. Formatting data tables in spreadsheets +2. Formatting problems +3. Exporting data + +## Formatting data tables in spreadsheets + +**Questions** + +- How do we format data in spreadsheets for effective data use? + +**Objectives** + +- Describe best practices for data entry and formatting in + spreadsheets. + +- Apply best practices to arrange variables and observations in a + spreadsheet. + +**Keypoints** + +- Never modify your raw data. Always make a copy before making any + changes. 
+ +- Keep track of all of the steps you take to clean your data in a + plain text file. + +- Organise your data according to tidy data principles. + +The most common mistake made is treating spreadsheet programs like lab +notebooks, that is, relying on context, notes in the margin, spatial +layout of data and fields to convey information. As humans, we can +(usually) interpret these things, but computers don't view information +the same way, and unless we explain to the computer what every single +thing means (and that can be hard!), it will not be able to see how +our data fits together. + +Using the power of computers, we can manage and analyse data in much +more effective and faster ways, but to use that power, we have to set +up our data for the computer to be able to understand it (and +computers are very literal). + +This is why it's extremely important to set up well-formatted tables +from the outset - before you even start entering data from your very +first preliminary experiment. Data organization is the foundation of +your research project. It can make it easier or harder to work with +your data throughout your analysis, so it's worth thinking about when +you're doing your data entry or setting up your experiment. You can +set things up in different ways in spreadsheets, but some of these +choices can limit your ability to work with the data in other programs +or have the you-of-6-months-from-now or your collaborator work with +the data. + +**Note:** the best layouts/formats (as well as software and +interfaces) for data entry and data analysis might be different. It is +important to take this into account, and ideally automate the +conversion from one to another. + +### Keeping track of your analyses + +When you're working with spreadsheets, during data clean up or +analyses, it's very easy to end up with a spreadsheet that looks very +different from the one you started with. 
In order to be able to +reproduce your analyses or figure out what you did when a reviewer or +instructor asks for a different analysis, you should: + +- create a new file with your cleaned or analysed data. Don't modify + the original dataset, or you will never know where you started! + +- keep track of the steps you took in your clean up or analysis. You + should track these steps as you would any step in an experiment. We + recommend that you do this in a plain text file stored in the same + folder as the data file. + +This might be an example of a spreadsheet setup: + +![](fig/spreadsheet-setup-updated.png) + +Put these principles into practice today during your exercises. + +While versioning is out of scope for this course, you can look at the +Carpentries lesson on +['Git'](https://swcarpentry.github.io/git-novice/) to learn how to +maintain **version control** over your data. See also this blog +post for a quick tutorial or +@Perez-Riverol:2016 for a more research-oriented use-case. + +### Structuring data in spreadsheets + +The cardinal rules of using spreadsheet programs for data: + +1. Put all your variables in columns - the thing you're measuring, + like 'weight' or 'temperature'. +2. Put each observation in its own row. +3. Don't combine multiple pieces of information in one cell. Sometimes + it just seems like one thing, but consider whether that's the only way + you'll want to be able to use or sort that data. +4. Leave the raw data raw - don't change it! +5. Export the cleaned data to a text-based format like CSV + (comma-separated values). This ensures that anyone can use + the data, and is required by most data repositories. + +For instance, we have data from patients that visited several +hospitals in Brussels, Belgium. They recorded the date of the visit, +the hospital, the patients' gender, weight and blood group.
+ +If we were to keep track of the data like this: + +![](fig/multiple-info.png) + +the problem is that the ABO and Rhesus groups are in the same `Blood` +type column. So, if they wanted to look at all observations of the A +group or look at weight distributions by ABO group, it would be tricky +to do this using this data setup. If instead we put the ABO and Rhesus +groups in different columns, you can see that it would be much easier. + +![](fig/single-info.png) + +An important rule when setting up a datasheet, is that **columns are +used for variables** and **rows are used for observations**: + +- columns are variables +- rows are observations +- cells are individual values + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: We're going to take a messy dataset and describe how we would clean it up. + +1. Download a messy dataset by clicking + [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). + +2. Open up the data in a spreadsheet program. + +3. You can see that there are two tabs. The data contains various + clinical variables recorded in various hospitals in Brussels during + the first and second COVID-19 waves in 2020. As you can see, the + data have been recorded differently during the March and November + waves. Now you're the person in charge of this project and you want + to be able to start analyzing the data. + +4. With the person next to you, identify what is wrong with this + spreadsheet. Also discuss the steps you would need to take to clean + up first and second wave tabs, and to put them all together in one + spreadsheet. + +**Important:** Do not forget our first piece of advice: to create a +new file (or tab) for the cleaned data, never modify your original +(raw) data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +After you go through this exercise, we'll discuss as a group what was +wrong with this data and how you would fix it. 
+ +<!-- - Take about 10 minutes to work on this exercise. --> + +<!-- - All the mistakes in the *common mistakes* section below are present --> + +<!-- in the messy dataset. If the exercise is done during a workshop, ask --> + +<!-- people what they saw as wrong with the data. As they bring up --> + +<!-- different points, you can refer to the common mistakes or expand a --> + +<!-- bit on the point they brought up. --> + +<!-- - If you get a response where they've fixed the date, you can pause --> + +<!-- and go to the dates lesson. Or you can say you'll come back to dates --> + +<!-- at the end. --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Once you have tidied up the data, answer the following questions: + +- How many men and women took part in the study? +- How many A, AB, and B types have been tested? +- As above, but disregarding the contaminated samples? +- How many Rhesus + and - have been tested? +- How many universal donors (O-) have been tested? +- What is the average weight of AB men? +- How many samples have been tested in the different hospitals? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +An **excellent reference**, in particular with regard to R scripting +is the _Tidy Data_ paper @Wickham:2014. + +## Common spreadsheet errors + +**Questions** + +- What are some common challenges with formatting data in spreadsheets + and how can we avoid them? + +**Objectives** + +- Recognise and resolve common spreadsheet formatting problems. + +**Keypoints** + +- Avoid using multiple tables within one spreadsheet. +- Avoid spreading data across multiple tabs. +- Record zeros as zeros. +- Use an appropriate null value to record missing data. +- Don't use formatting to convey information or to make your spreadsheet look pretty. +- Place comments in a separate column. +- Record units in column headers. +- Include only one piece of information in a cell. +- Avoid spaces, numbers and special characters in column headers. 
+- Avoid special characters in your data. +- Record metadata in a separate plain text file. + +<!-- This lesson is meant to be used as a reference for discussion as --> + +<!-- learners identify issues with the messy dataset discussed in the --> + +<!-- previous lesson. Instructors: don't go through this lesson except to --> + +<!-- refer to responses to the exercise in the previous lesson. --> + +There are a few potential errors to be on the lookout for in your own +data as well as data from collaborators or the Internet. If you are +aware of the errors and the possible negative effect on downstream +data analysis and result interpretation, it might motivate you +and your project members to try to avoid them. Making small changes +to the way you format your data in spreadsheets can have a great +impact on efficiency and reliability when it comes to data cleaning +and analysis. + +- [Using multiple tables](#tables) +- [Using multiple tabs](#tabs) +- [Not filling in zeros](#zeros) +- [Using problematic null values](#null) +- [Using formatting to convey information](#formatting) +- [Using formatting to make the data sheet look pretty](#formatting_pretty) +- [Placing comments or units in cells](#units) +- [Entering more than one piece of information in a cell](#info) +- [Using problematic field names](#field_name) +- [Using special characters in data](#special) +- [Inclusion of metadata in data table](#metadata) + +### Using multiple tables {#tables} + +A common strategy is creating multiple data tables within one +spreadsheet. This confuses the computer, so don't do this! When you +create multiple tables within one spreadsheet, you're drawing false +associations between things for the computer, which sees each row as +an observation. You're also potentially using the same field name in +multiple places, which will make it harder to clean your data up into +a usable form.
The example below depicts the problem: + +![](fig/2_datasheet_example.jpg) + +In the example above, the computer will see (for example) row 4 and +assume that all columns A-AF refer to the same sample. This row +actually represents four distinct samples (sample 1 for each of four +different collection dates - May 29th, June 12th, June 19th, and June +26th), as well as some calculated summary statistics (an average (avr) +and standard error of measurement (SEM)) for two of those +samples. Other rows are similarly problematic. + +### Using multiple tabs {#tabs} + +But what about workbook tabs? That seems like an easy way to organise +data, right? Well, yes and no. When you create extra tabs, you fail to +allow the computer to see connections in the data that are there (you +have to introduce spreadsheet application-specific functions or +scripting to ensure this connection). Say, for instance, you make a +separate tab for each day you take a measurement. + +This isn't good practice for two reasons: + +1. you are more likely to accidentally add inconsistencies to your + data if each time you take a measurement, you start recording data + in a new tab, and + +2. even if you manage to prevent all inconsistencies from creeping in, + you will add an extra step for yourself before you analyse the data + because you will have to combine these data into a single + datatable. You will have to explicitly tell the computer how to + combine tabs - and if the tabs are inconsistently formatted, you + might even have to do it manually. + +The next time you're entering data, and you go to create another tab +or table, ask yourself if you could avoid adding this tab by adding +another column to your original spreadsheet. We used multiple tabs in +our example of a messy data file, but now you've seen how you can +reorganise your data to consolidate across tabs. + +Your data sheet might get very long over the course of the +experiment. 
This makes it harder to enter data if you can't see your +headers at the top of the spreadsheet. But don't repeat your header +row. These can easily get mixed into the data, leading to problems +down the road. Instead you can freeze the column +headers +so that they remain visible even when you have a spreadsheet with many +rows. + +### Not filling in zeros {#zeros} + +It might be that when you're measuring something, it's usually a zero, +say the number of times a rabbit is observed in the survey. Why bother +writing in the number zero in that column, when it's mostly zeros? + +However, there's a difference between a zero and a blank cell in a +spreadsheet. To the computer, a zero is actually data. You measured or +counted it. A blank cell means that it wasn't measured and the +computer will interpret it as an unknown value (also known as a null +or missing value). + +The spreadsheets or statistical programs will likely misinterpret +blank cells that you intend to be zeros. By not entering the value of +your observation, you are telling your computer to represent that data +as unknown or missing (null). This can cause problems with subsequent +calculations or analyses. For example, the average of a set of numbers +which includes a single null value is always null (because the +computer can't guess the value of the missing observations). Because +of this, it's very important to record zeros as zeros and truly +missing data as nulls. + +### Using problematic null values {#null} + +**Example**: using -999 or other numerical values (or zero) to +represent missing data. + +**Solutions**: + +There are a few reasons why null values get represented differently +within a dataset. Sometimes confusing null values are automatically +recorded from the measuring device. If that's the case, there's not +much you can do, but it can be addressed in data cleaning with a tool +like +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +before analysis. 
Other times different null values are used to convey +different reasons why the data isn't there. This is important +information to capture, but is in effect using one column to capture +two pieces of information. As with using formatting to convey +information, it is better here to create a new +column such as 'data_missing' and use that column to capture the +different reasons. + +Whatever the reason, it's a problem if unknown or missing data is +recorded as -999, 999, or 0. + +Many statistical programs will not recognise that these are intended +to represent missing (null) values. How these values are interpreted +will depend on the software you use to analyse your data. It is +essential to use a clearly defined and consistent null indicator. + +Blanks (most applications) and NA (for R) are good +choices. @White:2013 explain good choices for indicating null values +for different software applications in their article: + +![](fig/3_white_table_1.jpg) + +### Using formatting to convey information {#formatting} + +**Example**: highlighting cells, rows or columns that should be +excluded from an analysis, leaving blank rows to indicate +separations in data. + +![](fig/formatting.png) + +**Solution**: create a new field to encode which data should be +excluded. + +![](fig/good_formatting.png) + +### Using formatting to make the data sheet look pretty {#formatting_pretty} + +**Example**: merging cells. + +**Solution**: If you're not careful, formatting a worksheet to be more +aesthetically pleasing can compromise your computer's ability to see +associations in the data. Merged cells will make your data unreadable +by statistics software. Consider restructuring your data in such a way +that you will not need to merge cells to organise your data. + +### Placing comments or units in cells {#units} + +Most analysis software can't see Excel or LibreOffice comments, and +would be confused by comments placed within your data cells.
As +described above for formatting, create another field if you need to +add notes to cells. Similarly, don't include units in cells: ideally, +all the measurements you place in one column should be in the same +unit, but if for some reason they aren't, create another field and +specify the units the cell is in. + +### Entering more than one piece of information in a cell {#info} + +**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +B+, A-, ... + +**Solution**: Don't include more than one piece of information in a +cell. This will limit the ways in which you can analyse your data. If +you need both these measurements, design your data sheet to include +this information. For example, include one column for the ABO group and +one for the Rhesus group. + +### Using problematic field names {#field_name} + +Choose descriptive field names, but be careful not to include spaces, +numbers, or special characters of any kind. Spaces can be +misinterpreted by parsers that use whitespace as delimiters and some +programs don't like field names that are text strings that start with +numbers. + +Underscores (`_`) are a good alternative to spaces. Consider writing +names in camel case (like this: ExampleFileName) to improve +readability. Remember that abbreviations that make sense at the moment +may not be so obvious in 6 months, but don't overdo it with names that +are excessively long. Including the units in the field names avoids +confusion and enables others to readily interpret your fields. + +**Examples** + +| Good Name | Good Alternative | Avoid | +| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | +| Max_temp_C | MaxTemp | Maximum Temp (°C) | +| Precipitation_mm | Precipitation | precmm | +| Mean_year_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell_type | CellType | Cell Type | +| Observation_01 | first_observation | 1st Obs | + +### Using special characters in data {#special} + +**Example**: You treat your spreadsheet program as a word processor +when writing notes, for example copying data directly from Word or +other applications. + +**Solution**: This is a common strategy. For example, when writing +longer text in a cell, people often include line breaks, em-dashes, +etc. in their spreadsheet. Also, when copying data in from +applications such as Word, formatting and fancy non-standard +characters (such as left- and right-aligned quotation marks) are +included. When exporting this data into a coding/statistical +environment or into a relational database, dangerous things may occur, +such as lines being cut in half and encoding errors being thrown. + +General best practice is to avoid adding characters such as newlines, +tabs, and vertical tabs. In other words, treat a text cell as if it +were a simple web form that can only contain text and spaces. + +### Inclusion of metadata in data table {#metadata} + +**Example**: You add a legend at the top or bottom of your data table +explaining column meaning, units, exceptions, etc. + +**Solution**: Recording data about your data ("metadata") is +essential. You may be on intimate terms with your dataset while you +are collecting and analysing it, but the chances that you will still +remember that the variable "sglmemgp" means single member of group, +for example, or the exact algorithm you used to transform a variable +or create a derived one, after a few months, a year, or more are slim. + +As well, there are many reasons other people may want to examine or +use your data - to understand your findings, to verify your findings, +to review your submitted publication, to replicate your results, to +design a similar study, or even to archive your data for access and +re-use by others. 
While digital data by definition are +machine-readable, understanding their meaning is a job for human +beings. The importance of documenting your data during the collection +and analysis phase of your research cannot be overestimated, +especially if your research is going to be part of the scholarly +record. + +However, metadata should not be contained in the data file +itself. Unlike a table in a paper or a supplemental file, metadata (in +the form of legends) should not be included in a data file since this +information is not data, and including it can disrupt how computer +programs interpret your data file. Rather, metadata should be stored +as a separate file in the same directory as your data file, preferably +in plain text format with a name that clearly associates it with your +data file. Because metadata files are in free-text format, they also +allow you to encode comments, units, information about how null values +are encoded, etc. that are important to document but can disrupt the +formatting of your data file. + +Additionally, file or database level metadata describes how files that +make up the dataset relate to each other; what format they are in; and +whether they supersede or are superseded by previous files. A +folder-level readme.txt file is the classic way of accounting for all +the files and folders in a project. + +(Text on metadata adapted from the online course Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, +University of Edinburgh. MANTRA is licensed under a Creative Commons +Attribution 4.0 International +License.) + +## Exporting data + +**Question** + +- How can we export data from spreadsheets in a way that is useful for + downstream applications? + +**Objectives** + +- Store spreadsheet data in universal file formats. +- Export data from a spreadsheet to a CSV file.
+ +**Keypoints** + +- Data stored in common spreadsheet formats will often not be read + correctly into data analysis software, introducing errors into your + data. + +- Exporting data from spreadsheets to formats like CSV or TSV puts it + in a format that can be used consistently by most programs. + +Storing the data you're going to work with for your analyses in Excel +default file format (`*.xls` or `*.xlsx` - depending on the Excel +version) isn't a good idea. Why? + +- Because it is a proprietary format, and it is possible that in the + future, technology won't exist (or will become sufficiently rare) to + make it inconvenient, if not impossible, to open the file. + +- Other spreadsheet software may not be able to open files saved in a + proprietary Excel format. + +- Different versions of Excel may handle data differently, leading to + inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + is a well-documented example of inconsistencies in data storage. + +- Finally, more journals and grant agencies are requiring you to + deposit your data in a data repository, and most of them don't + accept Excel format. It needs to be in one of the formats discussed + below. + +- The above points also apply to other formats such as open data + formats used by LibreOffice / Open Office. These formats are not + static and do not get parsed the same way by different software + packages. + +Storing data in a universal, open, and static format will help deal +with this problem. Try tab-delimited (tab separated values or TSV) or +comma-delimited (comma separated values or CSV). CSV files are plain +text files where the columns are separated by commas, hence 'comma +separated values' or CSV. The advantage of a CSV file over an +Excel/SPSS/etc. file is that we can open and read a CSV file using +just about any software, including plain text editors like TextEdit or +NotePad. 
Data in a CSV file can also be easily imported into other +formats and environments, such as SQLite and R. We're not tied to a +certain version of a certain expensive program when we work with CSV +files, so it's a good format to work with for maximum portability and +endurance. Most spreadsheet programs can save to delimited text +formats like CSV easily, although they may give you a warning during +the file export. + +To save a file you have opened in Excel in CSV format: + +1. From the top menu select 'File' and 'Save as'. +2. In the 'Format' field, from the list, select 'Comma Separated + Values' (`*.csv`). +3. Double check the file name and the location where you want to save + it and hit 'Save'. + +An important note for backwards compatibility: you can open CSV files +in Excel! + +```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/excel-to-csv.png") +``` + +**A note on R and `xls`**: There are R packages that can read `xls` +files (as well as Google spreadsheets). It is even possible to access +different worksheets in the `xls` documents. + +**But** + +- some of these only work on Windows. +- this equates to replacing a (simple but manual) export to `csv` with + additional complexity/dependencies in the data analysis R code. +- data formatting best practices still apply. +- Is there really a good reason why `csv` (or similar) is not + adequate? + +### Caveats on commas + +In some datasets, the data values themselves may include commas +(,). In that case, the software which you use (including Excel) will +most likely incorrectly display the data in columns. This is because +the commas which are a part of the data values will be interpreted as +delimiters.
+ +For example, our data might look like this: + +``` +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent +``` + +In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the +value for `taxa` includes a comma (`Rodent, not censused`). If we try +to read the above into Excel (or other spreadsheet program), we will +get something like this: + +```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} +knitr::include_graphics("fig/csv-mistake.png") +``` + +The value for `taxa` was split into two columns (instead of being put +in one column `D`). This can propagate to a number of further +errors. For example, the extra column will be interpreted as a column +with many missing values (and without a proper header). In addition to +that, the value in column `D` for the record in row 3 (so the one +where the value for 'taxa' contained the comma) is now incorrect. + +If you want to store your data in `csv` format and expect that your +data values may contain commas, you can avoid the problem discussed +above by putting the values in quotes (""). Applying this rule, our +data might look like this: + +``` +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" +``` + +Now opening this file as a `csv` in Excel will not lead to an extra +column, because Excel will only use commas that fall outside of +quotation marks as delimiting characters. + +Alternatively, if you are working with data that contains commas, you +likely will need to use another delimiter when working in a +spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and +working with TSV files. 
TSV files can be exported from spreadsheet +programs in the same way as CSV files. + +[^decsep]: This is particularly relevant in European + countries where the comma is used as a decimal + separator. In such cases, the default value separator in a + csv file will be the semi-colon (;), or values will be + systematically quoted. + +If you are working with an already existing dataset in which the data +values are not included in "" but which have commas as both delimiters +and parts of data values, you are potentially facing a major problem +with data cleaning. If the dataset you're dealing with contains +hundreds or thousands of records, cleaning them up manually (by either +removing commas from the data values or putting the values into +quotes - "") is not only going to take hours and hours but may +potentially end up with you accidentally introducing many errors. + +Cleaning up datasets is one of the major problems in many scientific +disciplines. The approach almost always depends on the particular +context. However, it is a good practice to clean the data in an +automated fashion, for example by writing and running a script. The +Python and R lessons will give you the basis for developing skills to +build relevant scripts. + +## Summary + +```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} +knitr::include_graphics("fig/analysis.png") +``` + +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This +iteration is repeated multiple times until the data is understood. In +many real-life cases, however, most time is spent cleaning up and +preparing the data, rather than actually analysing and understanding +it. 
+ +An agile data analysis workflow, with several fast iterations of the +transform/visualise/model cycle, is only feasible if the data is +formatted in a predictable way and one can reason about the data +without having to look at it and/or fix it. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Good data organization is the foundation of any research project. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From d4fa59fafb3bf3897a24465f7958ffca70dda73b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:23 +0900 Subject: [PATCH 004/334] New translations 10-data-organisation.md (Japanese) --- locale/ja/episodes/10-data-organisation.Rmd | 828 ++++++++++++++++++++ 1 file changed, 828 insertions(+) create mode 100644 locale/ja/episodes/10-data-organisation.Rmd diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd new file mode 100644 index 000000000..b12c852cf --- /dev/null +++ b/locale/ja/episodes/10-data-organisation.Rmd @@ -0,0 +1,828 @@ +--- +source: Rmd +title: Data organisation with spreadsheets +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Learn about spreadsheets, their strengths and weaknesses. +- How do we format data in spreadsheets for effective data use? +- Learn about common spreadsheet errors and how to correct them. +- Organise your data according to tidy data principles. +- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to organise tabular data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Spreadsheet programs + +**Question** + +- What are basic principles for using spreadsheets for good data organization? + +**Objective** + +- Describe best practices for organizing data so computers can make + the best use of datasets.
+ +**Keypoint** + +- Good data organization is the foundation of any research project. + +Good data organization is the foundation of your research +project. Most researchers have data or do data entry in +spreadsheets. Spreadsheet programs are very useful graphical +interfaces for designing data tables and handling very basic data +quality control functions. See also @Broman:2018. + +### Spreadsheet outline + +Spreadsheets are good for data entry. Therefore we have a lot of data +in spreadsheets. Much of your time as a researcher will be spent in +this 'data wrangling' stage. It's not the most fun, but it's +necessary. We'll teach you how to think about data organization and +some practices for more effective data wrangling. + +### What this lesson will not teach you + +- How to do _statistics_ in a spreadsheet +- How to do _plotting_ in a spreadsheet +- How to _write code_ in spreadsheet programs + +If you're looking to do this, a good reference is Head First +Excel, +published by O'Reilly. + +### Why aren't we teaching data analysis in spreadsheets + +- Data analysis in spreadsheets usually requires a lot of manual + work. If you want to change a parameter or run an analysis with a + new dataset, you usually have to redo everything by hand. (We do + know that you can create macros, but see the next point.) + +- It is also difficult to track or reproduce statistical or plotting + analyses done in spreadsheet programs when you want to go back to + your work or someone asks for details of your analysis. + +Many spreadsheet programs are available. Since most participants +utilise Excel as their primary spreadsheet program, this lesson will +make use of Excel examples. A free spreadsheet program that can also +be used is LibreOffice. Commands may differ a bit between programs, +but the general idea is the same. + +Spreadsheet programs encompass a lot of the things we need to be able +to do as researchers. 
We can use them for: + +- Data entry +- Organizing data +- Subsetting and sorting data +- Statistics +- Plotting + +Spreadsheet programs use tables to represent and display data. Data +formatted as tables is also the main theme of this chapter, and we +will see how to organise data into tables in a standardised way to +ensure efficient downstream analysis. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Discuss the following points with your neighbour + +- Have you used spreadsheets, in your research, courses, + or at home? +- What kind of operations do you do in spreadsheets? +- Which ones do you think spreadsheets are good for? +- Have you accidentally done something in a spreadsheet program that made you + frustrated or sad? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Problems with spreadsheets + +Spreadsheets are good for data entry, but in reality we tend to +use spreadsheet programs for much more than data entry. We use them +to create data tables for publications, to generate summary +statistics, and make figures. + +Generating tables for publications in a spreadsheet is not +optimal - often, when formatting a data table for publication, we're +reporting key summary statistics in a way that is not really meant to +be read as data, and often involves special formatting +(merging cells, creating borders, making it pretty). We advise you to +do this sort of operation within your document editing software. + +The latter two applications, generating statistics and figures, should +be used with caution: because of the graphical, drag and drop nature of +spreadsheet programs, it can be very difficult, if not impossible, to +replicate your steps (much less retrace anyone else's), particularly if your +stats or figures require you to do more complex calculations. Furthermore, +in doing calculations in a spreadsheet, it's easy to accidentally apply a +slightly different formula to multiple adjacent cells. 
When using a +command-line based statistics program like R or SAS, it's practically +impossible to apply a calculation to one observation in your +dataset but not another unless you're doing it on purpose. + +### Using spreadsheets for data entry and cleaning + +In this lesson, we will assume that you are most likely using Excel as +your primary spreadsheet program - there are others (gnumeric, Calc +from OpenOffice), and their functionality is similar, but Excel seems +to be the program most used by biologists and biomedical researchers. + +In this lesson we're going to talk about: + +1. Formatting data tables in spreadsheets +2. Formatting problems +3. Exporting data + +## Formatting data tables in spreadsheets + +**Questions** + +- How do we format data in spreadsheets for effective data use? + +**Objectives** + +- Describe best practices for data entry and formatting in + spreadsheets. + +- Apply best practices to arrange variables and observations in a + spreadsheet. + +**Keypoints** + +- Never modify your raw data. Always make a copy before making any + changes. + +- Keep track of all of the steps you take to clean your data in a + plain text file. + +- Organise your data according to tidy data principles. + +The most common mistake made is treating spreadsheet programs like lab +notebooks, that is, relying on context, notes in the margin, spatial +layout of data and fields to convey information. As humans, we can +(usually) interpret these things, but computers don't view information +the same way, and unless we explain to the computer what every single +thing means (and that can be hard!), it will not be able to see how +our data fits together. + +Using the power of computers, we can manage and analyse data in much +more effective and faster ways, but to use that power, we have to set +up our data for the computer to be able to understand it (and +computers are very literal). 
+ +This is why it's extremely important to set up well-formatted tables +from the outset - before you even start entering data from your very +first preliminary experiment. Data organization is the foundation of +your research project. It can make it easier or harder to work with +your data throughout your analysis, so it's worth thinking about when +you're doing your data entry or setting up your experiment. You can +set things up in different ways in spreadsheets, but some of these +choices can limit your ability to work with the data in other programs +or have the you-of-6-months-from-now or your collaborator work with +the data. + +**Note:** the best layouts/formats (as well as software and +interfaces) for data entry and data analysis might be different. It is +important to take this into account, and ideally automate the +conversion from one to another. + +### Keeping track of your analyses + +When you're working with spreadsheets, during data clean up or +analyses, it's very easy to end up with a spreadsheet that looks very +different from the one you started with. In order to be able to +reproduce your analyses or figure out what you did when a reviewer or +instructor asks for a different analysis, you should + +- create a new file with your cleaned or analysed data. Don't modify + the original dataset, or you will never know where you started! + +- keep track of the steps you took in your clean up or analysis. You + should track these steps as you would any step in an experiment. We + recommend that you do this in a plain text file stored in the same + folder as the data file. + +This might be an example of a spreadsheet setup: + +![](fig/spreadsheet-setup-updated.png) + +Put these principles in to practice today during your exercises. + +While versioning is out of scope for this course, you can look at the +Carpentries lesson on +['Git'](https://swcarpentry.github.io/git-novice/) to learn how to +maintain **version control** over your data. 
See also this blog +post for a quick tutorial or +@Perez-Riverol:2016 for a more research-oriented use-case. + +### Structuring data in spreadsheets + +The cardinal rules of using spreadsheet programs for data: + +1. Put all your variables in columns - the thing you're measuring, + like 'weight' or 'temperature'. +2. Put each observation in its own row. +3. Don't combine multiple pieces of information in one cell. Sometimes + it just seems like one thing, but think if that's the only way + you'll want to be able to use or sort that data. +4. Leave the raw data raw - don't change it! +5. Export the cleaned data to a text-based format like CSV + (comma-separated values) format. This ensures that anyone can use + the data, and is required by most data repositories. + +For instance, we have data from patients that visited several +hospitals in Brussels, Belgium. They recorded the date of the visit, +the hospital, the patients' gender, weight and blood group. + +If we were to keep track of the data like this: + +![](fig/multiple-info.png) + +the problem is that the ABO and Rhesus groups are in the same `Blood` +type column. So, if they wanted to look at all observations of the A +group or look at weight distributions by ABO group, it would be tricky +to do this using this data setup. If instead we put the ABO and Rhesus +groups in different columns, you can see that it would be much easier. + +![](fig/single-info.png) + +An important rule when setting up a datasheet, is that **columns are +used for variables** and **rows are used for observations**: + +- columns are variables +- rows are observations +- cells are individual values + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: We're going to take a messy dataset and describe how we would clean it up. + +1. Download a messy dataset by clicking + [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). + +2. Open up the data in a spreadsheet program. + +3. 
You can see that there are two tabs. The data contains various + clinical variables recorded in various hospitals in Brussels during + the first and second COVID-19 waves in 2020. As you can see, the + data have been recorded differently during the March and November + waves. Now you're the person in charge of this project and you want + to be able to start analyzing the data. + +4. With the person next to you, identify what is wrong with this + spreadsheet. Also discuss the steps you would need to take to clean + up first and second wave tabs, and to put them all together in one + spreadsheet. + +**Important:** Do not forget our first piece of advice: to create a +new file (or tab) for the cleaned data, never modify your original +(raw) data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +After you go through this exercise, we'll discuss as a group what was +wrong with this data and how you would fix it. + +<!-- - Take about 10 minutes to work on this exercise. --> + +<!-- - All the mistakes in the *common mistakes* section below are present --> + +<!-- in the messy dataset. If the exercise is done during a workshop, ask --> + +<!-- people what they saw as wrong with the data. As they bring up --> + +<!-- different points, you can refer to the common mistakes or expand a --> + +<!-- bit on the point they brought up. --> + +<!-- - If you get a response where they've fixed the date, you can pause --> + +<!-- and go to the dates lesson. Or you can say you'll come back to dates --> + +<!-- at the end. --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Once you have tidied up the data, answer the following questions: + +- How many men and women took part in the study? +- How many A, AB, and B types have been tested? +- As above, but disregarding the contaminated samples? +- How many Rhesus + and - have been tested? +- How many universal donors (O-) have been tested? +- What is the average weight of AB men? 
+- How many samples have been tested in the different hospitals? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +An **excellent reference**, in particular with regard to R scripting +is the _Tidy Data_ paper @Wickham:2014. + +## Common spreadsheet errors + +**Questions** + +- What are some common challenges with formatting data in spreadsheets + and how can we avoid them? + +**Objectives** + +- Recognise and resolve common spreadsheet formatting problems. + +**Keypoints** + +- Avoid using multiple tables within one spreadsheet. +- Avoid spreading data across multiple tabs. +- Record zeros as zeros. +- Use an appropriate null value to record missing data. +- Don't use formatting to convey information or to make your spreadsheet look pretty. +- Place comments in a separate column. +- Record units in column headers. +- Include only one piece of information in a cell. +- Avoid spaces, numbers and special characters in column headers. +- Avoid special characters in your data. +- Record metadata in a separate plain text file. + +<!-- This lesson is meant to be used as a reference for discussion as --> + +<!-- learners identify issues with the messy dataset discussed in the --> + +<!-- previous lesson. Instructors: don't go through this lesson except to --> + +<!-- refer to responses to the exercise in the previous lesson. --> + +There are a few potential errors to be on the lookout for in your own +data as well as data from collaborators or the Internet. If you are +aware of the errors and the possible negative effect on downstream +data analysis and result interpretation, it might motivate yourself +and your project members to try and avoid them. Making small changes +to the way you format your data in spreadsheets, can have a great +impact on efficiency and reliability when it comes to data cleaning +and analysis. 
+ +- [Using multiple tables](#tables) +- [Using multiple tabs](#tabs) +- [Not filling in zeros](#zeros) +- [Using problematic null values](#null) +- [Using formatting to convey information](#formatting) +- [Using formatting to make the data sheet look pretty](#formatting_pretty) +- [Placing comments or units in cells](#units) +- [Entering more than one piece of information in a cell](#info) +- [Using problematic field names](#field_name) +- [Using special characters in data](#special) +- [Inclusion of metadata in data table](#metadata) + +### Using multiple tables {#tables} + +A common strategy is creating multiple data tables within one +spreadsheet. This confuses the computer, so don't do this! When you +create multiple tables within one spreadsheet, you're drawing false +associations between things for the computer, which sees each row as +an observation. You're also potentially using the same field name in +multiple places, which will make it harder to clean your data up into +a usable form. The example below depicts the problem: + +![](fig/2_datasheet_example.jpg) + +In the example above, the computer will see (for example) row 4 and +assume that all columns A-AF refer to the same sample. This row +actually represents four distinct samples (sample 1 for each of four +different collection dates - May 29th, June 12th, June 19th, and June +26th), as well as some calculated summary statistics (an average (avr) +and standard error of measurement (SEM)) for two of those +samples. Other rows are similarly problematic. + +### Using multiple tabs {#tabs} + +But what about workbook tabs? That seems like an easy way to organise +data, right? Well, yes and no. When you create extra tabs, you fail to +allow the computer to see connections in the data that are there (you +have to introduce spreadsheet application-specific functions or +scripting to ensure this connection). Say, for instance, you make a +separate tab for each day you take a measurement. 
+ +This isn't good practice for two reasons: + +1. you are more likely to accidentally add inconsistencies to your + data if each time you take a measurement, you start recording data + in a new tab, and + +2. even if you manage to prevent all inconsistencies from creeping in, + you will add an extra step for yourself before you analyse the data + because you will have to combine these data into a single + datatable. You will have to explicitly tell the computer how to + combine tabs - and if the tabs are inconsistently formatted, you + might even have to do it manually. + +The next time you're entering data, and you go to create another tab +or table, ask yourself if you could avoid adding this tab by adding +another column to your original spreadsheet. We used multiple tabs in +our example of a messy data file, but now you've seen how you can +reorganise your data to consolidate across tabs. + +Your data sheet might get very long over the course of the +experiment. This makes it harder to enter data if you can't see your +headers at the top of the spreadsheet. But don't repeat your header +row. These can easily get mixed into the data, leading to problems +down the road. Instead you can freeze the column +headers +so that they remain visible even when you have a spreadsheet with many +rows. + +### Not filling in zeros {#zeros} + +It might be that when you're measuring something, it's usually a zero, +say the number of times a rabbit is observed in the survey. Why bother +writing in the number zero in that column, when it's mostly zeros? + +However, there's a difference between a zero and a blank cell in a +spreadsheet. To the computer, a zero is actually data. You measured or +counted it. A blank cell means that it wasn't measured and the +computer will interpret it as an unknown value (also known as a null +or missing value). + +The spreadsheets or statistical programs will likely misinterpret +blank cells that you intend to be zeros. 
By not entering the value of +your observation, you are telling your computer to represent that data +as unknown or missing (null). This can cause problems with subsequent +calculations or analyses. For example, in software such as R, the average +of a set of numbers which includes a single null value is by default null +(because the computer can't guess the value of the missing observations). Because +of this, it's very important to record zeros as zeros and truly +missing data as nulls. + +### Using problematic null values {#null} + +**Example**: using -999 or other numerical values (or zero) to +represent missing data. + +**Solutions**: + +There are a few reasons why null values get represented differently +within a dataset. Sometimes confusing null values are automatically +recorded from the measuring device. If that's the case, there's not +much you can do, but it can be addressed in data cleaning with a tool +like +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +before analysis. Other times different null values are used to convey +different reasons why the data isn't there. This is important +information to capture, but is in effect using one column to capture +two pieces of information. As with using formatting to convey +information, it is better here to create a new +column like 'data_missing' and use that column to capture the +different reasons. + +Whatever the reason, it's a problem if unknown or missing data is +recorded as -999, 999, or 0. + +Many statistical programs will not recognise that these are intended +to represent missing (null) values. How these values are interpreted +will depend on the software you use to analyse your data. It is +essential to use a clearly defined and consistent null indicator. + +Blanks (most applications) and NA (for R) are good +choices.
@White:2013 explain good choices for indicating null values +for different software applications in their article: + +![](fig/3_white_table_1.jpg) + +### Using formatting to convey information {#formatting} + +**Example**: highlighting cells, rows or columns that should be +excluded from an analysis, leaving blank rows to indicate +separations in data. + +![](fig/formatting.png) + +**Solution**: create a new field to encode which data should be +excluded. + +![](fig/good_formatting.png) + +### Using formatting to make the data sheet look pretty {#formatting_pretty} + +**Example**: merging cells. + +**Solution**: If you're not careful, formatting a worksheet to be more +aesthetically pleasing can compromise your computer's ability to see +associations in the data. Merged cells will make your data unreadable +by statistics software. Consider restructuring your data in such a way +that you will not need to merge cells to organise your data. + +### Placing comments or units in cells {#units} + +Most analysis software can't see Excel or LibreOffice comments, and +would be confused by comments placed within your data cells. As +described above for formatting, create another field if you need to +add notes to cells. Similarly, don't include units in cells: ideally, +all the measurements you place in one column should be in the same +unit, but if for some reason they aren't, create another field and +specify the units the cell is in. + +### Entering more than one piece of information in a cell {#info} + +**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +B+, A-, ... + +**Solution**: Don't include more than one piece of information in a +cell. This will limit the ways in which you can analyse your data. If +you need both these measurements, design your data sheet to include +this information. For example, include one column for the ABO group and +one for the Rhesus group. 
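As a rough sketch of the fix described above (not part of the original lesson), the Python snippet below splits a combined value such as `AB+` into separate ABO and Rhesus fields. The function name `split_blood_group` and the keys `blood`, `abo` and `rhesus` are hypothetical, chosen only for illustration.

```python
VALID_ABO = {"A", "B", "AB", "O"}


def split_blood_group(value):
    """Split a combined blood group such as 'AB+' into (ABO group, Rhesus sign)."""
    text = value.strip()
    # Everything except the last character is the ABO group; the last is the sign.
    abo, rhesus = text[:-1], text[-1:]
    if abo not in VALID_ABO or rhesus not in ("+", "-"):
        raise ValueError(f"unexpected blood group: {value!r}")
    return abo, rhesus


# Hypothetical records that store both pieces of information in one cell.
records = [{"blood": "A+"}, {"blood": "O-"}, {"blood": "AB+"}]

# Replace the combined field with one field per piece of information.
for record in records:
    record["abo"], record["rhesus"] = split_blood_group(record.pop("blood"))

print(records)
```

Raising an error on unexpected values, rather than guessing, keeps data-entry mistakes visible instead of silently carrying them into the cleaned table.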
+ +### Using problematic field names {#field_name} + +Choose descriptive field names, but be careful not to include spaces, +numbers, or special characters of any kind. Spaces can be +misinterpreted by parsers that use whitespace as delimiters and some +programs don't like field names that are text strings that start with +numbers. + +Underscores (`_`) are a good alternative to spaces. Consider writing +names in camel case (like this: ExampleFileName) to improve +readability. Remember that abbreviations that make sense at the moment +may not be so obvious in 6 months, but don't overdo it with names that +are excessively long. Including the units in the field names avoids +confusion and enables others to readily interpret your fields. + +**Examples** + +| Good Name | Good Alternative | Avoid | +| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | +| Max_temp_C | MaxTemp | Maximum Temp (°C) | +| Precipitation_mm | Precipitation | precmm | +| Mean_year_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. | +| cell_type | CellType | Cell Type | +| Observation_01 | first_observation | 1st Obs | + +### Using special characters in data {#special} + +**Example**: You treat your spreadsheet program as a word processor +when writing notes, for example copying data directly from Word or +other applications. + +**Solution**: This is a common strategy. For example, when writing +longer text in a cell, people often include line breaks, em-dashes, +etc. in their spreadsheet. Also, when copying data in from +applications such as Word, formatting and fancy non-standard +characters (such as left- and right-aligned quotation marks) are +included. When exporting this data into a coding/statistical +environment or into a relational database, dangerous things may occur, +such as lines being cut in half and encoding errors being thrown. 
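When such characters have already crept into a text cell, one possible cleanup is sketched below in Python. This is only an illustration under stated assumptions: the function name `clean_cell` and the particular replacement table are hypothetical, and a real dataset may need a different mapping.

```python
# Hypothetical mapping from "fancy" characters to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}


def clean_cell(text):
    """Return text with fancy punctuation replaced and whitespace normalised."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    # str.split() with no argument splits on any run of whitespace, including
    # newlines, tabs and vertical tabs, so joining with a single space
    # collapses all of them.
    return " ".join(text.split())


note = "\u201cfirst\u201d line\nsecond\tline \u2014 done"
print(clean_cell(note))
```

Cleaning in a script like this, on a copy of the data, keeps the raw file untouched and makes the step repeatable.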
+ +General best practice is to avoid adding characters such as newlines, +tabs, and vertical tabs. In other words, treat a text cell as if it +were a simple web form that can only contain text and spaces. + +### Inclusion of metadata in data table {#metadata} + +**Example**: You add a legend at the top or bottom of your data table +explaining column meaning, units, exceptions, etc. + +**Solution**: Recording data about your data ("metadata") is +essential. You may be on intimate terms with your dataset while you +are collecting and analysing it, but the chances that you will still +remember that the variable "sglmemgp" means single member of group, +for example, or the exact algorithm you used to transform a variable +or create a derived one, after a few months, a year, or more are slim. + +As well, there are many reasons other people may want to examine or +use your data - to understand your findings, to verify your findings, +to review your submitted publication, to replicate your results, to +design a similar study, or even to archive your data for access and +re-use by others. While digital data by definition are +machine-readable, understanding their meaning is a job for human +beings. The importance of documenting your data during the collection +and analysis phase of your research cannot be overestimated, +especially if your research is going to be part of the scholarly +record. + +However, metadata should not be contained in the data file +itself. Unlike a table in a paper or a supplemental file, metadata (in +the form of legends) should not be included in a data file since this +information is not data, and including it can disrupt how computer +programs interpret your data file. Rather, metadata should be stored +as a separate file in the same directory as your data file, preferably +in plain text format with a name that clearly associates it with your +data file. 
Because metadata files are in free-text format, they also +allow you to encode comments, units, information about how null values +are encoded, etc. that are important to document but can disrupt the +formatting of your data file. + +Additionally, file or database level metadata describes how files that +make up the dataset relate to each other; what format they are in; and +whether they supersede or are superseded by previous files. A +folder-level readme.txt file is the classic way of accounting for all +the files and folders in a project. + +(Text on metadata adapted from the online course Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, +University of Edinburgh. MANTRA is licensed under a Creative Commons +Attribution 4.0 International +License.) + +## Exporting data + +**Question** + +- How can we export data from spreadsheets in a way that is useful for + downstream applications? + +**Objectives** + +- Store spreadsheet data in universal file formats. +- Export data from a spreadsheet to a CSV file. + +**Keypoints** + +- Data stored in common spreadsheet formats will often not be read + correctly into data analysis software, introducing errors into your + data. + +- Exporting data from spreadsheets to formats like CSV or TSV puts it + in a format that can be used consistently by most programs. + +Storing the data you're going to work with for your analyses in Excel's +default file format (`*.xls` or `*.xlsx` - depending on the Excel +version) isn't a good idea. Why? + +- Because it is a proprietary format, and it is possible that in the + future the technology needed to open it won't exist (or will become + rare enough) that opening the file is inconvenient, if not impossible. + +- Other spreadsheet software may not be able to open files saved in a + proprietary Excel format. + +- Different versions of Excel may handle data differently, leading to + inconsistencies.
[Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + is a well-documented example of inconsistencies in data storage. + +- Finally, more journals and grant agencies are requiring you to + deposit your data in a data repository, and most of them don't + accept Excel format. It needs to be in one of the formats discussed + below. + +- The above points also apply to other formats such as open data + formats used by LibreOffice / Open Office. These formats are not + static and do not get parsed the same way by different software + packages. + +Storing data in a universal, open, and static format will help deal +with this problem. Try tab-delimited (tab separated values or TSV) or +comma-delimited (comma separated values or CSV). CSV files are plain +text files where the columns are separated by commas, hence 'comma +separated values' or CSV. The advantage of a CSV file over an +Excel/SPSS/etc. file is that we can open and read a CSV file using +just about any software, including plain text editors like TextEdit or +NotePad. Data in a CSV file can also be easily imported into other +formats and environments, such as SQLite and R. We're not tied to a +certain version of a certain expensive program when we work with CSV +files, so it's a good format to work with for maximum portability and +endurance. Most spreadsheet programs can save to delimited text +formats like CSV easily, although they may give you a warning during +the file export. + +To save a file you have opened in Excel in CSV format: + +1. From the top menu select 'File' and 'Save as'. +2. In the 'Format' field, from the list, select 'Comma Separated + Values' (`*.csv`). +3. Double check the file name and the location where you want to save + it and hit 'Save'. + +An important note for backwards compatibility: you can open CSV files +in Excel! 
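
Because a CSV file is plain text, it can be written and read back with very little machinery. The sketch below uses Python's standard `csv` module on an in-memory buffer; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical records, used only to illustrate the round trip.
header = ["patient_id", "hospital", "weight_kg"]
records = [
    ["P1", "Hospital A", "70.5"],
    ["P2", "Hospital B", "82.0"],
]

# Write the table as CSV text into an in-memory buffer.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(records)
csv_text = buffer.getvalue()

# Read the same text back; each row becomes a dict keyed by the header.
reader = csv.DictReader(io.StringIO(csv_text, newline=""))
loaded = list(reader)
```

Every value comes back as a string, so numeric columns such as `weight_kg` still need an explicit conversion in the analysis code.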
+ +```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/excel-to-csv.png") +``` + +**A note on R and `xls`**: There are R packages that can read `xls` +files (as well as Google spreadsheets). It is even possible to access +different worksheets in the `xls` documents. + +**But** + +- Some of these only work on Windows. +- This equates to replacing a (simple but manual) export to `csv` with + additional complexity/dependencies in the data analysis R code. +- Data formatting best practices still apply. +- Is there really a good reason why `csv` (or similar) is not + adequate? + +### Caveats on commas + +In some datasets, the data values themselves may include commas +(,). In that case, the software which you use (including Excel) will +most likely split the data into columns incorrectly. This is because +the commas which are a part of the data values will be interpreted as +delimiters. + +For example, our data might look like this: + +``` +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent +``` + +In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the +value for `taxa` includes a comma (`Rodent, not censused`). If we try +to read the above into Excel (or other spreadsheet program), we will +get something like this: + +```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} +knitr::include_graphics("fig/csv-mistake.png") +``` + +The value for `taxa` was split into two columns (instead of being put +in one column `D`). This can propagate to a number of further +errors. For example, the extra column will be interpreted as a column +with many missing values (and without a proper header).
In addition to +that, the value in column `D` for the record in row 3 (so the one +where the value for 'taxa' contained the comma) is now incorrect. + +If you want to store your data in `csv` format and expect that your +data values may contain commas, you can avoid the problem discussed +above by putting the values in quotes (""). Applying this rule, our +data might look like this: + +``` +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" +``` + +Now opening this file as a `csv` in Excel will not lead to an extra +column, because Excel will only use commas that fall outside of +quotation marks as delimiting characters. + +Alternatively, if you are working with data that contains commas, you +likely will need to use another delimiter when working in a +spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and +working with TSV files. TSV files can be exported from spreadsheet +programs in the same way as CSV files. + +[^decsep]: This is particularly relevant in European + countries where the comma is used as a decimal + separator. In such cases, the default value separator in a + csv file will be the semi-colon (;), or values will be + systematically quoted. + +If you are working with an already existing dataset in which the data +values are not included in "" but which have commas as both delimiters +and parts of data values, you are potentially facing a major problem +with data cleaning. If the dataset you're dealing with contains +hundreds or thousands of records, cleaning them up manually (by either +removing commas from the data values or putting the values into +quotes - "") is not only going to take hours and hours but may +potentially end up with you accidentally introducing many errors. + +Cleaning up datasets is one of the major problems in many scientific +disciplines. 
The approach almost always depends on the particular +context. However, it is a good practice to clean the data in an +automated fashion, for example by writing and running a script. The +Python and R lessons will give you the basis for developing skills to +build relevant scripts. + +## Summary + +```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} +knitr::include_graphics("fig/analysis.png") +``` + +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This +iteration is repeated multiple times until the data is understood. In +many real-life cases, however, most time is spent cleaning up and +preparing the data, rather than actually analysing and understanding +it. + +An agile data analysis workflow, with several fast iterations of the +transform/visualise/model cycle is only feasible if the data is +formatted in a predictable way and one can reason about the data +without having to look at it and/or fix it. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Good data organization is the foundation of any research project. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From 7b8f627711779a24a017553538f2ecc335cde6c9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:25 +0900 Subject: [PATCH 005/334] New translations 10-data-organisation.md (Portuguese) --- locale/pt/episodes/10-data-organisation.Rmd | 791 ++++++++++++++++++++ 1 file changed, 791 insertions(+) create mode 100644 locale/pt/episodes/10-data-organisation.Rmd diff --git a/locale/pt/episodes/10-data-organisation.Rmd b/locale/pt/episodes/10-data-organisation.Rmd new file mode 100644 index 000000000..888061af7 --- /dev/null +++ b/locale/pt/episodes/10-data-organisation.Rmd @@ -0,0 +1,791 @@ +--- +source: Rmd +title: Data organisation with spreadsheets +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Learn about spreadsheets, their strengths and weaknesses. +- How do we format data in spreadsheets for effective data use? +- Learn about common spreadsheet errors and how to correct them. +- Organise your data according to tidy data principles. +- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to organise tabular data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Spreadsheet programs + +**Question** + +- What are basic principles for using spreadsheets for good data + organization? + +**Objective** + +- Describe best practices for organizing data so computers can make + the best use of datasets. + +**Keypoint** + +- Good data organization is the foundation of any research project. + +Good data organization is the foundation of your research +project. Most researchers have data or do data entry in +spreadsheets.
Spreadsheet programs are very useful graphical +interfaces for designing data tables and handling very basic data +quality control functions. See also @Broman:2018. + +### Spreadsheet outline + +Spreadsheets are good for data entry. Therefore we have a lot of data +in spreadsheets. Much of your time as a researcher will be spent in +this 'data wrangling' stage. It's not the most fun, but it's +necessary. We'll teach you how to think about data organisation and +some practices for more effective data wrangling. + +### What this lesson will not teach you + +- How to do _statistics_ in a spreadsheet +- How to do _plotting_ in a spreadsheet +- How to _write code_ in spreadsheet programs + +If you're looking to do this, a good reference is Head First +Excel, +published by O'Reilly. + +### Why aren't we teaching data analysis in spreadsheets + +- Data analysis in spreadsheets usually requires a lot of manual + work. If you change a parameter or run an analysis with a + new dataset, you usually have to redo everything by hand. (We do + know that you can create macros, but see the next point.) + +- It is also difficult to track or reproduce statistical or plotting + analyses done in spreadsheet programs when you want to go back to + your work or someone asks for details of your analysis. + +Many spreadsheet programs are available. Since most participants +use Excel as their primary spreadsheet program, this lesson +will use Excel examples. A free spreadsheet program that can also be used +is LibreOffice. Commands may differ a bit between programs, +but the general idea is the same. + +Spreadsheet programs encompass a lot of the things we need to be able to +do as researchers.
We can use them for: + +- Data entry +- Organizing data +- Subsetting and sorting data +- Statistics +- Plotting + +Spreadsheet programs use tables to represent and display data. Data +formatted as tables is also the main theme of this chapter, and +we will see how to organise data into tables in a standardised way that +ensures efficient downstream analysis. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Discuss the following points with your neighbour + +- Have you used spreadsheets, in your research, courses, + or at home? +- What kind of operations do you do in spreadsheets? +- Which ones do you think spreadsheets are good for? +- Have you accidentally done something in a spreadsheet program that made you + frustrated or sad? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Problems with spreadsheets + +Spreadsheets are good for data entry, but in reality we tend to +use spreadsheet programs for much more than data entry. We use them +to create data tables for publications, to generate summary +statistics, and to make figures. + +Generating tables for publications in a spreadsheet is not +ideal - often, when formatting a data table for publication, we are +reporting key summary statistics in a way that is not really meant to +be read as data, and it often involves special formatting +(merging cells, creating borders, making it pretty). We advise you to +do this sort of operation within your document editing software. + +The latter two applications, generating statistics and figures, +should be used with caution: because of the graphical, drag-and-drop nature of spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else's), particularly if your statistics +or figures require you to do more complex calculations.
Additionally, +when doing calculations in a spreadsheet, it's easy to accidentally apply a +slightly different formula to multiple adjacent cells. When using a command-line based statistics program +like R or SAS, it's practically +impossible to apply a calculation to one observation in your dataset +but not another unless you're doing it on purpose. + +### Using spreadsheets for data entry and cleaning + +In this lesson, we will assume that you are most likely using Excel as +your primary spreadsheet program - there are others (gnumeric, Calc +from OpenOffice), and their functionality is similar, but Excel seems +to be the program most used by biologists and biomedical researchers. + +In this lesson we're going to talk about: + +1. Formatting data tables in spreadsheets +2. Formatting problems +3. Exporting data + +## Formatting data tables in spreadsheets + +**Questions** + +- How do we format data in spreadsheets for effective data use? + +**Objectives** + +- Describe best practices for data entry and formatting in + spreadsheets. + +- Apply best practices to arrange variables and observations in a + spreadsheet. + +**Keypoints** + +- Never modify your raw data. Always make a copy before making any changes. + +- Keep track of all of the steps you take to clean your data in a + plain text file. + +- Organise your data according to tidy data principles. + +The most common mistake made is treating spreadsheet programs like lab +notebooks, that is, relying on context, notes in the margin, spatial layout +of data and fields to convey information. As humans, we can +(usually) interpret these things, but computers don't view information +the same way, and unless we explain to the computer what every single +thing means (and that can be hard!), it will not be able to see how +our data fits together.
+ +Using the power of computers, we can manage and analyse data in much +more effective and faster ways, but to use that power, we have to +set up our data so that the computer can understand it (and +computers are very literal). + +This is why it's extremely important to set up well-formatted tables +from the outset - before you even start entering data from your +very first experiment. Good data organization is the foundation of your research project. It can make it easier or harder to work with +your data throughout your analysis, so it's worth thinking about when +you're doing your data entry or setting up your experiment. You can +set things up in different ways in spreadsheets, but some of these +choices can limit your ability to work with the data in other programs +or make it harder for the you-of-6-months-from-now or your collaborator to work with +the data. + +**Note:** The best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is +important to take this into account, and ideally to automate the conversion +from one to another. + +### Keeping track of your analyses + +When you're working with spreadsheets, during data clean up or +analyses, it's very easy to end up with a spreadsheet that looks very +different from the one you started with. In order to be able to +reproduce your analyses or figure out what you did when a reviewer or +instructor asks for a different analysis, you should + +- create a new file with your cleaned or analysed data. Don't modify + the original dataset, or you will never know where you started! + +- keep track of the steps you took in your clean up or analysis. You + should track these steps as you would any step in a bench experiment. We + recommend that you do this in a plain text file stored in the same folder + as the data file.
+ +This might be an example of a spreadsheet setup: + +![](fig/spreadsheet-setup-updated.png) + +Put these principles into practice today during your exercises. + +While version control is out of scope for this course, you can look at the +Carpentries lesson on +['Git'](https://swcarpentry.github.io/git-novice/) to learn how to +maintain **version control** over your data. See also this blog +post for a quick tutorial, or +@Perez-Riverol:2016 for a more research-oriented use case. + +### Structuring data in spreadsheets + +The cardinal rules of using spreadsheet programs for data: + +1. Put all your variables in columns - the thing you're measuring, + like 'weight' or 'temperature'. +2. Put each observation in its own row. +3. Don't combine multiple pieces of information in one cell. Sometimes + it just seems like one thing, but think about whether that's the only way + you'll want to be able to use or sort that data. +4. Leave the raw data raw - don't change it! +5. Export the cleaned data to a text-based format like CSV + (comma-separated values). This ensures that anyone can use + the data, and is required by most data repositories. + +For instance, we have data from patients that visited several hospitals +in Brussels, Belgium. They recorded the date of the visit, +the hospital, and the patients' gender, weight and blood group. + +If we were to keep track of the data like this: + +![](fig/multiple-info.png) + +the problem is that the ABO and Rhesus groups are in the same `Blood` +type column. So, if they wanted to look at all observations of the A group +or look at weight distributions by ABO group, it would be tricky +to do this using this data setup. If instead we put the ABO and Rhesus groups +in different columns, you can see that it would be much easier.
+ +![](fig/single-info.png) + +An important rule when setting up a datasheet, is that **columns are +used for variables** and **rows are used for observations**: + +- columns are variables +- rows are observations +- cells are individual values + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: We're going to take a messy dataset and describe how we would clean it up. + +1. Download a messy dataset by clicking + [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). + +2. Open up the data in a spreadsheet program. + +3. You can see that there are two tabs. The data contains various + clinical variables recorded in several hospitals in Brussels during + the first and second COVID-19 waves in 2020. As you can see, the data + have been recorded differently during the March and November waves. Now you're the person in charge of this project and you want + to be able to start analysing the data. + +4. With the person next to you, identify what is wrong with this spreadsheet. Also discuss the steps you would need to take to clean up + the first and second wave tabs, and to put them all together in one spreadsheet. + +**Important:** Do not forget our first piece of advice: create a +new file (or tab) for the cleaned data, and never modify your original +(raw) data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +After you go through this exercise, we'll discuss as a group what was +wrong with this data and how you would fix it. + +<!-- - Take about 10 minutes to work on this exercise. --> + +<!-- - All the mistakes in the *common mistakes* section below are present --> + +<!-- in the messy dataset. If the exercise is done during a workshop, ask --> + +<!-- people what they saw as wrong with the data. As they bring up --> + +<!-- different points, you can refer to the common mistakes or expand a --> + +<!-- bit on the point they brought up.
--> + +<!-- - If you get a response where they've fixed the date, you can pause --> + +<!-- and go to the dates lesson. Or you can say you'll come back to dates --> + +<!-- at the end. --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Once you have tidied up the data, answer the following questions: + +- How many men and women took part in the study? +- How many A, AB, and B types have been tested? +- As above, but disregarding the contaminated samples? +- How many Rhesus + and - have been tested? +- How many universal donors (O-) have been tested? +- What is the average weight of AB men? +- How many samples have been tested in the different hospitals? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +An **excellent reference**, in particular with regard to R scripting +is the _Tidy Data_ paper @Wickham:2014. + +## Common spreadsheet errors + +**Questions** + +- What are some common challenges with formatting data in spreadsheets + and how can we avoid them? + +**Objectives** + +- Recognise and resolve common spreadsheet formatting problems. + +**Keypoints** + +- Avoid using multiple tables within one spreadsheet. +- Avoid spreading data across multiple tabs. +- Record zeros as zeros. +- Use an appropriate null value to record missing data. +- Don't use formatting to convey information or to make your spreadsheet look pretty. +- Place comments in a separate column. +- Record units in column headers. +- Include only one piece of information in a cell. +- Avoid spaces, numbers and special characters in column headers. +- Avoid special characters in your data. +- Record metadata in a separate plain text file. + +<!-- This lesson is meant to be used as a reference for discussion as --> + +<!-- learners identify issues with the messy dataset discussed in the --> + +<!-- previous lesson. Instructors: don't go through this lesson except to --> + +<!-- refer to responses to the exercise in the previous lesson. 
--> + +There are a few potential errors to be on the lookout for in your own +data as well as data from collaborators or the Internet. If you are +aware of the errors and their possible negative effect on downstream data analysis and result interpretation, it might motivate you and +your project members to try to avoid them. Making small changes +to the way you format your data in spreadsheets can have a great +impact on efficiency and reliability when it comes to data cleaning +and analysis. + +- [Using multiple tables](#tables) +- [Using multiple tabs](#tabs) +- [Not filling in zeros](#zeros) +- [Using problematic null values](#null) +- [Using formatting to convey information](#formatting) +- [Using formatting to make the data sheet look pretty](#formatting_pretty) +- [Placing comments or units in cells](#units) +- [Entering more than one piece of information in a cell](#info) +- [Using problematic field names](#field_name) +- [Using special characters in data](#special) +- [Inclusion of metadata in data table](#metadata) + +### Using multiple tables {#tables} + +A common strategy is creating multiple data tables within one +spreadsheet. This confuses the computer, so don't do this! When you +create multiple tables within one spreadsheet, you're drawing false +associations between things for the computer, which sees each row as +an observation. You're also potentially using the same field name in +multiple places, which will make it harder to clean your data up into +a usable form. The example below depicts the problem: + +![](fig/2_datasheet_example.jpg) + +In the example above, the computer will see (for example) row 4 and +assume that all columns A-AF refer to the same sample.
This row +actually represents four distinct samples (sample 1 for each of +four different collection dates - May 29th, June 12th, June 19th, and +June 26th), as well as some calculated summary statistics (an average (avr) +and a standard error of measurement (SEM)) for two of those samples. Other rows of the spreadsheet are similarly problematic. + +### Using multiple tabs {#tabs} + +But what about workbook tabs? That seems like an easy way to organise +data, right? Well, yes and no. When you create extra tabs, you prevent the computer from seeing connections in the data that are there (you +have to introduce spreadsheet-specific functions or scripting +to ensure this connection). Say, for instance, you make a separate +tab for each day you take a measurement. + +This isn't good practice for two reasons: + +1. you are more likely to accidentally add inconsistencies to your + data if each time you take a measurement, you start recording data + in a new tab, and + +2. even if you manage to prevent all inconsistencies from creeping in, + you will add an extra step for yourself before you analyse the data + because you will have to combine these data into a single + dataset. You will have to explicitly tell the computer how to + combine tabs - and if the tabs are inconsistently formatted, you + might even have to do it manually. + +The next time you're entering data and you go to create another tab +or table, ask yourself if you could avoid adding this tab by adding +another column to your original spreadsheet. We used multiple tabs in +our example of a messy data file, but now you've seen how you can +reorganise your data to consolidate across tabs. + +Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can't see your headers +at the top of the spreadsheet. But don't repeat your header +row. Repeated headers can easily get mixed into the data, leading to problems +down the road.
Instead, you can freeze the column +headers +so that they remain visible even when you have a spreadsheet with many rows. + +### Not filling in zeros {#zeros} + +It might be that when you're measuring something, it's usually a zero, +say the number of times a rabbit is observed in the survey. Why bother +writing in the number zero in that column, when it's mostly zeros? + +However, there's a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured or counted it. A blank cell means that it wasn't measured and the computer +will interpret it as an unknown value (also known as a null +or missing value). + +Spreadsheets or statistical programs will likely misinterpret +blank cells that you intend to be zeros. By not entering the value of +your observation, you are telling your computer to represent that data +as unknown or missing (null). This can cause problems with subsequent calculations or analyses. For example, the average of a set of numbers +which includes a single null value is always null (because the computer +can't guess the value of the missing observations). Because +of this, it's very important to record zeros as zeros and truly +missing data as nulls. + +### Using problematic null values {#null} + +**Example**: using -999 or other numerical values (or zero) to +represent missing data. + +**Solutions**: + +There are a few reasons why null values get represented differently +within a dataset. Sometimes confusing null values are automatically +recorded by the measuring device. If that's the case, there's not +much you can do, but it can be addressed in data cleaning with a tool +like +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +before analysis.
Other times different null values are used to convey +different reasons why the data isn't there. This is important +information to capture, but it is in effect using one column to capture +two pieces of information. Like for [using formatting to convey information](#formatting) it would be good here to create a new column +like 'data_missing' and use that column to capture the different reasons why the data is null. + +Whatever the reason, it's a problem if unknown or missing data is +recorded as -999, 999, or 0. + +Many statistical programs will not recognise that these values are intended +to represent missing (null) values. How these values are interpreted +will depend on the software you use to analyse your data. It is +essential to use a clearly defined and consistent null indicator. + +Blanks (most applications) and NA (for R) are good +choices. @White:2013 explain good choices for indicating null values +for different software applications in their article: + +![](fig/3_white_table_1.jpg) + +### Using formatting to convey information {#formatting} + +**Example**: highlighting cells, rows or columns that should be +excluded from an analysis, leaving blank rows to indicate +separations in data. + +![](fig/formatting.png) + +**Solution**: create a new field to encode which data should be +excluded. + +![](fig/good_formatting.png) + +### Using formatting to make the data sheet look pretty {#formatting_pretty} + +**Example**: merging cells. + +**Solution**: If you're not careful, formatting a worksheet to be more +aesthetically pleasing can compromise your computer's ability to see +associations in the data. Merged cells will make your data unreadable +by statistics software. Consider restructuring your data in such a way +that you will not need to merge cells to organise your data.
+ +### Placing comments or units in cells {#units} + +Most analysis software can't see Excel or LibreOffice comments, and +would be confused by comments placed within your data cells. As +described above for formatting, create another field if you need to +add notes to a row. Similarly, don't include units in cells: ideally, +all the measurements you place in one column should be in the same unit, but if for some reason they aren't, create another field and +specify the units the cell is in. + +### Entering more than one piece of information in a cell {#info} + +**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +B+, A-, ... + +**Solution**: Don't include more than one piece of information in a cell. Doing so will limit the ways in which you can analyse your data. If +you need both of these measurements, design your data sheet to include +this information. For example, include one column for the ABO group and +one for the Rhesus group. + +### Using problematic field names {#field_name} + +Choose descriptive field names, but be careful not to include spaces, +numbers, or special characters of any kind. Spaces can be +misinterpreted by parsers that use whitespace as delimiters and some programs +don't like field names that are text strings that start with +numbers. + +Underscores (`_`) are a good alternative to spaces. Consider writing +names in camel case (like this: ExampleFileName) to improve legibility. Remember that abbreviations that make sense at the moment +may not be so obvious in 6 months, but don't overdo it with names that +are excessively long. Including the units in the field names avoids +confusion and enables others to readily interpret your columns.
+ +**Examples** + +| Good Name | Good Alternative | Avoid | +| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | +| Max_temp_C | MaxTemp | Maximum Temp (°C) | +| Precipitation_mm | Precipitation | precmm | +| Mean_year_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. | +| cell_type | CellType | Cell Type | +| Observation_01 | first_observation | 1st Obs | + +### Using special characters in data {#special} + +**Example**: You treat your spreadsheet program as a word processor +when writing notes, for example copying data directly from Word or +other applications. + +**Solution**: This is a common strategy. For example, when writing +longer text in a cell, people often include line breaks, em-dashes, +etc. in their spreadsheet. Also, when copying data in from +applications such as Word, formatting and +non-standard characters (such as curly left and right quotation marks) are +included. When exporting this data into a coding/statistical environment +or into a database, dangerous things may occur, +such as lines being cut in half and encoding errors being thrown. + +General best practice is to avoid adding extra characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if +it were a simple web form that can only contain text and spaces. + +### Inclusion of metadata in data table {#metadata} + +**Example**: You add a legend at the top or bottom of your data table +explaining column meaning, units, exceptions, etc. + +**Solution**: Recording data about your data ("metadata") is +essential. You may know your data very well while you are collecting and analysing it, but the chances that you will still remember that the variable "sglmemgp" means single member of group, or the exact algorithm you used, shrink as the months pass.
+ +There are also many reasons why other people may want to examine or +use your data - to understand your findings, to verify your findings, +to review your submitted publication, to replicate your results, to +design a similar study, or even to archive your data for access and +re-use by others. While digital data are, by definition, +machine-readable, understanding their meaning is a job for human beings. The importance of documenting your data during the collection +and analysis phase of your research is hard to overstate, +especially if your research is going to be part of the scholarly record. + +However, metadata should not be contained in the data file +itself. Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information +is not data, and including it can disrupt how computer programs interpret your data file. Instead, metadata should be stored +as a separate file, preferably in the same directory as your data file, +in plain text format with a name that clearly associates it with your data file. Because metadata files are in free text format, they also +allow you to add comments, units, information about how null values +are encoded, etc. This information is important to document but can disrupt the formatting +of your data file. + +Additionally, file- or database-level metadata describe how the files that +make up the dataset relate to each other; what format they are in; and +whether they supersede or are superseded by previous files. A folder-level readme.txt +file is the classic way of accounting for all the files and folders +in a project.
+ +(Text on metadata adapted from the online course Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, +University of Edinburgh. MANTRA is licensed under a Creative Commons +Attribution 4.0 International +License.) + +## Exporting data + +**Question** + +- How can we export data from spreadsheets in a way that is useful for + downstream applications? + +**Objectives** + +- Store spreadsheet data in universal file formats. +- Export data from a spreadsheet to a CSV file. + +**Keypoints** + +- Data stored in common spreadsheet formats will often not be read + correctly into data analysis software, introducing errors into your + data. + +- Exporting data from spreadsheets to formats like CSV or TSV puts it + in a format that can be used consistently by most programs. + +Storing the data you're going to work with for your analyses in Excel's +default file format (`*.xls` or `*.xlsx` - depending on the Excel version) isn't a good idea. Why? + +- Because it is a proprietary format, and it is possible that in the + future the technology to read it won't exist (or will become sufficiently rare), + making it inconvenient, if not impossible, to open the file. + +- Other spreadsheet software may not be able to open files saved in a + proprietary Excel format. + +- Different versions of Excel may handle data differently, leading to + inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + are a well-documented example of inconsistencies in data storage. + +- Finally, more journals and grant agencies are requiring you to + deposit your data in a data repository, and most of them don't + accept Excel format. The data needs to be in one of the formats discussed + below. + +- The above points also apply to other formats such as open data + formats used by LibreOffice / Open Office.
These formats are not + static and are not parsed the same way by different software + packages. + +Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values or TSV) or +comma-delimited (comma separated values or CSV) formats. CSV files are plain text +files where the columns are separated by commas, hence 'comma +separated values' or CSV. The advantage of a CSV file over an +Excel/SPSS/etc. file is that we can open and read a CSV file using +just about any software, including plain text editors like TextEdit or +NotePad. Data in a CSV file can also be easily imported into other formats +and environments, such as SQLite and R. We're not tied to a certain version +of a certain expensive program when we work with CSV files, +so it's a good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text +formats like CSV easily, although they may give you a warning during +the file export that the original formatting will be lost. + +To save a file you have opened in Excel in CSV format: + +1. From the top menu select 'File' and 'Save as'. +2. In the 'Format' field, from the list, select 'Comma Separated + Values' (`*.csv`). +3. Double check the file name and the location where you want to save it and hit 'Save'. + +An important note for backwards compatibility: you can open CSV files +in Excel! + +```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/excel-to-csv.png") +``` + +**A note on R and `xls`**: There are R packages that can read `xls` +files (as well as Google spreadsheets). It is even possible to access +different worksheets in the `xls` documents.
+ +**But** + +- some of these only work on Windows. +- this equates to replacing a (simple but manual) export to `csv` with + additional complexity/dependencies in the data analysis R code. +- data formatting best practices still apply. +- Is there really a good reason why `csv` (or similar) is not + adequate? + +### Caveats on commas + +In some datasets, the data values themselves may include commas +(,). In that case, the software that you use (including Excel) will most likely +display the data in columns incorrectly. This is because +the commas that are part of the data values will be interpreted as +delimiters. + +For example, our data might look like this: + +``` +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent +``` + +In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the +value for `taxa` includes a comma (`Rodent, not censused`). If we try +to read the above into Excel (or other spreadsheet program), we will +get something like this: + +```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} +knitr::include_graphics("fig/csv-mistake.png") +``` + +The value for `taxa` was split into two columns (instead of being put +in one column `D`). This can propagate to a number of further +errors. For example, the extra column will be interpreted as a column +with many missing values (and without a proper header). In addition +to that, the value in column `D` for the record in row 3 (that is, the record +whose value for `taxa` contained the comma) is now incorrect. + +If you want to store your data in `csv` format and expect that your data values +may contain commas, you can avoid the problem discussed +above by putting the values in quotes ("").
Applying this rule, our +data might look like this: + +``` +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" +``` + +Now opening this file as a `csv` in Excel will not lead to an extra +column, because Excel will only use commas that fall outside of +quotation marks as delimiting characters. + +Alternatively, if you are working with data that contains commas, you +will likely need to use another delimiter when working in a spreadsheet +[^decsep]. In this case, consider using tabs as your delimiter and +working with TSV files. TSV files can be exported from spreadsheet +programs in the same way as CSV files. + +[^decsep]: This is particularly relevant in European + countries where the comma is used as a decimal + separator. In such cases, the default value separator in a + csv file will be the semi-colon (;), or values will be + systematically quoted. + +If you are working with an already existing dataset in which the data values +are not enclosed in "" but which has commas as both delimiters +and parts of data values, you are potentially facing a major problem +with data cleaning. If the dataset you're dealing with contains +hundreds or thousands of records, cleaning them up manually (by +either removing commas from the data values or putting the values into +quotes - "") is not only going to take hours and hours but may +potentially end up with you accidentally introducing many errors. + +Cleaning up datasets is one of the major problems in many scientific +disciplines. The approach almost always depends on the particular +context. However, it is a good practice to clean the data in an +automated fashion, for example by writing and running a script. The +Python and R lessons will give you the basis for +building relevant scripts.
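
As a minimal sketch of the quoting rule and of the script-based approach described above, the Python snippet below uses the standard-library `csv` module to write every field inside double quotes and then reads the text back. The rows are illustrative values modelled on the species table; the helper names `write_quoted_csv` and `read_csv_text` are hypothetical and not part of the lesson.

```python
import csv
import io

# Illustrative rows modelled on the species table; one value contains a comma.
rows = [
    ["species_id", "genus", "species", "taxa"],
    ["AB", "Amphispiza", "bilineata", "Bird"],
    ["AH", "Ammospermophilus", "harrisi", "Rodent, not censused"],
]


def write_quoted_csv(table):
    """Return CSV text in which every field is wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


def read_csv_text(text):
    """Parse CSV text into a list of rows, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text)))


quoted_text = write_quoted_csv(rows)
parsed = read_csv_text(quoted_text)

# A naive split on commas breaks the problematic record into five pieces,
# whereas the CSV parser keeps the quoted value together as one field.
naive_fields = "AH,Ammospermophilus,harrisi,Rodent, not censused".split(",")

print(quoted_text)
print(len(naive_fields), len(parsed[2]))
```

Unlike the hand-written example above, `csv.QUOTE_ALL` also quotes the header row; the module's default, `csv.QUOTE_MINIMAL`, would quote only the fields that need it.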
+ +## Summary + +```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} +knitr::include_graphics("fig/analysis.png") +``` + +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This iteration +is repeated multiple times until the data is understood. In +many real-life cases, however, most of the time is spent cleaning up and +preparing the data, rather than actually analysing it. + +An agile data analysis workflow, with several fast iterations of the +transform/visualise/model cycle, is only possible if the data is +formatted in a predictable way and one can reason about the data +without having to look at it and/or fix it. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Good data organization is the foundation of any research project. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From dd77231dc5b75ebe1a9acdef66837d41f6ff5d1c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:27 +0900 Subject: [PATCH 006/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 829 ++++++++++++++++++++ 1 file changed, 829 insertions(+) create mode 100644 locale/zh/episodes/10-data-organisation.Rmd diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd new file mode 100644 index 000000000..d52686828 --- /dev/null +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -0,0 +1,829 @@ +--- +source: Rmd +title: Data organisation with spreadsheets +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Learn about spreadsheets, their strengths and weaknesses. +- How do we format data in spreadsheets for effective data use?
+- Learn about common spreadsheet errors and how to correct them. +- Organise your data according to tidy data principles. +- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to organise tabular data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Spreadsheet programs + +**Question** + +- What are basic principles for using spreadsheets for good data + organization? + +**Objective** + +- Describe best practices for organizing data so computers can make + the best use of datasets. + +**Keypoint** + +- Good data organization is the foundation of any research project. + +Good data organization is the foundation of your research +project. Most researchers have data or do data entry in +spreadsheets. Spreadsheet programs are very useful graphical +interfaces for designing data tables and handling very basic data +quality control functions. See also @Broman:2018. + +### Spreadsheet outline + +Spreadsheets are good for data entry. Therefore we have a lot of data +in spreadsheets. Much of your time as a researcher will be spent in +this 'data wrangling' stage. It's not the most fun, but it's +necessary. We'll teach you how to think about data organization and +some practices for more effective data wrangling. + +### What this lesson will not teach you + +- How to do _statistics_ in a spreadsheet +- How to do _plotting_ in a spreadsheet +- How to _write code_ in spreadsheet programs + +If you're looking to do this, a good reference is Head First +Excel, +published by O'Reilly. + +### Why aren't we teaching data analysis in spreadsheets + +- Data analysis in spreadsheets usually requires a lot of manual + work. If you want to change a parameter or run an analysis with a + new dataset, you usually have to redo everything by hand. 
(We do + know that you can create macros, but see the next point.) + +- It is also difficult to track or reproduce statistical or plotting + analyses done in spreadsheet programs when you want to go back to + your work or someone asks for details of your analysis. + +Many spreadsheet programs are available. Since most participants +utilise Excel as their primary spreadsheet program, this lesson will +make use of Excel examples. A free spreadsheet program that can also +be used is LibreOffice. Commands may differ a bit between programs, +but the general idea is the same. + +Spreadsheet programs encompass a lot of the things we need to be able +to do as researchers. We can use them for: + +- Data entry +- Organizing data +- Subsetting and sorting data +- Statistics +- Plotting + +Spreadsheet programs use tables to represent and display data. Data +formatted as tables is also the main theme of this chapter, and we +will see how to organise data into tables in a standardised way to +ensure efficient downstream analysis. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Discuss the following points with your neighbour + +- Have you used spreadsheets, in your research, courses, + or at home? +- What kind of operations do you do in spreadsheets? +- Which ones do you think spreadsheets are good for? +- Have you accidentally done something in a spreadsheet program that made you + frustrated or sad? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Problems with spreadsheets + +Spreadsheets are good for data entry, but in reality we tend to +use spreadsheet programs for much more than data entry. We use them +to create data tables for publications, to generate summary +statistics, and make figures. 
+ +Generating tables for publications in a spreadsheet is not +optimal - often, when formatting a data table for publication, we're +reporting key summary statistics in a way that is not really meant to +be read as data, and often involves special formatting +(merging cells, creating borders, making it pretty). We advise you to +do this sort of operation within your document editing software. + +The latter two applications, generating statistics and figures, should +be used with caution: because of the graphical, drag and drop nature of +spreadsheet programs, it can be very difficult, if not impossible, to +replicate your steps (much less retrace anyone else's), particularly if your +stats or figures require you to do more complex calculations. Furthermore, +in doing calculations in a spreadsheet, it's easy to accidentally apply a +slightly different formula to multiple adjacent cells. When using a +command-line based statistics program like R or SAS, it's practically +impossible to apply a calculation to one observation in your +dataset but not another unless you're doing it on purpose. + +### Using spreadsheets for data entry and cleaning + +In this lesson, we will assume that you are most likely using Excel as +your primary spreadsheet program - there are others (gnumeric, Calc +from OpenOffice), and their functionality is similar, but Excel seems +to be the program most used by biologists and biomedical researchers. + +In this lesson we're going to talk about: + +1. Formatting data tables in spreadsheets +2. Formatting problems +3. Exporting data + +## Formatting data tables in spreadsheets + +**Questions** + +- How do we format data in spreadsheets for effective data use? + +**Objectives** + +- Describe best practices for data entry and formatting in + spreadsheets. + +- Apply best practices to arrange variables and observations in a + spreadsheet. + +**Keypoints** + +- Never modify your raw data. Always make a copy before making any + changes. 
+ +- Keep track of all of the steps you take to clean your data in a + plain text file. + +- Organise your data according to tidy data principles. + +The most common mistake made is treating spreadsheet programs like lab +notebooks, that is, relying on context, notes in the margin, spatial +layout of data and fields to convey information. As humans, we can +(usually) interpret these things, but computers don't view information +the same way, and unless we explain to the computer what every single +thing means (and that can be hard!), it will not be able to see how +our data fits together. + +Using the power of computers, we can manage and analyse data in much +more effective and faster ways, but to use that power, we have to set +up our data for the computer to be able to understand it (and +computers are very literal). + +This is why it's extremely important to set up well-formatted tables +from the outset - before you even start entering data from your very +first preliminary experiment. Data organization is the foundation of +your research project. It can make it easier or harder to work with +your data throughout your analysis, so it's worth thinking about when +you're doing your data entry or setting up your experiment. You can +set things up in different ways in spreadsheets, but some of these +choices can limit your ability to work with the data in other programs +or have the you-of-6-months-from-now or your collaborator work with +the data. + +**Note:** the best layouts/formats (as well as software and +interfaces) for data entry and data analysis might be different. It is +important to take this into account, and ideally automate the +conversion from one to another. + +### Keeping track of your analyses + +When you're working with spreadsheets, during data clean up or +analyses, it's very easy to end up with a spreadsheet that looks very +different from the one you started with. 
In order to be able to +reproduce your analyses or figure out what you did when a reviewer or +instructor asks for a different analysis, you should + +- create a new file with your cleaned or analysed data. Don't modify + the original dataset, or you will never know where you started! + +- keep track of the steps you took in your clean up or analysis. You + should track these steps as you would any step in an experiment. We + recommend that you do this in a plain text file stored in the same + folder as the data file. + +This might be an example of a spreadsheet setup: + +![](fig/spreadsheet-setup-updated.png) + +Put these principles into practice today during your exercises. + +While versioning is out of scope for this course, you can look at the +Carpentries lesson on +['Git'](https://swcarpentry.github.io/git-novice/) to learn how to +maintain **version control** over your data. See also this blog +post for a quick tutorial or +@Perez-Riverol:2016 for a more research-oriented use-case. + +### Structuring data in spreadsheets + +The cardinal rules of using spreadsheet programs for data: + +1. Put all your variables in columns - the thing you're measuring, + like 'weight' or 'temperature'. +2. Put each observation in its own row. +3. Don't combine multiple pieces of information in one cell. Sometimes + it just seems like one thing, but think if that's the only way + you'll want to be able to use or sort that data. +4. Leave the raw data raw - don't change it! +5. Export the cleaned data to a text-based format like CSV + (comma-separated values). This ensures that anyone can use + the data, and is required by most data repositories. + +For instance, we have data from patients that visited several +hospitals in Brussels, Belgium. The hospitals recorded the date of the visit, +the hospital, the patients' gender, weight and blood group.
+ +If we were to keep track of the data like this: + +![](fig/multiple-info.png) + +the problem is that the ABO and Rhesus groups are in the same `Blood` +type column. So, if they wanted to look at all observations of the A +group or look at weight distributions by ABO group, it would be tricky +to do this using this data setup. If instead we put the ABO and Rhesus +groups in different columns, you can see that it would be much easier. + +![](fig/single-info.png) + +An important rule when setting up a datasheet, is that **columns are +used for variables** and **rows are used for observations**: + +- columns are variables +- rows are observations +- cells are individual values + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: We're going to take a messy dataset and describe how we would clean it up. + +1. Download a messy dataset by clicking + [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). + +2. Open up the data in a spreadsheet program. + +3. You can see that there are two tabs. The data contains various + clinical variables recorded in various hospitals in Brussels during + the first and second COVID-19 waves in 2020. As you can see, the + data have been recorded differently during the March and November + waves. Now you're the person in charge of this project and you want + to be able to start analyzing the data. + +4. With the person next to you, identify what is wrong with this + spreadsheet. Also discuss the steps you would need to take to clean + up first and second wave tabs, and to put them all together in one + spreadsheet. + +**Important:** Do not forget our first piece of advice: to create a +new file (or tab) for the cleaned data, never modify your original +(raw) data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +After you go through this exercise, we'll discuss as a group what was +wrong with this data and how you would fix it. 
+ +<!-- - Take about 10 minutes to work on this exercise. --> + +<!-- - All the mistakes in the *common mistakes* section below are present --> + +<!-- in the messy dataset. If the exercise is done during a workshop, ask --> + +<!-- people what they saw as wrong with the data. As they bring up --> + +<!-- different points, you can refer to the common mistakes or expand a --> + +<!-- bit on the point they brought up. --> + +<!-- - If you get a response where they've fixed the date, you can pause --> + +<!-- and go to the dates lesson. Or you can say you'll come back to dates --> + +<!-- at the end. --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: Once you have tidied up the data, answer the following questions: + +- How many men and women took part in the study? +- How many A, AB, and B types have been tested? +- As above, but disregarding the contaminated samples? +- How many Rhesus + and - have been tested? +- How many universal donors (O-) have been tested? +- What is the average weight of AB men? +- How many samples have been tested in the different hospitals? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +An **excellent reference**, in particular with regard to R scripting +is the _Tidy Data_ paper @Wickham:2014. + +## Common spreadsheet errors + +**Questions** + +- What are some common challenges with formatting data in spreadsheets + and how can we avoid them? + +**Objectives** + +- Recognise and resolve common spreadsheet formatting problems. + +**Keypoints** + +- Avoid using multiple tables within one spreadsheet. +- Avoid spreading data across multiple tabs. +- Record zeros as zeros. +- Use an appropriate null value to record missing data. +- Don't use formatting to convey information or to make your spreadsheet look pretty. +- Place comments in a separate column. +- Record units in column headers. +- Include only one piece of information in a cell. +- Avoid spaces, numbers and special characters in column headers. 
+- Avoid special characters in your data. +- Record metadata in a separate plain text file. + +<!-- This lesson is meant to be used as a reference for discussion as --> + +<!-- learners identify issues with the messy dataset discussed in the --> + +<!-- previous lesson. Instructors: don't go through this lesson except to --> + +<!-- refer to responses to the exercise in the previous lesson. --> + +There are a few potential errors to be on the lookout for in your own +data as well as data from collaborators or the Internet. If you are +aware of the errors and the possible negative effect on downstream +data analysis and result interpretation, it might motivate you +and your project members to try and avoid them. Making small changes +to the way you format your data in spreadsheets can have a great +impact on efficiency and reliability when it comes to data cleaning +and analysis. + +- [Using multiple tables](#tables) +- [Using multiple tabs](#tabs) +- [Not filling in zeros](#zeros) +- [Using problematic null values](#null) +- [Using formatting to convey information](#formatting) +- [Using formatting to make the data sheet look pretty](#formatting_pretty) +- [Placing comments or units in cells](#units) +- [Entering more than one piece of information in a cell](#info) +- [Using problematic field names](#field_name) +- [Using special characters in data](#special) +- [Inclusion of metadata in data table](#metadata) + +### Using multiple tables {#tables} + +A common strategy is creating multiple data tables within one +spreadsheet. This confuses the computer, so don't do this! When you +create multiple tables within one spreadsheet, you're drawing false +associations between things for the computer, which sees each row as +an observation. You're also potentially using the same field name in +multiple places, which will make it harder to clean your data up into +a usable form.
The example below depicts the problem: + +![](fig/2_datasheet_example.jpg) + +In the example above, the computer will see (for example) row 4 and +assume that all columns A-AF refer to the same sample. This row +actually represents four distinct samples (sample 1 for each of four +different collection dates - May 29th, June 12th, June 19th, and June +26th), as well as some calculated summary statistics (an average (avr) +and standard error of measurement (SEM)) for two of those +samples. Other rows are similarly problematic. + +### Using multiple tabs {#tabs} + +But what about workbook tabs? That seems like an easy way to organise +data, right? Well, yes and no. When you create extra tabs, you fail to +allow the computer to see connections in the data that are there (you +have to introduce spreadsheet application-specific functions or +scripting to ensure this connection). Say, for instance, you make a +separate tab for each day you take a measurement. + +This isn't good practice for two reasons: + +1. you are more likely to accidentally add inconsistencies to your + data if each time you take a measurement, you start recording data + in a new tab, and + +2. even if you manage to prevent all inconsistencies from creeping in, + you will add an extra step for yourself before you analyse the data + because you will have to combine these data into a single + datatable. You will have to explicitly tell the computer how to + combine tabs - and if the tabs are inconsistently formatted, you + might even have to do it manually. + +The next time you're entering data, and you go to create another tab +or table, ask yourself if you could avoid adding this tab by adding +another column to your original spreadsheet. We used multiple tabs in +our example of a messy data file, but now you've seen how you can +reorganise your data to consolidate across tabs. + +Your data sheet might get very long over the course of the +experiment. 
This makes it harder to enter data if you can't see your +headers at the top of the spreadsheet. But don't repeat your header +row. These can easily get mixed into the data, leading to problems +down the road. Instead you can freeze the column +headers +so that they remain visible even when you have a spreadsheet with many +rows. + +### Not filling in zeros {#zeros} + +It might be that when you're measuring something, it's usually a zero, +say the number of times a rabbit is observed in the survey. Why bother +writing in the number zero in that column, when it's mostly zeros? + +However, there's a difference between a zero and a blank cell in a +spreadsheet. To the computer, a zero is actually data. You measured or +counted it. A blank cell means that it wasn't measured and the +computer will interpret it as an unknown value (also known as a null +or missing value). + +The spreadsheets or statistical programs will likely misinterpret +blank cells that you intend to be zeros. By not entering the value of +your observation, you are telling your computer to represent that data +as unknown or missing (null). This can cause problems with subsequent +calculations or analyses. For example, the average of a set of numbers +which includes a single null value is always null (because the +computer can't guess the value of the missing observations). Because +of this, it's very important to record zeros as zeros and truly +missing data as nulls. + +### Using problematic null values {#null} + +**Example**: using -999 or other numerical values (or zero) to +represent missing data. + +**Solutions**: + +There are a few reasons why null values get represented differently +within a dataset. Sometimes confusing null values are automatically +recorded from the measuring device. If that's the case, there's not +much you can do, but it can be addressed in data cleaning with a tool +like +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +before analysis. 
Other times different null values are used to convey +different reasons why the data isn't there. This is important +information to capture, but is in effect using one column to capture +two pieces of information. As with using formatting to convey +information, it would be better here to create a new +column such as 'data_missing' and use that column to capture the +different reasons. + +Whatever the reason, it's a problem if unknown or missing data is +recorded as -999, 999, or 0. + +Many statistical programs will not recognise that these are intended +to represent missing (null) values. How these values are interpreted +will depend on the software you use to analyse your data. It is +essential to use a clearly defined and consistent null indicator. + +Blanks (most applications) and NA (for R) are good +choices. @White:2013 explain good choices for indicating null values +for different software applications in their article: + +![](fig/3_white_table_1.jpg) + +### Using formatting to convey information {#formatting} + +**Example**: highlighting cells, rows or columns that should be +excluded from an analysis, leaving blank rows to indicate +separations in data. + +![](fig/formatting.png) + +**Solution**: create a new field to encode which data should be +excluded. + +![](fig/good_formatting.png) + +### Using formatting to make the data sheet look pretty {#formatting_pretty} + +**Example**: merging cells. + +**Solution**: If you're not careful, formatting a worksheet to be more +aesthetically pleasing can compromise your computer's ability to see +associations in the data. Merged cells will make your data unreadable +by statistics software. Consider restructuring your data in such a way +that you will not need to merge cells to organise your data. + +### Placing comments or units in cells {#units} + +Most analysis software can't see Excel or LibreOffice comments, and +would be confused by comments placed within your data cells.
As +described above for formatting, create another field if you need to +add notes to cells. Similarly, don't include units in cells: ideally, +all the measurements you place in one column should be in the same +unit, but if for some reason they aren't, create another field and +specify the units the cell is in. + +### Entering more than one piece of information in a cell {#info} + +**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +B+, A-, ... + +**Solution**: Don't include more than one piece of information in a +cell. This will limit the ways in which you can analyse your data. If +you need both these measurements, design your data sheet to include +this information. For example, include one column for the ABO group and +one for the Rhesus group. + +### Using problematic field names {#field_name} + +Choose descriptive field names, but be careful not to include spaces, +numbers, or special characters of any kind. Spaces can be +misinterpreted by parsers that use whitespace as delimiters and some +programs don't like field names that are text strings that start with +numbers. + +Underscores (`_`) are a good alternative to spaces. Consider writing +names in camel case (like this: ExampleFileName) to improve +readability. Remember that abbreviations that make sense at the moment +may not be so obvious in 6 months, but don't overdo it with names that +are excessively long. Including the units in the field names avoids +confusion and enables others to readily interpret your fields. + +**Examples** + +| Good Name | Good Alternative | Avoid | +| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | +| Max_temp_C | MaxTemp | Maximum Temp (°C) | +| Precipitation_mm | Precipitation | precmm | +| Mean_year_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell_type | CellType | Cell Type | +| Observation_01 | first_observation | 1st Obs | + +### Using special characters in data {#special} + +**Example**: You treat your spreadsheet program as a word processor +when writing notes, for example copying data directly from Word or +other applications. + +**Solution**: This is a common strategy. For example, when writing +longer text in a cell, people often include line breaks, em-dashes, +etc. in their spreadsheet. Also, when copying data in from +applications such as Word, formatting and fancy non-standard +characters (such as left- and right-aligned quotation marks) are +included. When exporting this data into a coding/statistical +environment or into a relational database, dangerous things may occur, +such as lines being cut in half and encoding errors being thrown. + +General best practice is to avoid adding characters such as newlines, +tabs, and vertical tabs. In other words, treat a text cell as if it +were a simple web form that can only contain text and spaces. + +### Inclusion of metadata in data table {#metadata} + +**Example**: You add a legend at the top or bottom of your data table +explaining column meaning, units, exceptions, etc. + +**Solution**: Recording data about your data ("metadata") is +essential. You may be on intimate terms with your dataset while you +are collecting and analysing it, but the chances that you will still +remember that the variable "sglmemgp" means single member of group, +for example, or the exact algorithm you used to transform a variable +or create a derived one, after a few months, a year, or more are slim. + +As well, there are many reasons other people may want to examine or +use your data - to understand your findings, to verify your findings, +to review your submitted publication, to replicate your results, to +design a similar study, or even to archive your data for access and +re-use by others. 
While digital data by definition are +machine-readable, understanding their meaning is a job for human +beings. The importance of documenting your data during the collection +and analysis phase of your research cannot be overestimated, +especially if your research is going to be part of the scholarly +record. + +However, metadata should not be contained in the data file +itself. Unlike a table in a paper or a supplemental file, metadata (in +the form of legends) should not be included in a data file since this +information is not data, and including it can disrupt how computer +programs interpret your data file. Rather, metadata should be stored +as a separate file in the same directory as your data file, preferably +in plain text format with a name that clearly associates it with your +data file. Because metadata files are free text format, they also +allow you to encode comments, units, information about how null values +are encoded, etc. that are important to document but can disrupt the +formatting of your data file. + +Additionally, file or database level metadata describes how files that +make up the dataset relate to each other; what format they are in; and +whether they supersede or are superseded by previous files. A +folder-level readme.txt file is the classic way of accounting for all +the files and folders in a project. + +(Text on metadata adapted from the online course Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, +University of Edinburgh. MANTRA is licensed under a Creative Commons +Attribution 4.0 International +License.) + +## Exporting data + +**Question** + +- How can we export data from spreadsheets in a way that is useful for + downstream applications? + +**Objectives** + +- Store spreadsheet data in universal file formats. +- Export data from a spreadsheet to a CSV file.
+ +**Keypoints** + +- Data stored in common spreadsheet formats will often not be read + correctly into data analysis software, introducing errors into your + data. + +- Exporting data from spreadsheets to formats like CSV or TSV puts it + in a format that can be used consistently by most programs. + +Storing the data you're going to work with for your analyses in Excel's +default file format (`*.xls` or `*.xlsx` - depending on the Excel +version) isn't a good idea. Why? + +- Because it is a proprietary format, and it is possible that in the + future the technology needed to read it won't exist (or will become + sufficiently rare), making it inconvenient, if not impossible, to open the file. + +- Other spreadsheet software may not be able to open files saved in a + proprietary Excel format. + +- Different versions of Excel may handle data differently, leading to + inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + is a well-documented example of inconsistencies in data storage. + +- Finally, more journals and grant agencies are requiring you to + deposit your data in a data repository, and most of them don't + accept Excel format. It needs to be in one of the formats discussed + below. + +- The above points also apply to other formats such as open data + formats used by LibreOffice / Open Office. These formats are not + static and do not get parsed the same way by different software + packages. + +Storing data in a universal, open, and static format will help deal +with this problem. Try tab-delimited (tab separated values or TSV) or +comma-delimited (comma separated values or CSV). CSV files are plain +text files where the columns are separated by commas, hence 'comma +separated values' or CSV. The advantage of a CSV file over an +Excel/SPSS/etc. file is that we can open and read a CSV file using +just about any software, including plain text editors like TextEdit or +NotePad.
Data in a CSV file can also be easily imported into other +formats and environments, such as SQLite and R. We're not tied to a +certain version of a certain expensive program when we work with CSV +files, so it's a good format to work with for maximum portability and +endurance. Most spreadsheet programs can save to delimited text +formats like CSV easily, although they may give you a warning during +the file export. + +To save a file you have opened in Excel in CSV format: + +1. From the top menu select 'File' and 'Save as'. +2. In the 'Format' field, from the list, select 'Comma Separated + Values' (`*.csv`). +3. Double check the file name and the location where you want to save + it and hit 'Save'. + +An important note for backwards compatibility: you can open CSV files +in Excel! + +```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/excel-to-csv.png") +``` + +**A note on R and `xls`**: There are R packages that can read `xls` +files (as well as Google spreadsheets). It is even possible to access +different worksheets in the `xls` documents. + +**But** + +- some of these only work on Windows. +- this equates to replacing a (simple but manual) export to `csv` with + additional complexity/dependencies in the data analysis R code. +- data formatting best practices still apply. +- Is there really a good reason why `csv` (or similar) is not + adequate? + +### Caveats on commas + +In some datasets, the data values themselves may include commas +(,). In that case, the software which you use (including Excel) will +most likely incorrectly display the data in columns. This is because +the commas which are a part of the data values will be interpreted as +delimiters.
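
The effect of commas inside data values can also be checked from R. The sketch below is only an illustration: it uses a small made-up dataset written to temporary files (the object and file names are assumptions, not part of the lesson data) and counts how many comma-separated fields R finds on each line, with and without quotes around the value that contains a comma.

```r
# Hypothetical two-record dataset (not one of the lesson files); the second
# taxa value contains a comma.
unquoted <- c("species_id,taxa",
              "AB,Bird",
              "AH,Rodent, not censused")
quoted <- c("species_id,taxa",
            "AB,Bird",
            "AH,\"Rodent, not censused\"")

# Write both versions to temporary CSV files.
unquoted_file <- tempfile(fileext = ".csv")
quoted_file <- tempfile(fileext = ".csv")
writeLines(unquoted, unquoted_file)
writeLines(quoted, quoted_file)

# Count the comma-separated fields on each line of each file.
fields_unquoted <- count.fields(unquoted_file, sep = ",")
fields_quoted <- count.fields(quoted_file, sep = ",")
print(fields_unquoted)  # the last record has one more field than the header
print(fields_quoted)    # every line has the same number of fields

# With quoting, the value is read back as a single entry.
clean <- read.csv(quoted_file)
print(clean$taxa)
```

A check like this, run before analysis, is one way to spot records whose number of fields does not match the header.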
+ +For example, our data might look like this: + +``` +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent +``` + +In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the +value for `taxa` includes a comma (`Rodent, not censused`). If we try +to read the above into Excel (or other spreadsheet program), we will +get something like this: + +```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} +knitr::include_graphics("fig/csv-mistake.png") +``` + +The value for `taxa` was split into two columns (instead of being put +in one column `D`). This can propagate to a number of further +errors. For example, the extra column will be interpreted as a column +with many missing values (and without a proper header). In addition to +that, the value in column `D` for the record in row 3 (so the one +where the value for 'taxa' contained the comma) is now incorrect. + +If you want to store your data in `csv` format and expect that your +data values may contain commas, you can avoid the problem discussed +above by putting the values in quotes (""). Applying this rule, our +data might look like this: + +``` +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" +``` + +Now opening this file as a `csv` in Excel will not lead to an extra +column, because Excel will only use commas that fall outside of +quotation marks as delimiting characters. + +Alternatively, if you are working with data that contains commas, you +likely will need to use another delimiter when working in a +spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and +working with TSV files. 
TSV files can be exported from spreadsheet +programs in the same way as CSV files. + +[^decsep]: This is particularly relevant in European + countries where the comma is used as a decimal + separator. In such cases, the default value separator in a + csv file will be the semi-colon (;), or values will be + systematically quoted. + +If you are working with an already existing dataset in which the data +values are not included in "" but which have commas as both delimiters +and parts of data values, you are potentially facing a major problem +with data cleaning. If the dataset you're dealing with contains +hundreds or thousands of records, cleaning them up manually (by either +removing commas from the data values or putting the values into +quotes - "") is not only going to take hours and hours but may +potentially end up with you accidentally introducing many errors. + +Cleaning up datasets is one of the major problems in many scientific +disciplines. The approach almost always depends on the particular +context. However, it is a good practice to clean the data in an +automated fashion, for example by writing and running a script. The +Python and R lessons will give you the basis for developing skills to +build relevant scripts. + +## Summary + +```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} +knitr::include_graphics("fig/analysis.png") +``` + +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This +iteration is repeated multiple times until the data is understood. In +many real-life cases, however, most time is spent cleaning up and +preparing the data, rather than actually analysing and understanding +it. 
+ +An agile data analysis workflow, with several fast iterations of the +transform/visualise/model cycle is only feasible if the data is +formatted in a predictable way and one can reason about the data +without having to look at it and/or fix it. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Good data organization is the foundation of any research project. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 0e9666d80ce975f1352a02a1fd22c8e00f21dd28 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:29 +0900 Subject: [PATCH 007/334] New translations 20-r-rstudio.md (French) --- locale/fr/episodes/20-r-rstudio.Rmd | 665 ++++++++++++++++++++++++++++ 1 file changed, 665 insertions(+) create mode 100644 locale/fr/episodes/20-r-rstudio.Rmd diff --git a/locale/fr/episodes/20-r-rstudio.Rmd b/locale/fr/episodes/20-r-rstudio.Rmd new file mode 100644 index 000000000..ad0b73472 --- /dev/null +++ b/locale/fr/episodes/20-r-rstudio.Rmd @@ -0,0 +1,665 @@ +--- +source: Rmd +title: R and RStudio +teaching: 30 +exercises: 0 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. +- Use the built-in RStudio help interface to search for more information on R functions. +- Demonstrate how to provide sufficient information for troubleshooting with the R user community. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are R and RStudio? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## What is R? What is RStudio? 
+ +The term [R](https://www.r-project.org/) is used to refer to the +_programming language_, the _environment for statistical computing_ +and _the software_ that interprets the scripts written using it. + +[RStudio](https://rstudio.com) is currently a very popular way to not +only write your R scripts but also to interact with the R +software[^plainr]. To function correctly, RStudio needs R and +therefore both need to be installed on your computer. + +[^plainr]: As opposed to using R directly from the command line + console. There exist other software that interface and integrate + with R, but RStudio is particularly well suited for beginners + while providing numerous very advanced features. + +The RStudio IDE Cheat +Sheet +provides much more information than will be covered here, but can be +useful to learn keyboard shortcuts and discover new features. + +## Why learn R? + +### R does not involve lots of pointing and clicking, and that's a good thing + +The learning curve might be steeper than with other software, but with +R, the results of your analysis do not rely on remembering a +succession of pointing and clicking, but instead on a series of +written commands, and that's a good thing! So, if you want to redo +your analysis because you collected more data, you don't have to +remember which button you clicked in which order to obtain your +results; you just have to run your script again. + +Working with scripts makes the steps you used in your analysis clear, +and the code you write can be inspected by someone else who can give +you feedback and spot mistakes. + +Working with scripts forces you to have a deeper understanding of what +you are doing, and facilitates your learning and comprehension of the +methods you use. + +### R code is great for reproducibility + +Reproducibility means that someone else (including your future self) can +obtain the same results from the same dataset when using the same +analysis code. 
+ +R integrates with other tools to generate manuscripts or reports from your +code. If you collect more data, or fix a mistake in your dataset, the +figures and the statistical tests in your manuscript or report are updated +automatically. + +An increasing number of journals and funding agencies expect analyses +to be reproducible, so knowing R will give you an edge with these +requirements. + +### R is interdisciplinary and extensible + +With 10000+ packages[^whatarepkgs] that can be installed to extend its +capabilities, R provides a framework that allows you to combine +statistical approaches from many scientific disciplines to best suit +the analytical framework you need to analyse your data. For instance, +R has packages for image analysis, GIS, time series, population +genetics, and a lot more. + +[^whatarepkgs]: i.e. add-ons that confer R with new functionality, + such as bioinformatics data analysis. + +```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/cran.png") +``` + +### R works on data of all shapes and sizes + +The skills you learn with R scale easily with the size of your +dataset. Whether your dataset has hundreds or millions of lines, it +won't make much difference to you. + +R is designed for data analysis. It comes with special data structures +and data types that make handling of missing data and statistical +factors convenient. + +R can connect to spreadsheets, databases, and many other data formats, +on your computer or on the web. + +### R produces high-quality graphics + +The plotting functionalities in R are extensive, and allow you to adjust +any aspect of your graph to convey most effectively the message from +your data. + +### R has a large and welcoming community + +Thousands of people use R daily. 
Many of them are willing to help you +through mailing lists and websites such as Stack +Overflow, or on the RStudio +community. These broad user communities +extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). + +### Not only is R free, but it is also open-source and cross-platform + +Anyone can inspect the source code to see how R works. Because of this +transparency, there is less chance for mistakes, and if you (or +someone else) find some, you can report and fix bugs. + +## Knowing your way around RStudio + +Let's start by learning about [RStudio](https://www.rstudio.com/), +which is an Integrated Development Environment (IDE) for working with +R. + +The RStudio IDE open-source product is free under the Affero General +Public License (AGPL) v3. +The RStudio IDE is also available with a commercial license and +priority email support from Posit, Inc. + +We will use the RStudio IDE to write code, navigate the files on our +computer, inspect the variables we are going to create, and visualise +the plots we will generate. RStudio can also be used for other things +(e.g., version control, developing packages, writing Shiny apps) that +we will not cover during the workshop. + +```{r, results="markup", fig.cap="RStudio interface screenshot. 
Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/rstudio-screenshot.png") +``` + +The RStudio window is divided into 4 "Panes": + +- the **Source** for your scripts and documents (top-left, in the + default layout) +- your **Environment/History** (top-right), +- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and +- the R **Console** (bottom-left). + +The placement of these panes and their content can be customised (see +menu, `Tools -> Global Options -> Pane Layout`). + +One of the advantages of using RStudio is that all the information you +need to write code is available in a single window. Additionally, with +many shortcuts, **autocompletion**, and **highlighting** for the major +file types you use while developing in R, RStudio will make typing +easier and less error-prone. + +## Getting set up + +It is good practice to keep a set of related data, analyses, and text +self-contained in a single folder, called the **working +directory**. All of the scripts within this folder can then use +**relative paths** to files that indicate where inside the project a +file is located (as opposed to absolute paths, which point to where a +file is on a specific computer). Working this way makes it a lot +easier to move your project around on your computer and share it with +others without worrying about whether or not the underlying scripts +will still work. + +RStudio provides a helpful set of tools to do this through its "Projects" +interface, which not only creates a working directory for you, but also remembers +its location (allowing you to quickly navigate to it) and optionally preserves +custom settings and open files to make it easier to resume work after a +break. Go through the steps for creating an "R Project" for this +tutorial below. + +1. Start RStudio. +2. Under the `File` menu, click on `New project`. 
Choose `New directory`, then + `New project`. +3. Enter a name for this new folder (or "directory"), and choose a + convenient location for it. This will be your **working directory** + for this session (or whole course) (e.g., `bioc-intro`). +4. Click on `Create project`. +5. (Optional) Set Preferences to 'Never' save workspace in RStudio. + +RStudio's default preferences generally work well, but saving a workspace to +.RData can be cumbersome, especially if you are working with larger datasets. +To turn that off, go to Tools --> 'Global Options' and select the 'Never' option +for 'Save workspace to .RData' on exit. + +```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudio-preferences.png") +``` + +To avoid character encoding issues between Windows and other operating +systems, we are +going to set UTF-8 by default: + +```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/utf8.png") +``` + +### Organizing your working directory + +Using a consistent folder structure across your projects will help keep things +organised, and will also make it easy to find/file things in the future. This +can be especially helpful when you have multiple projects. In general, you may +create directories (folders) for **scripts**, **data**, and **documents**. + +- **`data/`** Use this folder to store your raw data and intermediate + datasets you may create for the need of a particular analysis. For + the sake of transparency and + [provenance](https://en.wikipedia.org/wiki/Provenance), you should + _always_ keep a copy of your raw data accessible and do as much of + your data cleanup and preprocessing programmatically (i.e., with + scripts, rather than manually) as possible. 
Separating raw data + from processed data is also a good idea. For example, you could + have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept + separate from a `data/processed/tree.survey.csv` file generated by + the `scripts/01.preprocess.tree_survey.R` script. +- **`documents/`** This would be a place to keep outlines, drafts, + and other text. +- **`scripts/`** (or `src`) This would be the location to keep your R + scripts for different analyses or plotting, and potentially a + separate folder for your functions (more on that later). + +You may want additional directories or subdirectories depending on +your project needs, but these should form the backbone of your working +directory. + +```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/working-directory-structure.png") +``` + +For this course, we will need a `data/` folder to store our raw data, +and we will use `data_output/` for when we learn how to export data as +CSV files, and a `fig_output/` folder for the figures that we will save. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: create your project directory structure + +Under the `Files` tab on the right of the screen, click on `New Folder` and +create a folder named `data` within your newly created working directory +(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at +your R console.) Repeat these operations to create the `data_output/` and +`fig_output/` folders. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We are going to keep the script in the root of our working directory +because we are only going to use one file and it will make things +easier.
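
The folder creation described in the challenge can also be scripted. The following is a minimal sketch, assuming it is run from the root of the working directory; the `folders` vector is just an illustrative name.

```r
# Folder names used in this course, relative to the working directory.
folders <- c("data", "data_output", "fig_output")

# Create each folder only if it does not exist yet, so that the code can be
# run again without warnings.
for (folder in folders) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Check that all three folders are now present.
dir.exists(folders)
```

Because the paths are relative, the same lines work unchanged on any computer where the project folder is the working directory.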
+ +Your working directory should now look like this: + +```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +``` + +**Project management** is also applicable to bioinformatics projects, +of course[^bioindatascience]. William Noble (@Noble:2009) proposes the +following directory structure: + +[^bioindatascience]: In this course, we consider bioinformatics as + data science applied to biological or bio-medical data. + +> Directory names are in large typeface, and filenames are in smaller +> typeface. Only a subset of the files are shown here. Note that the +> dates are formatted `<year>-<month>-<day>` so that they can be +> sorted in chronological order. The source code `src/ms-analysis.c` +> is compiled to create `bin/ms-analysis` and is documented in +> `doc/ms-analysis.html`. The `README` files in the data directories +> specify who downloaded the data files from what URL on what +> date. The driver script `results/2009-01-15/runall` automatically +> generates the three subdirectories split1, split2, and split3, +> corresponding to three cross-validation splits. The +> `bin/parse-sqt.py` script is called by both of the `runall` driver +> scripts. + +```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} +knitr::include_graphics("fig/noble-bioinfo-project.png") +``` + +The most important aspect of a well defined and well documented +project directory is to enable someone unfamiliar with the +project[^futureself] to + +1. understand what the project is about, what data are available, what + analyses were run, and what results were produced and, most + importantly to + +2. repeat the analysis over again - with new data, or changing some + analysis parameters. 
+ +[^futureself]: That someone could be, and very likely will be your + future self, a couple of months or years after the analyses were + run. + +### The working directory + +The working directory is an important concept to understand. It is the +place from where R will be looking for and saving the files. When you +write code for your project, it should refer to files in relation to +the root of your working directory and only need files within this +structure. + +Using RStudio projects makes this easy and ensures that your working +directory is set properly. If you need to check it, you can use +`getwd()`. If for some reason your working directory is not what it +should be, you can change it in the RStudio interface by navigating in +the file browser where your working directory should be, clicking +on the blue gear icon `More`, and selecting `Set As Working Directory`. +Alternatively you can use `setwd("/path/to/working/directory")` to +reset your working directory. However, your scripts should not include +this line because it will fail on someone else's computer. + +**Example** + +The schema below represents the working directory `bioc-intro` with the +`data` and `fig_output` sub-directories, and 2 files in the latter: + +``` +bioc-intro/data/ + /fig_output/fig1.pdf + /fig_output/fig2.png +``` + +If we were in the working directory, we could refer to the `fig1.pdf` +file using the relative path `fig_output/fig1.pdf` or the +absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. + +If we were in the `data` directory, we would use the relative path +`../fig_output/fig1.pdf` or the same absolute path +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interacting with R + +The basis of programming is that we write down instructions for the +computer to follow, and then we tell the computer to follow those +instructions. We write, or _code_, instructions in R because it is a +common language that both the computer and we can understand.
We call +the instructions _commands_ and we tell the computer to follow the +instructions by _executing_ (also called _running_) those commands. + +There are two main ways of interacting with R: by using the +**console** or by using **scripts** (plain text files that contain +your code). The console pane (in RStudio, the bottom left panel) is +the place where commands written in the R language can be typed and +executed immediately by the computer. It is also where the results +will be shown for commands that have been executed. You can type +commands directly into the console and press `Enter` to execute those +commands, but they will be forgotten when you close the session. + +Because we want our code and workflow to be reproducible, it is better +to type the commands we want in the script editor, and save the +script. This way, there is a complete record of what we did, and +anyone (including our future selves!) can easily replicate the +results on their computer. Note, however, that merely typing the commands +in the script does not automatically _run_ them - they still need to +be sent to the console for execution. + +RStudio allows you to execute commands directly from the script editor +by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will +work, too). The command on the current line in the script (indicated +by the cursor) or all of the commands in the currently selected text +will be sent to the console and executed when you press `Ctrl` + +`Enter`. You can find other keyboard shortcuts in this RStudio +cheatsheet about the RStudio +IDE. + +At some point in your analysis you may want to check the content of a +variable or the structure of an object, without necessarily keeping a +record of it in your script. You can type these commands and execute +them directly in the console. RStudio provides the `Ctrl` + `1` and +`Ctrl` + `2` shortcuts, which allow you to jump between the script and the +console panes.
+ +If R is ready to accept commands, the R console shows a `>` prompt. If +it receives a command (by typing, copy-pasting or sending from the script +editor using `Ctrl` + `Enter`), R will try to execute it, and when +ready, will show the results and come back with a new `>` prompt to +wait for new commands. + +If R is still waiting for you to enter more data because it isn't +complete yet, the console will show a `+` prompt. It means that you +haven't finished entering a complete command. This is because you have +not 'closed' a parenthesis or quotation, i.e. you don't have the same +number of left-parentheses as right-parentheses, or the same number of +opening and closing quotation marks. When this happens, and you +thought you finished typing your command, click inside the console +window and press `Esc`; this will cancel the incomplete command and +return you to the `>` prompt. + +## How to learn more during and after the course? + +The material we cover during this course will give you an initial +taste of how you can use R to analyse data for your own +research. However, you will need to learn more to do advanced +operations such as cleaning your dataset, using statistical methods, +or creating beautiful graphics[^inthiscoure]. The best way to become +proficient and efficient at R, as with any other tool, is to use it to +address your actual research questions. As a beginner, it can feel +daunting to have to write a script from scratch, and given that many +people make their code available online, modifying existing code to +suit your purpose might make it easier for you to get started. + +[^inthiscoure]: We will introduce most of these (except statistics) + here, but will only manage to scratch the surface of the wealth of + what is possible to do with R. 
+ +```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} +knitr::include_graphics("fig/kitten-try-things.jpg") +``` + +## Seeking help + +### Use the built-in RStudio help interface to search for more information on R functions + +```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudiohelp.png") +``` + +One of the fastest ways to get help, is to use the RStudio help +interface. This panel by default can be found at the lower right hand +panel of RStudio. As seen in the screenshot, by typing the word +"Mean", RStudio tries to also give a number of suggestions that you +might be interested in. The description is then shown in the display +window. + +### I know the name of the function I want to use, but I'm not sure how to use it + +If you need help with a specific function, let's say `barplot()`, you +can type: + +```{r, eval=FALSE, purl=TRUE} +?barplot +``` + +If you just need to remind yourself of the names of the arguments, you can use: + +```{r, eval=FALSE, purl=TRUE} +args(lm) +``` + +### I want to use a function that does X, there must be a function for it but I don't know which one... + +If you are looking for a function to do a particular task, you can use the +`help.search()` function, which is called by the double question mark `??`. +However, this only looks through the installed packages for help pages with a +match to your search request + +```{r, eval=FALSE, purl=TRUE} +??kruskal +``` + +If you can't find what you are looking for, you can use +the [rdocumentation.org](https://www.rdocumentation.org) website that searches +through the help files across all packages available. + +Finally, a generic Google or internet search "R \<task>" will often either send +you to the appropriate package documentation or a helpful forum where someone +else has already asked your question. + +### I am stuck... 
I get an error message that I don't understand + +Start by googling the error message. However, this doesn't always work very well +because often, package developers rely on the error catching provided by R. You +end up with general error messages that might not be very helpful to diagnose a +problem (e.g. "subscript out of bounds"). If the message is very generic, you +might also include the name of the function or package you're using in your +query. + +However, you should check Stack Overflow. Search using the `[r]` tag. Most +questions have already been answered, but the challenge is to use the right +words in the search to find the +answers: + +[http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) + +The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can +also be dense for people with little programming experience but it is a good +place to understand the underpinnings of the R language. + +The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical +but it is full of useful information. + +### Asking for help + +The key to receiving help from someone is for them to rapidly grasp +your problem. You should make it as easy as possible to pinpoint where +the issue might be. + +Try to use the correct words to describe your problem. For instance, a +package is not the same thing as a library. Most people will +understand what you meant, but others have really strong feelings +about the difference in meaning. The key point is that it can make +things confusing for people trying to help you. Be as precise as +possible when describing your problem. + +If possible, try to reduce what doesn't work to a simple _reproducible +example_. If you can reproduce the problem using a very small data +frame instead of your 50000 rows and 10000 columns one, provide the +small one with the description of your problem. 
When appropriate, try +to generalise what you are doing so even people who are not in your +field can understand the question. For instance instead of using a +subset of your real dataset, create a small (3 columns, 5 rows) +generic one. For more information on how to write a reproducible +example see this article by Hadley +Wickham. + +To share an object with someone else, if it's relatively small, you +can use the function `dput()`. It will output R code that can be used +to recreate the exact same object as the one in memory: + +```{r, results="show", purl=TRUE} +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame +dput(head(iris)) +``` + +If the object is larger, provide either the raw file (i.e., your CSV +file) with your script up to the point of the error (and after +removing everything that is not relevant to your +issue). Alternatively, in particular if your question is not related +to a data frame, you can save any R object to a file[^export]: + +```{r, eval=FALSE, purl=FALSE} +saveRDS(iris, file="/tmp/iris.rds") +``` + +The content of this file is however not human readable and cannot be +posted directly on Stack Overflow. Instead, it can be sent to someone +by email who can read it with the `readRDS()` command (here it is +assumed that the downloaded file is in a `Downloads` folder in the +user's home directory): + +```{r, eval=FALSE, purl=FALSE} +some_data <- readRDS(file="~/Downloads/iris.rds") +``` + +Last, but certainly not least, **always include the output of `sessionInfo()`** +as it provides critical information about your platform, the versions of R and +the packages that you are using, and other information that can be very helpful +to understand your problem. + +```{r, results="show", purl=TRUE} +sessionInfo() +``` + +### Where to ask for help? + +- The person sitting next to you during the course. 
Don't hesitate to + talk to your neighbour during the workshop, compare your answers, + and ask for help. +- Your friendly colleagues: if you know someone with more experience + than you, they might be able and willing to help you. +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if + your question hasn't been answered before and is well crafted, + chances are you will get an answer in less than 5 min. Remember to + follow their guidelines on how to ask a good + question. +- The R-help mailing + list: it is read by a + lot of people (including most of the R core team), a lot of people + post to it, but the tone can be pretty dry, and it is not always + very welcoming to new users. If your question is valid, you are + likely to get an answer very fast but don't expect that it will come + with smiley faces. Also, here more than anywhere else, be sure to + use correct vocabulary (otherwise you might get an answer pointing + to the misuse of your words rather than answering your + question). You will also have more success if your question is about + a base function rather than a specific package. +- If your question is about a specific package, see if there is a + mailing list for it. Usually it's included in the DESCRIPTION file + of the package that can be accessed using + `packageDescription("name-of-package")`. You may also want to try to + email the author of the package directly, or open an issue on the + code repository (e.g., GitHub). +- There are also some topic-specific mailing lists (GIS, + phylogenetics, etc...), the complete list is + [here](https://www.r-project.org/mail.html). + +### More resources + +- The [Posting Guide](https://www.r-project.org/posting-guide.html) for + the R mailing lists. + +- How to ask for R + help + useful guidelines. + +- This blog post by Jon + Skeet + has quite comprehensive advice on how to ask programming questions. 
+ +- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package + is very helpful to create reproducible examples when asking for + help. The rOpenSci community call "How to ask questions so they get + answered" (GitHub + link and video + recording) includes a presentation of + the reprex package and of its philosophy. + +## R packages + +### Loading packages + +As we have seen above, R packages play a fundamental role in R. To +make use of a package's functionality, assuming it is installed, we +first need to load it. This is done with the +`library()` function. Below, we load `ggplot2`. + +```{r loadp, eval=FALSE, purl=TRUE} +library("ggplot2") +``` + +### Installing packages + +The default package repository is The _Comprehensive R Archive +Network_ (CRAN), and any package that is available on CRAN can be +installed with the `install.packages()` function. Below, for example, +we install the `dplyr` package that we will learn about later. + +```{r craninstall, eval=FALSE, purl=TRUE} +install.packages("dplyr") +``` + +This command will install the `dplyr` package as well as all its +dependencies, i.e. all the packages that it relies on to function. + +Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +namely `BiocManager`, that can be installed from CRAN with + +```{r, eval=FALSE, purl=TRUE} +install.packages("BiocManager") +``` + +Individual packages such as `SummarizedExperiment` (we will use it +later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be +installed with `BiocManager::install`.
+ +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("SummarizedExperiment") +BiocManager::install("DESeq2") +``` + +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Start using R and RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 03ac481e870d47283dba929a62445be3185719df Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:31 +0900 Subject: [PATCH 008/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 665 ++++++++++++++++++++++++++++ 1 file changed, 665 insertions(+) create mode 100644 locale/es/episodes/20-r-rstudio.Rmd diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd new file mode 100644 index 000000000..f5f2e0aef --- /dev/null +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -0,0 +1,665 @@ +--- +source: Rmd +title: R and RStudio +teaching: 30 +exercises: 0 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. +- Use the built-in RStudio help interface to search for more information on R functions. +- Demonstrate how to provide sufficient information for troubleshooting with the R user community. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are R and RStudio? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## What is R? What is RStudio? + +The term [R](https://www.r-project.org/) is used to refer to the +_programming language_, the _environment for statistical computing_ +and _the software_ that interprets the scripts written using it. + +[RStudio](https://rstudio.com) is currently a very popular way to not +only write your R scripts but also to interact with the R +software[^plainr]. To function correctly, RStudio needs R and +therefore both need to be installed on your computer. + +[^plainr]: As opposed to using R directly from the command line + console. There exist other software that interface and integrate + with R, but RStudio is particularly well suited for beginners + while providing numerous very advanced features. + +The RStudio IDE Cheat +Sheet +provides much more information than will be covered here, but can be +useful to learn keyboard shortcuts and discover new features. + +## Why learn R? + +### R does not involve lots of pointing and clicking, and that's a good thing + +The learning curve might be steeper than with other software, but with +R, the results of your analysis do not rely on remembering a +succession of pointing and clicking, but instead on a series of +written commands, and that's a good thing! So, if you want to redo +your analysis because you collected more data, you don't have to +remember which button you clicked in which order to obtain your +results; you just have to run your script again. + +Working with scripts makes the steps you used in your analysis clear, +and the code you write can be inspected by someone else who can give +you feedback and spot mistakes. + +Working with scripts forces you to have a deeper understanding of what +you are doing, and facilitates your learning and comprehension of the +methods you use. 
+ +### R code is great for reproducibility + +Reproducibility means that someone else (including your future self) can +obtain the same results from the same dataset when using the same +analysis code. + +R integrates with other tools to generate manuscripts or reports from your +code. If you collect more data, or fix a mistake in your dataset, the +figures and the statistical tests in your manuscript or report are updated +automatically. + +An increasing number of journals and funding agencies expect analyses +to be reproducible, so knowing R will give you an edge with these +requirements. + +### R is interdisciplinary and extensible + +With 10000+ packages[^whatarepkgs] that can be installed to extend its +capabilities, R provides a framework that allows you to combine +statistical approaches from many scientific disciplines to best suit +the analytical framework you need to analyse your data. For instance, +R has packages for image analysis, GIS, time series, population +genetics, and a lot more. + +[^whatarepkgs]: i.e. add-ons that confer R with new functionality, + such as bioinformatics data analysis. + +```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/cran.png") +``` + +### R works on data of all shapes and sizes + +The skills you learn with R scale easily with the size of your +dataset. Whether your dataset has hundreds or millions of lines, it +won't make much difference to you. + +R is designed for data analysis. It comes with special data structures +and data types that make handling of missing data and statistical +factors convenient. + +R can connect to spreadsheets, databases, and many other data formats, +on your computer or on the web. 
+ +### R produces high-quality graphics + +The plotting functionalities in R are extensive, and allow you to adjust +any aspect of your graph to convey most effectively the message from +your data. + +### R has a large and welcoming community + +Thousands of people use R daily. Many of them are willing to help you +through mailing lists and websites such as Stack +Overflow, or on the RStudio +community. These broad user communities +extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). + +### Not only is R free, but it is also open-source and cross-platform + +Anyone can inspect the source code to see how R works. Because of this +transparency, there is less chance for mistakes, and if you (or +someone else) find some, you can report and fix bugs. + +## Knowing your way around RStudio + +Let's start by learning about [RStudio](https://www.rstudio.com/), +which is an Integrated Development Environment (IDE) for working with +R. + +The RStudio IDE open-source product is free under the Affero General +Public License (AGPL) v3. +The RStudio IDE is also available with a commercial license and +priority email support from Posit, Inc. + +We will use the RStudio IDE to write code, navigate the files on our +computer, inspect the variables we are going to create, and visualise +the plots we will generate. RStudio can also be used for other things +(e.g., version control, developing packages, writing Shiny apps) that +we will not cover during the workshop. + +```{r, results="markup", fig.cap="RStudio interface screenshot. 
Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/rstudio-screenshot.png") +``` + +The RStudio window is divided into 4 "Panes": + +- the **Source** for your scripts and documents (top-left, in the + default layout) +- your **Environment/History** (top-right), +- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and +- the R **Console** (bottom-left). + +The placement of these panes and their content can be customised (see +menu, `Tools -> Global Options -> Pane Layout`). + +One of the advantages of using RStudio is that all the information you +need to write code is available in a single window. Additionally, with +many shortcuts, **autocompletion**, and **highlighting** for the major +file types you use while developing in R, RStudio will make typing +easier and less error-prone. + +## Getting set up + +It is good practice to keep a set of related data, analyses, and text +self-contained in a single folder, called the **working +directory**. All of the scripts within this folder can then use +**relative paths** to files that indicate where inside the project a +file is located (as opposed to absolute paths, which point to where a +file is on a specific computer). Working this way makes it a lot +easier to move your project around on your computer and share it with +others without worrying about whether or not the underlying scripts +will still work. + +RStudio provides a helpful set of tools to do this through its "Projects" +interface, which not only creates a working directory for you, but also remembers +its location (allowing you to quickly navigate to it) and optionally preserves +custom settings and open files to make it easier to resume work after a +break. Go through the steps for creating an "R Project" for this +tutorial below. + +1. Start RStudio. +2. Under the `File` menu, click on `New project`. 
Choose `New directory`, then + `New project`. +3. Enter a name for this new folder (or "directory"), and choose a + convenient location for it. This will be your **working directory** + for this session (or whole course) (e.g., `bioc-intro`). +4. Click on `Create project`. +5. (Optional) Set Preferences to 'Never' save workspace in RStudio. + +RStudio's default preferences generally work well, but saving a workspace to +.RData can be cumbersome, especially if you are working with larger datasets. +To turn that off, go to Tools --> 'Global Options' and select the 'Never' option +for 'Save workspace to .RData' on exit. + +```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudio-preferences.png") +``` + +To avoid character encoding issues between Windows and other operating +systems, we are +going to set UTF-8 by default: + +```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/utf8.png") +``` + +### Organizing your working directory + +Using a consistent folder structure across your projects will help keep things +organised, and will also make it easy to find/file things in the future. This +can be especially helpful when you have multiple projects. In general, you may +create directories (folders) for **scripts**, **data**, and **documents**. + +- **`data/`** Use this folder to store your raw data and intermediate + datasets you may create for the need of a particular analysis. For + the sake of transparency and + [provenance](https://en.wikipedia.org/wiki/Provenance), you should + _always_ keep a copy of your raw data accessible and do as much of + your data cleanup and preprocessing programmatically (i.e., with + scripts, rather than manually) as possible. 
Separating raw data + from processed data is also a good idea. For example, you could + have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept + separate from a `data/processed/tree.survey.csv` file generated by + the `scripts/01.preprocess.tree_survey.R` script. +- **`documents/`** This would be a place to keep outlines, drafts, + and other text. +- **`scripts/`** (or `src`) This would be the location to keep your R + scripts for different analyses or plotting, and potentially a + separate folder for your functions (more on that later). + +You may want additional directories or subdirectories depending on +your project needs, but these should form the backbone of your working +directory. + +```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/working-directory-structure.png") +``` + +For this course, we will need a `data/` folder to store our raw data, +and we will use `data_output/` for when we learn how to export data as +CSV files, and `fig_output/` folder for the figures that we will save. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: create your project directory structure + +Under the `Files` tab on the right of the screen, click on `New Folder` and +create a folder named `data` within your newly created working directory +(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at +your R console.) Repeat these operations to create a `data_output/` and a +`fig_output` folders. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We are going to keep the script in the root of our working directory +because we are only going to use one file and it will make things +easier. 
+ +Your working directory should now look like this: + +```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +``` + +**Project management** is also applicable to bioinformatics projects, +of course[^bioindatascience]. William Noble (@Noble:2009) proposes the +following directory structure: + +[^bioindatascience]: In this course, we consider bioinformatics as + data science applied to biological or bio-medical data. + +> Directory names are in large typeface, and filenames are in smaller +> typeface. Only a subset of the files are shown here. Note that the +> dates are formatted `<year>-<month>-<day>` so that they can be +> sorted in chronological order. The source code `src/ms-analysis.c` +> is compiled to create `bin/ms-analysis` and is documented in +> `doc/ms-analysis.html`. The `README` files in the data directories +> specify who downloaded the data files from what URL on what +> date. The driver script `results/2009-01-15/runall` automatically +> generates the three subdirectories split1, split2, and split3, +> corresponding to three cross-validation splits. The +> `bin/parse-sqt.py` script is called by both of the `runall` driver +> scripts. + +```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} +knitr::include_graphics("fig/noble-bioinfo-project.png") +``` + +The most important aspect of a well defined and well documented +project directory is to enable someone unfamiliar with the +project[^futureself] to + +1. understand what the project is about, what data are available, what + analyses were run, and what results were produced and, most + importantly to + +2. repeat the analysis over again - with new data, or changing some + analysis parameters. 
+ +[^futureself]: That someone could be, and very likely will be your + future self, a couple of months or years after the analyses were + run. + +### The working directory + +The working directory is an important concept to understand. It is the +place from where R will be looking for and saving the files. When you +write code for your project, it should refer to files in relation to +the root of your working directory and only need files within this +structure. + +Using RStudio projects makes this easy and ensures that your working +directory is set properly. If you need to check it, you can use +`getwd()`. If for some reason your working directory is not what it +should be, you can change it in the RStudio interface by navigating in +the file browser where your working directory should be, clicking +on the blue gear icon `More`, and selecting `Set As Working Directory`. +Alternatively you can use `setwd("/path/to/working/directory")` to +reset your working directory. However, your scripts should not include +this line because it will fail on someone else's computer. + +**Example** + +The schema below represents the working directory `bioc-intro` with the +`data` and `fig_output` sub-directories, and 2 files in the latter: + +``` +bioc-intro/data/ + /fig_output/fig1.pdf + /fig_output/fig2.png +``` + +If we were in the working directory, we could refer to the `fig1.pdf` +file using the relative path `fig_output/fig1.pdf` or the +absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. + +If we were in the `data` directory, we would use the relative path +`../fig_output/fig1.pdf` or the same absolute path +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interacting with R + +The basis of programming is that we write down instructions for the +computer to follow, and then we tell the computer to follow those +instructions. We write, or _code_, instructions in R because it is a +common language that both the computer and we can understand.
We call +the instructions _commands_ and we tell the computer to follow the +instructions by _executing_ (also called _running_) those commands. + +There are two main ways of interacting with R: by using the +**console** or by using **scripts** (plain text files that contain +your code). The console pane (in RStudio, the bottom left panel) is +the place where commands written in the R language can be typed and +executed immediately by the computer. It is also where the results +will be shown for commands that have been executed. You can type +commands directly into the console and press `Enter` to execute those +commands, but they will be forgotten when you close the session. + +Because we want our code and workflow to be reproducible, it is better +to type the commands we want in the script editor, and save the +script. This way, there is a complete record of what we did, and +anyone (including our future selves!) can easily replicate the +results on their computer. Note, however, that merely typing the commands +in the script does not automatically _run_ them - they still need to +be sent to the console for execution. + +RStudio allows you to execute commands directly from the script editor +by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will +work, too). The command on the current line in the script (indicated +by the cursor) or all of the commands in the currently selected text +will be sent to the console and executed when you press `Ctrl` + +`Enter`. You can find other keyboard shortcuts in the +RStudio IDE +cheatsheet. + +At some point in your analysis you may want to check the content of a +variable or the structure of an object, without necessarily keeping a +record of it in your script. You can type these commands and execute +them directly in the console. RStudio provides the `Ctrl` + `1` and +`Ctrl` + `2` shortcuts, which allow you to jump between the script and the +console panes.
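As a small illustration of the kind of quick check described above, the lines below could be typed directly at the console rather than saved in the script. This is only a sketch: the object name `weights` and its values are invented for the example, while `str()`, `class()` and `length()` are base R functions.

```r
# A throwaway object, invented purely for illustration
weights <- c(50, 60, 65, 82)

# Quick inspections that usually do not need to be kept in the script
str(weights)      # compact display of the object's structure
class(weights)    # the kind of object we are dealing with
length(weights)   # how many elements it holds
```

Checks like these are useful while exploring, but anything the final results depend on should still live in the saved script.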
+ +If R is ready to accept commands, the R console shows a `>` prompt. If +it receives a command (by typing, copy-pasting or sending from the script +editor using `Ctrl` + `Enter`), R will try to execute it, and when +ready, will show the results and come back with a new `>` prompt to +wait for new commands. + +If R is still waiting for you to enter more data because it isn't +complete yet, the console will show a `+` prompt. It means that you +haven't finished entering a complete command. This is because you have +not 'closed' a parenthesis or quotation, i.e. you don't have the same +number of left-parentheses as right-parentheses, or the same number of +opening and closing quotation marks. When this happens, and you +thought you finished typing your command, click inside the console +window and press `Esc`; this will cancel the incomplete command and +return you to the `>` prompt. + +## How to learn more during and after the course? + +The material we cover during this course will give you an initial +taste of how you can use R to analyse data for your own +research. However, you will need to learn more to do advanced +operations such as cleaning your dataset, using statistical methods, +or creating beautiful graphics[^inthiscoure]. The best way to become +proficient and efficient at R, as with any other tool, is to use it to +address your actual research questions. As a beginner, it can feel +daunting to have to write a script from scratch, and given that many +people make their code available online, modifying existing code to +suit your purpose might make it easier for you to get started. + +[^inthiscoure]: We will introduce most of these (except statistics) + here, but will only manage to scratch the surface of the wealth of + what is possible to do with R. 
+ +```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} +knitr::include_graphics("fig/kitten-try-things.jpg") +``` + +## Seeking help + +### Use the built-in RStudio help interface to search for more information on R functions + +```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudiohelp.png") +``` + +One of the fastest ways to get help is to use the RStudio help +interface. By default, this panel is found in the lower right-hand +pane of RStudio. As seen in the screenshot, when you type the word +"Mean", RStudio also offers a number of suggestions that you +might be interested in. The description is then shown in the display +window. + +### I know the name of the function I want to use, but I'm not sure how to use it + +If you need help with a specific function, let's say `barplot()`, you +can type: + +```{r, eval=FALSE, purl=TRUE} +?barplot +``` + +If you just need to remind yourself of the names of the arguments, you can use: + +```{r, eval=FALSE, purl=TRUE} +args(lm) +``` + +### I want to use a function that does X, there must be a function for it but I don't know which one... + +If you are looking for a function to do a particular task, you can use the +`help.search()` function, which is called by the double question mark `??`. +However, this only looks through the installed packages for help pages with a +match to your search request: + +```{r, eval=FALSE, purl=TRUE} +??kruskal +``` + +If you can't find what you are looking for, you can use +the [rdocumentation.org](https://www.rdocumentation.org) website that searches +through the help files across all packages available. + +Finally, a generic Google or internet search "R \<task>" will often either send +you to the appropriate package documentation or a helpful forum where someone +else has already asked your question. + +### I am stuck...
I get an error message that I don't understand + +Start by googling the error message. However, this doesn't always work very well +because often, package developers rely on the error catching provided by R. You +end up with general error messages that might not be very helpful to diagnose a +problem (e.g. "subscript out of bounds"). If the message is very generic, you +might also include the name of the function or package you're using in your +query. + +You should also check Stack Overflow. Search using the `[r]` tag. Most +questions have already been answered, but the challenge is to use the right +words in the search to find the +answers: + +[https://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) + +The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can +be dense for people with little programming experience but it is a good +place to understand the underpinnings of the R language. + +The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical +but it is full of useful information. + +### Asking for help + +The key to receiving help from someone is for them to rapidly grasp +your problem. You should make it as easy as possible to pinpoint where +the issue might be. + +Try to use the correct words to describe your problem. For instance, a +package is not the same thing as a library. Most people will +understand what you meant, but others have really strong feelings +about the difference in meaning. The key point is that it can make +things confusing for people trying to help you. Be as precise as +possible when describing your problem. + +If possible, try to reduce what doesn't work to a simple _reproducible +example_. If you can reproduce the problem using a very small data +frame instead of your 50000 rows and 10000 columns one, provide the +small one with the description of your problem.
When appropriate, try +to generalise what you are doing so even people who are not in your +field can understand the question. For instance instead of using a +subset of your real dataset, create a small (3 columns, 5 rows) +generic one. For more information on how to write a reproducible +example see this article by Hadley +Wickham. + +To share an object with someone else, if it's relatively small, you +can use the function `dput()`. It will output R code that can be used +to recreate the exact same object as the one in memory: + +```{r, results="show", purl=TRUE} +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame +dput(head(iris)) +``` + +If the object is larger, provide either the raw file (i.e., your CSV +file) with your script up to the point of the error (and after +removing everything that is not relevant to your +issue). Alternatively, in particular if your question is not related +to a data frame, you can save any R object to a file[^export]: + +```{r, eval=FALSE, purl=FALSE} +saveRDS(iris, file="/tmp/iris.rds") +``` + +The content of this file is however not human readable and cannot be +posted directly on Stack Overflow. Instead, it can be sent to someone +by email who can read it with the `readRDS()` command (here it is +assumed that the downloaded file is in a `Downloads` folder in the +user's home directory): + +```{r, eval=FALSE, purl=FALSE} +some_data <- readRDS(file="~/Downloads/iris.rds") +``` + +Last, but certainly not least, **always include the output of `sessionInfo()`** +as it provides critical information about your platform, the versions of R and +the packages that you are using, and other information that can be very helpful +to understand your problem. + +```{r, results="show", purl=TRUE} +sessionInfo() +``` + +### Where to ask for help? + +- The person sitting next to you during the course. 
Don't hesitate to + talk to your neighbour during the workshop, compare your answers, + and ask for help. +- Your friendly colleagues: if you know someone with more experience + than you, they might be able and willing to help you. +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if + your question hasn't been answered before and is well crafted, + chances are you will get an answer in less than 5 min. Remember to + follow their guidelines on how to ask a good + question. +- The R-help mailing + list: it is read by a + lot of people (including most of the R core team), a lot of people + post to it, but the tone can be pretty dry, and it is not always + very welcoming to new users. If your question is valid, you are + likely to get an answer very fast but don't expect that it will come + with smiley faces. Also, here more than anywhere else, be sure to + use correct vocabulary (otherwise you might get an answer pointing + to the misuse of your words rather than answering your + question). You will also have more success if your question is about + a base function rather than a specific package. +- If your question is about a specific package, see if there is a + mailing list for it. Usually it's included in the DESCRIPTION file + of the package that can be accessed using + `packageDescription("name-of-package")`. You may also want to try to + email the author of the package directly, or open an issue on the + code repository (e.g., GitHub). +- There are also some topic-specific mailing lists (GIS, + phylogenetics, etc...), the complete list is + [here](https://www.r-project.org/mail.html). + +### More resources + +- The [Posting Guide](https://www.r-project.org/posting-guide.html) for + the R mailing lists. + +- How to ask for R + help + useful guidelines. + +- This blog post by Jon + Skeet + has quite comprehensive advice on how to ask programming questions. 
+ +- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package + is very helpful to create reproducible examples when asking for + help. The rOpenSci community call "How to ask questions so they get + answered" (GitHub + link and video + recording) includes a presentation of + the reprex package and of its philosophy. + +## R packages + +### Loading packages + +As we have seen above, R packages play a fundamental role in R. To +make use of a package's functionality, assuming it is installed, we +first need to load it. This is done with the +`library()` function. Below, we load `ggplot2`. + +```{r loadp, eval=FALSE, purl=TRUE} +library("ggplot2") +``` + +### Installing packages + +The default package repository is the _Comprehensive R Archive +Network_ (CRAN), and any package that is available on CRAN can be +installed with the `install.packages()` function. Below, for example, +we install the `dplyr` package that we will learn about later. + +```{r craninstall, eval=FALSE, purl=TRUE} +install.packages("dplyr") +``` + +This command will install the `dplyr` package as well as all its +dependencies, i.e. all the packages that it relies on to function. + +Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +namely `BiocManager`, that can be installed from CRAN with + +```{r, eval=FALSE, purl=TRUE} +install.packages("BiocManager") +``` + +Individual packages such as `SummarizedExperiment` (we will use it +later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be +installed with `BiocManager::install`.
+ +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("SummarizedExperiment") +BiocManager::install("DESeq2") +``` + +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Start using R and RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 3a0fa6c649960855301f5c3dd8b1829221817a33 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:32 +0900 Subject: [PATCH 009/334] New translations 20-r-rstudio.md (Japanese) --- locale/ja/episodes/20-r-rstudio.Rmd | 665 ++++++++++++++++++++++++++++ 1 file changed, 665 insertions(+) create mode 100644 locale/ja/episodes/20-r-rstudio.Rmd diff --git a/locale/ja/episodes/20-r-rstudio.Rmd b/locale/ja/episodes/20-r-rstudio.Rmd new file mode 100644 index 000000000..6806f894e --- /dev/null +++ b/locale/ja/episodes/20-r-rstudio.Rmd @@ -0,0 +1,665 @@ +--- +source: Rmd +title: R and RStudio +teaching: 30 +exercises: 0 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. +- Use the built-in RStudio help interface to search for more information on R functions. +- Demonstrate how to provide sufficient information for troubleshooting with the R user community. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are R and RStudio?
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## What is R? What is RStudio? + +The term [R](https://www.r-project.org/) is used to refer to the +_programming language_, the _environment for statistical computing_ +and _the software_ that interprets the scripts written using it. + +[RStudio](https://rstudio.com) is currently a very popular way to not +only write your R scripts but also to interact with the R +software[^plainr]. To function correctly, RStudio needs R and +therefore both need to be installed on your computer. + +[^plainr]: As opposed to using R directly from the command line + console. There exist other software that interface and integrate + with R, but RStudio is particularly well suited for beginners + while providing numerous very advanced features. + +The RStudio IDE Cheat +Sheet +provides much more information than will be covered here, but can be +useful to learn keyboard shortcuts and discover new features. + +## Why learn R? + +### R does not involve lots of pointing and clicking, and that's a good thing + +The learning curve might be steeper than with other software, but with +R, the results of your analysis do not rely on remembering a +succession of pointing and clicking, but instead on a series of +written commands, and that's a good thing! So, if you want to redo +your analysis because you collected more data, you don't have to +remember which button you clicked in which order to obtain your +results; you just have to run your script again. + +Working with scripts makes the steps you used in your analysis clear, +and the code you write can be inspected by someone else who can give +you feedback and spot mistakes. + +Working with scripts forces you to have a deeper understanding of what +you are doing, and facilitates your learning and comprehension of the +methods you use. 
+ +### R code is great for reproducibility + +Reproducibility means that someone else (including your future self) can +obtain the same results from the same dataset when using the same +analysis code. + +R integrates with other tools to generate manuscripts or reports from your +code. If you collect more data, or fix a mistake in your dataset, the +figures and the statistical tests in your manuscript or report are updated +automatically. + +An increasing number of journals and funding agencies expect analyses +to be reproducible, so knowing R will give you an edge with these +requirements. + +### R is interdisciplinary and extensible + +With 10000+ packages[^whatarepkgs] that can be installed to extend its +capabilities, R provides a framework that allows you to combine +statistical approaches from many scientific disciplines to best suit +the analytical framework you need to analyse your data. For instance, +R has packages for image analysis, GIS, time series, population +genetics, and a lot more. + +[^whatarepkgs]: i.e. add-ons that confer R with new functionality, + such as bioinformatics data analysis. + +```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/cran.png") +``` + +### R works on data of all shapes and sizes + +The skills you learn with R scale easily with the size of your +dataset. Whether your dataset has hundreds or millions of lines, it +won't make much difference to you. + +R is designed for data analysis. It comes with special data structures +and data types that make handling of missing data and statistical +factors convenient. + +R can connect to spreadsheets, databases, and many other data formats, +on your computer or on the web. 
+ +### R produces high-quality graphics + +The plotting functionalities in R are extensive, and allow you to adjust +any aspect of your graph to convey most effectively the message from +your data. + +### R has a large and welcoming community + +Thousands of people use R daily. Many of them are willing to help you +through mailing lists and websites such as Stack +Overflow, or on the RStudio +community. These broad user communities +extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). + +### Not only is R free, but it is also open-source and cross-platform + +Anyone can inspect the source code to see how R works. Because of this +transparency, there is less chance for mistakes, and if you (or +someone else) find some, you can report and fix bugs. + +## Knowing your way around RStudio + +Let's start by learning about [RStudio](https://www.rstudio.com/), +which is an Integrated Development Environment (IDE) for working with +R. + +The RStudio IDE open-source product is free under the Affero General +Public License (AGPL) v3. +The RStudio IDE is also available with a commercial license and +priority email support from Posit, Inc. + +We will use the RStudio IDE to write code, navigate the files on our +computer, inspect the variables we are going to create, and visualise +the plots we will generate. RStudio can also be used for other things +(e.g., version control, developing packages, writing Shiny apps) that +we will not cover during the workshop. + +```{r, results="markup", fig.cap="RStudio interface screenshot. 
Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/rstudio-screenshot.png") +``` + +The RStudio window is divided into 4 "Panes": + +- the **Source** for your scripts and documents (top-left, in the + default layout) +- your **Environment/History** (top-right), +- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and +- the R **Console** (bottom-left). + +The placement of these panes and their content can be customised (see +menu, `Tools -> Global Options -> Pane Layout`). + +One of the advantages of using RStudio is that all the information you +need to write code is available in a single window. Additionally, with +many shortcuts, **autocompletion**, and **highlighting** for the major +file types you use while developing in R, RStudio will make typing +easier and less error-prone. + +## Getting set up + +It is good practice to keep a set of related data, analyses, and text +self-contained in a single folder, called the **working +directory**. All of the scripts within this folder can then use +**relative paths** to files that indicate where inside the project a +file is located (as opposed to absolute paths, which point to where a +file is on a specific computer). Working this way makes it a lot +easier to move your project around on your computer and share it with +others without worrying about whether or not the underlying scripts +will still work. + +RStudio provides a helpful set of tools to do this through its "Projects" +interface, which not only creates a working directory for you, but also remembers +its location (allowing you to quickly navigate to it) and optionally preserves +custom settings and open files to make it easier to resume work after a +break. Go through the steps for creating an "R Project" for this +tutorial below. + +1. Start RStudio. +2. Under the `File` menu, click on `New project`. 
Choose `New directory`, then + `New project`. +3. Enter a name for this new folder (or "directory"), and choose a + convenient location for it. This will be your **working directory** + for this session (or whole course) (e.g., `bioc-intro`). +4. Click on `Create project`. +5. (Optional) Set Preferences to 'Never' save workspace in RStudio. + +RStudio's default preferences generally work well, but saving a workspace to +.RData can be cumbersome, especially if you are working with larger datasets. +To turn that off, go to Tools --> 'Global Options' and select the 'Never' option +for 'Save workspace to .RData' on exit. + +```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudio-preferences.png") +``` + +To avoid character encoding issues between Windows and other operating +systems, we are +going to set UTF-8 by default: + +```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/utf8.png") +``` + +### Organizing your working directory + +Using a consistent folder structure across your projects will help keep things +organised, and will also make it easy to find/file things in the future. This +can be especially helpful when you have multiple projects. In general, you may +create directories (folders) for **scripts**, **data**, and **documents**. + +- **`data/`** Use this folder to store your raw data and intermediate + datasets you may create for the need of a particular analysis. For + the sake of transparency and + [provenance](https://en.wikipedia.org/wiki/Provenance), you should + _always_ keep a copy of your raw data accessible and do as much of + your data cleanup and preprocessing programmatically (i.e., with + scripts, rather than manually) as possible. 
Separating raw data + from processed data is also a good idea. For example, you could + have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept + separate from a `data/processed/tree.survey.csv` file generated by + the `scripts/01.preprocess.tree_survey.R` script. +- **`documents/`** This would be a place to keep outlines, drafts, + and other text. +- **`scripts/`** (or `src`) This would be the location to keep your R + scripts for different analyses or plotting, and potentially a + separate folder for your functions (more on that later). + +You may want additional directories or subdirectories depending on +your project needs, but these should form the backbone of your working +directory. + +```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/working-directory-structure.png") +``` + +For this course, we will need a `data/` folder to store our raw data, +and we will use `data_output/` for when we learn how to export data as +CSV files, and `fig_output/` folder for the figures that we will save. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: create your project directory structure + +Under the `Files` tab on the right of the screen, click on `New Folder` and +create a folder named `data` within your newly created working directory +(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at +your R console.) Repeat these operations to create a `data_output/` and a +`fig_output` folders. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We are going to keep the script in the root of our working directory +because we are only going to use one file and it will make things +easier. 
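The three folders used in this course can also be created in one step from the console instead of through the `Files` tab. The following is a minimal sketch, not part of the lesson itself: the helper name `create_project_dirs()` is hypothetical, and it assumes that `root` points at the project's working directory (the default `"."` means the current one). It relies only on the base R functions `file.path()`, `dir.exists()` and `dir.create()`.

```r
# Hypothetical helper (not part of R or of this lesson): create the
# course's sub-directories under `root` if they do not exist yet.
create_project_dirs <- function(root = ".",
                                dirs = c("data", "data_output", "fig_output")) {
  paths <- file.path(root, dirs)
  for (p in paths) {
    if (!dir.exists(p)) {
      # recursive = TRUE also creates `root` itself when it is missing
      dir.create(p, recursive = TRUE)
    }
  }
  invisible(paths)
}

# Example: build the structure in a temporary location
project_root <- file.path(tempdir(), "bioc-intro")
create_project_dirs(project_root)
list.files(project_root)
```

Calling `create_project_dirs()` with no arguments from inside an RStudio project would create the folders in the project's own working directory; running it twice is harmless because existing folders are skipped.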
+ +Your working directory should now look like this: + +```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +``` + +**Project management** is also applicable to bioinformatics projects, +of course[^bioindatascience]. William Noble (@Noble:2009) proposes the +following directory structure: + +[^bioindatascience]: In this course, we consider bioinformatics as + data science applied to biological or bio-medical data. + +> Directory names are in large typeface, and filenames are in smaller +> typeface. Only a subset of the files are shown here. Note that the +> dates are formatted `<year>-<month>-<day>` so that they can be +> sorted in chronological order. The source code `src/ms-analysis.c` +> is compiled to create `bin/ms-analysis` and is documented in +> `doc/ms-analysis.html`. The `README` files in the data directories +> specify who downloaded the data files from what URL on what +> date. The driver script `results/2009-01-15/runall` automatically +> generates the three subdirectories split1, split2, and split3, +> corresponding to three cross-validation splits. The +> `bin/parse-sqt.py` script is called by both of the `runall` driver +> scripts. + +```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} +knitr::include_graphics("fig/noble-bioinfo-project.png") +``` + +The most important aspect of a well defined and well documented +project directory is to enable someone unfamiliar with the +project[^futureself] to + +1. understand what the project is about, what data are available, what + analyses were run, and what results were produced and, most + importantly to + +2. repeat the analysis over again - with new data, or changing some + analysis parameters. 
+ +[^futureself]: That someone could be, and very likely will be, your + future self, a couple of months or years after the analyses were + run. + +### The working directory + +The working directory is an important concept to understand. It is the +place where R looks for files and saves them. When you +write code for your project, it should refer to files in relation to +the root of your working directory and only need files within this +structure. + +Using RStudio projects makes this easy and ensures that your working +directory is set properly. If you need to check it, you can use +`getwd()`. If for some reason your working directory is not what it +should be, you can change it in the RStudio interface by navigating in +the file browser to where your working directory should be, clicking +on the blue gear icon `More`, and selecting `Set As Working Directory`. +Alternatively you can use `setwd("/path/to/working/directory")` to +reset your working directory. However, your scripts should not include +this line because it will fail on someone else's computer. + +**Example** + +The schema below represents the working directory `bioc-intro` with the +`data` and `fig_output` sub-directories, and 2 files in the latter: + +``` +bioc-intro/data/ + /fig_output/fig1.pdf + /fig_output/fig2.png +``` + +If we were in the working directory, we could refer to the `fig1.pdf` +file using the relative path `fig_output/fig1.pdf` or the +absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. + +If we were in the `data` directory, we would use the relative path +`../fig_output/fig1.pdf` or the same absolute path +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interacting with R + +The basis of programming is that we write down instructions for the +computer to follow, and then we tell the computer to follow those +instructions. We write, or _code_, instructions in R because it is a +common language that both the computer and we can understand.
We call +the instructions _commands_ and we tell the computer to follow the +instructions by _executing_ (also called _running_) those commands. + +There are two main ways of interacting with R: by using the +**console** or by using **scripts** (plain text files that contain +your code). The console pane (in RStudio, the bottom left panel) is +the place where commands written in the R language can be typed and +executed immediately by the computer. It is also where the results +will be shown for commands that have been executed. You can type +commands directly into the console and press `Enter` to execute those +commands, but they will be forgotten when you close the session. + +Because we want our code and workflow to be reproducible, it is better +to type the commands we want in the script editor, and save the +script. This way, there is a complete record of what we did, and +anyone (including our future selves!) can easily replicate the +results on their computer. Note, however, that merely typing the commands +in the script does not automatically _run_ them - they still need to +be sent to the console for execution. + +RStudio allows you to execute commands directly from the script editor +by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will +work, too). The command on the current line in the script (indicated +by the cursor) or all of the commands in the currently selected text +will be sent to the console and executed when you press `Ctrl` + +`Enter`. You can find other keyboard shortcuts in the +RStudio IDE +cheatsheet. + +At some point in your analysis you may want to check the content of a +variable or the structure of an object, without necessarily keeping a +record of it in your script. You can type these commands and execute +them directly in the console. RStudio provides the `Ctrl` + `1` and +`Ctrl` + `2` shortcuts, which allow you to jump between the script and the +console panes.
+ +If R is ready to accept commands, the R console shows a `>` prompt. If +it receives a command (by typing, copy-pasting or sending from the script +editor using `Ctrl` + `Enter`), R will try to execute it, and when +ready, will show the results and come back with a new `>` prompt to +wait for new commands. + +If R is still waiting for you to enter more data because it isn't +complete yet, the console will show a `+` prompt. It means that you +haven't finished entering a complete command. This is because you have +not 'closed' a parenthesis or quotation, i.e. you don't have the same +number of left-parentheses as right-parentheses, or the same number of +opening and closing quotation marks. When this happens, and you +thought you finished typing your command, click inside the console +window and press `Esc`; this will cancel the incomplete command and +return you to the `>` prompt. + +## How to learn more during and after the course? + +The material we cover during this course will give you an initial +taste of how you can use R to analyse data for your own +research. However, you will need to learn more to do advanced +operations such as cleaning your dataset, using statistical methods, +or creating beautiful graphics[^inthiscoure]. The best way to become +proficient and efficient at R, as with any other tool, is to use it to +address your actual research questions. As a beginner, it can feel +daunting to have to write a script from scratch, and given that many +people make their code available online, modifying existing code to +suit your purpose might make it easier for you to get started. + +[^inthiscoure]: We will introduce most of these (except statistics) + here, but will only manage to scratch the surface of the wealth of + what is possible to do with R. 
+ +```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} +knitr::include_graphics("fig/kitten-try-things.jpg") +``` + +## Seeking help + +### Use the built-in RStudio help interface to search for more information on R functions + +```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudiohelp.png") +``` + +One of the fastest ways to get help is to use the RStudio help +interface. By default, this panel can be found in the lower right-hand +corner of RStudio. As seen in the screenshot, when you type the word +"Mean", RStudio also offers a number of suggestions that you +might be interested in. The description is then shown in the display +window. + +### I know the name of the function I want to use, but I'm not sure how to use it + +If you need help with a specific function, let's say `barplot()`, you +can type: + +```{r, eval=FALSE, purl=TRUE} +?barplot +``` + +If you just need to remind yourself of the names of the arguments, you can use: + +```{r, eval=FALSE, purl=TRUE} +args(lm) +``` + +### I want to use a function that does X, there must be a function for it but I don't know which one... + +If you are looking for a function to do a particular task, you can use the +`help.search()` function, which is called by the double question mark `??`. +However, this only looks through the installed packages for help pages with a +match to your search request. + +```{r, eval=FALSE, purl=TRUE} +??kruskal +``` + +If you can't find what you are looking for, you can use +the [rdocumentation.org](https://www.rdocumentation.org) website that searches +through the help files across all packages available. + +Finally, a generic Google or internet search "R \<task>" will often either send +you to the appropriate package documentation or a helpful forum where someone +else has already asked your question. + +### I am stuck...
I get an error message that I don't understand + +Start by googling the error message. However, this doesn't always work very well +because often, package developers rely on the error catching provided by R. You +end up with general error messages that might not be very helpful to diagnose a +problem (e.g. "subscript out of bounds"). If the message is very generic, you +might also include the name of the function or package you're using in your +query. + +You should also check Stack Overflow. Search using the `[r]` tag. Most +questions have already been answered, but the challenge is to use the right +words in the search to find the +answers: + +[http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) + +The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can +also be dense for people with little programming experience but it is a good +place to understand the underpinnings of the R language. + +The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical +but it is full of useful information. + +### Asking for help + +The key to receiving help from someone is for them to rapidly grasp +your problem. You should make it as easy as possible to pinpoint where +the issue might be. + +Try to use the correct words to describe your problem. For instance, a +package is not the same thing as a library. Most people will +understand what you meant, but others have really strong feelings +about the difference in meaning. The key point is that it can make +things confusing for people trying to help you. Be as precise as +possible when describing your problem. + +If possible, try to reduce what doesn't work to a simple _reproducible +example_. If you can reproduce the problem using a very small data +frame instead of your 50000 rows and 10000 columns one, provide the +small one with the description of your problem.
When appropriate, try +to generalise what you are doing so even people who are not in your +field can understand the question. For instance instead of using a +subset of your real dataset, create a small (3 columns, 5 rows) +generic one. For more information on how to write a reproducible +example see this article by Hadley +Wickham. + +To share an object with someone else, if it's relatively small, you +can use the function `dput()`. It will output R code that can be used +to recreate the exact same object as the one in memory: + +```{r, results="show", purl=TRUE} +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame +dput(head(iris)) +``` + +If the object is larger, provide either the raw file (i.e., your CSV +file) with your script up to the point of the error (and after +removing everything that is not relevant to your +issue). Alternatively, in particular if your question is not related +to a data frame, you can save any R object to a file[^export]: + +```{r, eval=FALSE, purl=FALSE} +saveRDS(iris, file="/tmp/iris.rds") +``` + +The content of this file is however not human readable and cannot be +posted directly on Stack Overflow. Instead, it can be sent to someone +by email who can read it with the `readRDS()` command (here it is +assumed that the downloaded file is in a `Downloads` folder in the +user's home directory): + +```{r, eval=FALSE, purl=FALSE} +some_data <- readRDS(file="~/Downloads/iris.rds") +``` + +Last, but certainly not least, **always include the output of `sessionInfo()`** +as it provides critical information about your platform, the versions of R and +the packages that you are using, and other information that can be very helpful +to understand your problem. + +```{r, results="show", purl=TRUE} +sessionInfo() +``` + +### Where to ask for help? + +- The person sitting next to you during the course. 
Don't hesitate to + talk to your neighbour during the workshop, compare your answers, + and ask for help. +- Your friendly colleagues: if you know someone with more experience + than you, they might be able and willing to help you. +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if + your question hasn't been answered before and is well crafted, + chances are you will get an answer in less than 5 min. Remember to + follow their guidelines on how to ask a good + question. +- The R-help mailing + list: it is read by a + lot of people (including most of the R core team), a lot of people + post to it, but the tone can be pretty dry, and it is not always + very welcoming to new users. If your question is valid, you are + likely to get an answer very fast but don't expect that it will come + with smiley faces. Also, here more than anywhere else, be sure to + use correct vocabulary (otherwise you might get an answer pointing + to the misuse of your words rather than answering your + question). You will also have more success if your question is about + a base function rather than a specific package. +- If your question is about a specific package, see if there is a + mailing list for it. Usually it's included in the DESCRIPTION file + of the package that can be accessed using + `packageDescription("name-of-package")`. You may also want to try to + email the author of the package directly, or open an issue on the + code repository (e.g., GitHub). +- There are also some topic-specific mailing lists (GIS, + phylogenetics, etc...), the complete list is + [here](https://www.r-project.org/mail.html). + +### More resources + +- The [Posting Guide](https://www.r-project.org/posting-guide.html) for + the R mailing lists. + +- How to ask for R + help + useful guidelines. + +- This blog post by Jon + Skeet + has quite comprehensive advice on how to ask programming questions. 
+ +- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package + is very helpful to create reproducible examples when asking for + help. The rOpenSci community call "How to ask questions so they get + answered" (GitHub + link and video + recording) includes a presentation of + the reprex package and of its philosophy. + +## R packages + +### Loading packages + +As we have seen above, R packages play a fundamental role in R. To +make use of a package's functionality, assuming it is installed, we +first need to load it. This is done with the +`library()` function. Below, we load `ggplot2`. + +```{r loadp, eval=FALSE, purl=TRUE} +library("ggplot2") +``` + +### Installing packages + +The default package repository is The _Comprehensive R Archive +Network_ (CRAN), and any package that is available on CRAN can be +installed with the `install.packages()` function. Below, for example, +we install the `dplyr` package that we will learn about later. + +```{r craninstall, eval=FALSE, purl=TRUE} +install.packages("dplyr") +``` + +This command will install the `dplyr` package as well as all its +dependencies, i.e. all the packages that it relies on to function. + +Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +namely `BiocManager`, that can be installed from CRAN with + +```{r, eval=FALSE, purl=TRUE} +install.packages("BiocManager") +``` + +Individual packages such as `SummarizedExperiment` (we will use it +later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be +installed with `BiocManager::install`.
+ +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("SummarizedExperiment") +BiocManager::install("DESeq2") +``` + +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Start using R and RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: From c11f2a51c76452fddae96da44c76b505417ffcf7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:34 +0900 Subject: [PATCH 010/334] New translations 20-r-rstudio.md (Portuguese) --- locale/pt/episodes/20-r-rstudio.Rmd | 640 ++++++++++++++++++++++++++++ 1 file changed, 640 insertions(+) create mode 100644 locale/pt/episodes/20-r-rstudio.Rmd diff --git a/locale/pt/episodes/20-r-rstudio.Rmd b/locale/pt/episodes/20-r-rstudio.Rmd new file mode 100644 index 000000000..8bdebbd77 --- /dev/null +++ b/locale/pt/episodes/20-r-rstudio.Rmd @@ -0,0 +1,640 @@ +--- +source: Rmd +title: R and RStudio +teaching: 30 +exercises: 0 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. +- Use the built-in RStudio help interface to search for more information on R functions. +- Demonstrate how to provide sufficient information for troubleshooting with the R user community. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are R and RStudio? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## What is R? What is RStudio? + +The term [R](https://www.r-project.org/) is used to refer to the +_programming language_, the _environment for statistical computing_ +and the _software_ that interprets the scripts written in that language. + +[RStudio](https://rstudio.com) is currently a very popular way not only to +write your R scripts but also to interact with the R software +[^plainr]. To function correctly, RStudio needs R, and both have to be installed on your computer. + +[^plainr]: As opposed to using R directly from the command line + console. Other software exists that interfaces and integrates + with R, but RStudio is particularly well suited for beginners while also offering very advanced features. + +The RStudio IDE Cheat +Sheet +provides much more information than will be covered here, but can be +useful to learn keyboard shortcuts and discover new features. + +## Why learn R? + +### R does not involve lots of pointing and clicking, and that's a good thing + +The learning curve might be steeper than with other software, but with +R, the results of your analysis do not rely on remembering a +succession of pointing and clicking, but instead on a series of +written commands, and that's a good thing! So, if you want to redo +your analysis because you collected more data, you don't have to +remember which button you clicked in which order to obtain your +results; you just have to run your script again. + +Working with scripts makes the steps you used in your analysis clear, +and the code you write can be inspected by someone else who can give you +feedback and spot mistakes. + +Working with scripts forces you to have a deeper understanding of what you +are doing, and facilitates your learning and comprehension of the methods you use.
+ +### R code is great for reproducibility + +Reproducibility means that someone else (including your future self) can +obtain the same results from the same dataset when using the same +analysis code. + +R integrates with other tools to generate manuscripts or reports from your code. If you collect more data, or fix a mistake in your dataset, the figures +and the statistical tests in your manuscript or report are updated +automatically. + +An increasing number of journals and funding agencies expect analyses +to be reproducible, so knowing R will give you an edge with these requirements. + +### R is interdisciplinary and extensible + +With 10000+ packages[^whatarepkgs] that can be installed to extend its +capabilities, R provides a framework that allows you to combine +statistical approaches from many scientific disciplines to best suit +the analytical framework you need to analyse your data. For instance, +R has packages for image analysis, GIS, time series, +population genetics, and a lot more. + +[^whatarepkgs]: i.e. add-ons that confer R with new functionality, + such as bioinformatics data analysis. + +```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/cran.png") +``` + +### R works on data of all shapes and sizes + +The skills you learn with R scale easily with the size of your +dataset. Whether your dataset has hundreds or millions of lines, +it won't make much difference to R. + +R is designed for data analysis. It comes with special data structures +and data types that make handling of missing data and statistical factors convenient.
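The point about missing data and statistical factors can be illustrated with a few lines of base R. This is only a minimal sketch: the `weights` and `treatment` vectors are made-up values introduced for illustration, not data from this lesson.

```r
# A numeric vector with one missing measurement, encoded as NA
# (hypothetical values, for illustration only).
weights <- c(4.2, NA, 5.1)

# By default a missing value propagates through the computation;
# na.rm = TRUE removes it before the mean is computed.
mean(weights)
mean(weights, na.rm = TRUE)

# A factor stores categorical data together with its set of levels.
treatment <- factor(c("control", "treated", "control"))
levels(treatment)
table(treatment)
```

Both `NA` handling and factors are covered in more depth once vectors and data frames have been introduced.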
+ +R can connect to spreadsheets, databases, and many other data formats, +on your computer or on the web. + +### R produces high-quality graphics + +The plotting functionalities in R are extensive, and allow you to adjust +any aspect of your graph to convey most effectively the message from +your data. + +### R has a large and welcoming community + +Thousands of people use R daily. Many of them are willing to help you +through mailing lists and websites such as Stack +Overflow, or on the [RStudio community](https://community.rstudio.com/). These broad user communities +extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for the analysis and comprehension of current and emerging biological data. This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). + +### Not only is R free, but it is also open-source and cross-platform + +Anyone can inspect the source code to see how R works. Because of this +transparency, there is less chance for mistakes, and if you (or +someone else) find some, you can report and fix them. + +## Knowing your way around RStudio + +Let's start by learning about [RStudio](https://www.rstudio.com/), +which is an Integrated Development Environment (IDE) for working with +R. + +The RStudio IDE open-source product is free under the Affero General +Public License (AGPL) v3. +The RStudio IDE is also available with a commercial license and +priority email support from Posit, Inc. + +We will use the RStudio IDE to write code, navigate the files on our +computer, inspect the variables we are going to create, and visualise +the plots we will generate.
RStudio can also be used for other things +(e.g., version control, developing packages, writing Shiny apps) that +we will not cover during the workshop. + +```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/rstudio-screenshot.png") +``` + +The RStudio window is divided into 4 "Panes": + +- the **Source** for your scripts and documents (top-left, in the + default layout) +- your **Environment/History** (top-right), +- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and +- the R **Console** (bottom-left). + +The placement of these panes and their content can be customised (see +menu, `Tools -> Global Options -> Layout`). + +One of the advantages of using RStudio is that all the information you +need to write code is available in a single window. Additionally, with +many shortcuts, **autocompletion**, and **highlighting** for the major file types +you use while developing in R, RStudio will make typing +easier and less error-prone. + +## Getting set up + +It is good practice to keep a set of related data, analyses, and text +self-contained in a single folder, called the **working +directory**. All of the scripts within this folder can then use +**relative paths** to files that indicate where inside the project a file +is located (as opposed to absolute paths, which point to where a file +is on a specific computer). Working this way makes it a lot +easier to move your project around on your computer and share it with +others without worrying about whether or not the underlying scripts +will still work.
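The difference between relative and absolute paths can be sketched in a few lines of R. This is an illustration only: the file name `data/example.csv` is hypothetical and does not need to exist for the code to run.

```r
# A relative path is interpreted starting from the current working directory.
# "data/example.csv" is a hypothetical file name used for illustration.
relative_path <- file.path("data", "example.csv")

# Prefixing it with the working directory gives an absolute path, which is
# only meaningful on this particular computer.
absolute_path <- file.path(getwd(), relative_path)

print(relative_path)
print(absolute_path)
```

Only the relative form belongs in a script that is meant to be shared; the absolute form will differ on every computer.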
+ +RStudio provides a helpful set of tools to do this through its "Projects" interface, which not only creates a working directory for you, but also remembers +its location (allowing you to quickly navigate to it) and optionally preserves +custom settings and open files to make it easier to resume work after a break. Go through the steps below to create an "R Project" for this tutorial. + +1. Start RStudio. +2. Under the `File` menu, click on `New project`. Choose `New directory`, then + `New project`. +3. Enter a name for this new folder (or "directory"), and choose a + convenient location for it. This will be your **working directory** + for this session (or whole course) (e.g., `bioc-intro`). +4. Click on `Create project`. +5. (Optional) Set Preferences to 'Never' save workspace in RStudio. + +RStudio's default preferences generally work well, but saving a workspace to +.RData can be cumbersome, especially if you are working with larger datasets. +To turn that off, go to Tools --> 'Global Options' and select the 'Never' option +for 'Save workspace to .RData' on exit. + +```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudio-preferences.png") +``` + +To avoid character encoding issues between Windows and other operating +systems, we are +going to set UTF-8 by default: + +```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future.
(Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/utf8.png") +``` + +### Organizing your working directory + +Using a consistent folder structure across your projects will help keep things +organised, and will also make it easy to find/file things in the future. This +can be especially helpful when you have multiple projects. In general, you may create directories (folders) for **scripts**, **data**, and **documents**. + +- **`data/`** Use this folder to store your raw data and intermediate + datasets you may create for the need of a particular analysis. For + the sake of transparency and + [provenance](https://en.wikipedia.org/wiki/Provenance), you should + _always_ keep a copy of your raw data accessible and do as much of + your data cleanup and preprocessing programmatically (i.e., with + scripts, rather than manually) as possible. Separating raw data + from processed data is also a good idea. For example, you could + have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept + separate from a `data/processed/tree.survey.csv` file generated by + the `scripts/01.preprocess.tree_survey.R` script. +- **`documents/`** This would be a place to keep outlines, drafts, + and other text. +- **`scripts/`** (or `src`) This would be the location to keep your R + scripts for different analyses or plotting, and potentially a + separate folder for your functions (more on that later). + +You may want additional directories or subdirectories depending on +your project needs, but these should form the backbone of your working +directory.
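The folder layout described above can also be created from R rather than through the file browser. The following is a minimal sketch, assuming a throwaway project root under `tempdir()`; the name `demo-project` is hypothetical, and in a real project the `dir.create()` calls would be run from the project's own working directory.

```r
# Hypothetical project root inside R's temporary directory, so that the
# example does not touch any real project.
project_root <- file.path(tempdir(), "demo-project")

# Folders suggested in the text: raw and processed data, documents, scripts.
folders <- c("data/raw", "data/processed", "documents", "scripts")

for (folder in folders) {
  # recursive = TRUE also creates missing parent directories (e.g. data/);
  # showWarnings = FALSE keeps the call quiet if the folder already exists.
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Creating the folders from a script has the advantage that the same structure can be reproduced for the next project by running the same few lines.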
+ +```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/working-directory-structure.png") +``` + +For this course, we will need a `data/` folder to store our raw data, +and we will use `data_output/` for when we learn how to export data as +CSV files, and `fig_output/` folder for the figures that we will save. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: create your project directory structure + +Under the `Files` tab on the right of the screen, click on `New Folder` and +create a folder named `data` within your newly created working directory +(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at +your R console.) Repeat these operations to create a `data_output/` and a +`fig_output/` folder. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We are going to keep the script in the root of our working directory +because we are only going to use one file and it will make things +easier. + +Your working directory should now look like this: + +```{r, results="markup", fig.cap="How it should look at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +``` + +**Project management** is also applicable to bioinformatics projects, +of course[^bioindatascience]. William Noble (@Noble:2009) proposes the +following directory structure: + +[^bioindatascience]: In this course, we consider bioinformatics as + data science applied to biological or bio-medical data. + +> Directory names are in large typeface, and filenames are in smaller +> typeface. Only a subset of the files are shown here. Note that the +> dates are formatted `<year>-<month>-<day>` so that they can be +> sorted in chronological order.
The source code `src/ms-analysis.c` +> is compiled to create `bin/ms-analysis` and is documented in +> `doc/ms-analysis.html`. The `README` files in the data directories +> specify who downloaded the data files from what URL on what +> date. The driver script `results/2009-01-15/runall` automatically +> generates the three subdirectories split1, split2, and split3, +> corresponding to three cross-validation splits. The +> `bin/parse-sqt.py` script is called by both of the `runall` driver +> scripts. + +```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} +knitr::include_graphics("fig/noble-bioinfo-project.png") +``` + +The most important aspect of a well defined and well documented +project directory is to enable someone unfamiliar with the +project[^futureself] to + +1. understand what the project is about, what data are available, what + analyses were run, and what results were produced and, most + importantly to + +2. repeat the analysis over again - with new data, or changing some + analysis parameters. + +[^futureself]: That someone could be, and very likely will be your + future self, a couple of months or years after the analyses were + run.
Se, por alguma razão, o seu diretório de trabalho não é o que +deveria ser, pode alterá-lo na interface do RStudio navegando nas pastas onde o seu diretório de trabalho deveria estar, e clicando +no ícone da engrenagem azul `Mais`, e selecionanando `Definir como Diretório de Trabalho`. +Alternativamente, você pode utilizar `setwd("/caminho/para/diretório de trabalho")` para +redefinir o seu diretório de trabalho. No entanto, os seus scripts não devem incluir +esta linha porque irá falhar no computador de outra pessoa. + +**Example** + +The schema below represents the working directory `bioc-intro` with the +`data` and `fig_output` sub-directories, and 2 files in the latter: + +``` +bioc-intro/data/ + /fig_output/fig1.pdf + /fig_output/fig2.png +``` + +If we were in the working directory, we could refer to the `fig1.pdf` +file using the relative path `bioc-intro/fig_output/fig1.pdf` or the +absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. + +Se estivéssemos no diretório `data`, utilizaríamos o caminho relativo +`../fig_output/fig1.pdf` ou o mesmo caminho absoluto +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interacting with R + +The basis of programming is that we write down instructions for the +computer to follow, and then we tell the computer to follow those +instructions. Escrevemos, ou _codificamos_, instruções em R porque é uma +linguagem comum que tanto o computador como nós podemos compreender. Chamamos +as instruções de _comandos_ e dizemos ao computador para seguir as instruções, _executando_ (também chamado de _running_) esses comandos. + +Existem duas formas principais de interagir com o R: utilizando a +**console** ou utilizando **scripts** (arquivos de texto simples que contêm +o seu código). O painel de console (em RStudio, o painel inferior esquerdo) é +o local onde comandos escritos no idioma R podem ser digitados e +são executados imediatamente pelo computador. 
It is also where the results +will be shown for commands that have been executed. You can type commands +directly into the console and press `Enter` to execute those commands, but they will be forgotten when you close the session. + +Because we want our code and workflow to be reproducible, it is better +to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and +anyone (including our future selves!) can easily replicate the +results on their computer. Note, however, that merely typing the commands +in the script does not automatically _run_ them - they still need +to be sent to the console for execution. + +RStudio allows you to execute commands directly from the script editor +by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will +work, too). The command on the current line in the script (indicated +by the cursor) or all of the commands in the currently selected text +will be sent to the console and executed when you press `Ctrl` + +`Enter`. You can find other keyboard shortcuts in the RStudio IDE +cheat sheet. + +At some point in your analysis you may want to check the content of a +variable or the structure of an object, without necessarily keeping a +record of it in your script. You can type these commands and execute +them directly in the console. RStudio provides the `Ctrl` + `1` and +`Ctrl` + `2` shortcuts, which allow you to jump between the script and the console panes. + +If R is ready to accept commands, the R console shows a `>` prompt. If +it receives a command (by typing, copy-pasting or sending from the script editor +using `Ctrl` + `Enter`), R will try to execute it, and when +ready, will show the results and come back with a new `>` prompt to +wait for new commands. + +If R is still waiting for you to enter more input because the command isn't +complete yet, the console will show a `+` prompt. It means that you +haven't finished entering a complete command.
This is because you have not +'closed' a parenthesis or quotation, i.e. you don't have the same +number of left-parentheses as right-parentheses, or the same number of +opening and closing quotation marks. When this happens, and you +thought you finished typing your command, click inside the console window +and press `Esc`; this will cancel the incomplete command and +return you to the `>` prompt. + +## How to learn more during and after the course? + +The material we cover during this course will give you an +initial taste of how you can use R to analyse data for your own research. However, you will need to learn more to do advanced +operations such as cleaning your dataset, using statistical methods, +or creating beautiful graphics[^inthiscoure]. The best way to become +proficient and efficient at R, as with any other tool, is to use it to +address your actual research questions. As a beginner, it can feel +daunting to have to write a script from scratch, and given that many +people make their code available online, modifying existing code to +suit your purpose might make it easier for you to get started. + +[^inthiscoure]: We will introduce most of these (except statistics) + here, but will only manage to scratch the surface of the wealth of + what is possible to do with R. + +```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} +knitr::include_graphics("fig/kitten-try-things.jpg") +``` + +## Seeking help + +### Use the built-in RStudio help interface to search for more information on R functions + +```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudiohelp.png") +``` + +One of the fastest ways to get help is to use the RStudio help +interface. By default, this panel can be found in the lower right-hand +corner of RStudio.
As seen in the screenshot, typing the word +"Mean" makes RStudio also offer a number of suggestions that you +might be interested in. The description is then shown in the display +window. + +### I know the name of the function I want to use, but I'm not sure how to use it + +If you need help with a specific function, let's say `barplot()`, you +can type: + +```{r, eval=FALSE, purl=TRUE} +?barplot +``` + +If you just need to remind yourself of the names of the arguments, you can use: + +```{r, eval=FALSE, purl=TRUE} +args(lm) +``` + +### I want to use a function that does X, there must be a function for it but I don't know which one... + +If you are looking for a function to do a particular task, you can use the +`help.search()` function, which is called by the double question mark `??`. +However, this only looks through the installed packages for help pages with a +match to your search request: + +```{r, eval=FALSE, purl=TRUE} +??kruskal +``` + +If you can't find what you are looking for, you can use +the [rdocumentation.org](https://www.rdocumentation.org) website, which searches +through the help files across all packages available. + +Finally, a generic Google or internet search for "R \<task>" will often send you +either to the appropriate package documentation or to a helpful forum where someone +else has already asked your question. + +### I am stuck... I get an error message that I don't understand + +Start by googling the error message. However, this doesn't always work very well +because package developers often rely on the error catching provided by R. You end up +with general error messages that might not be very helpful to diagnose a problem +(e.g. "subscript out of bounds"). If the message is very generic, +you might also include the name of the function or package you're using in your query. + +You should also check Stack Overflow. Search using the `[r]` tag.
Most +questions have already been answered, but the challenge is to use the right +words in the search to find the +answers: + +[http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) + +The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can +be dense for people with little programming experience, but it is a good +place to understand the underpinnings of the R language. + +The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is also dense and technical, +but it is full of useful information. + +### Asking for help + +The key to receiving help from someone is for them to rapidly grasp +your problem. You should make it as easy as possible to pinpoint where +the issue might be. + +Try to use the correct words to describe your problem. For instance, a package +is not the same thing as a library. Most people will +understand what you meant, but others have really strong feelings +about the difference in meaning. The key point is that it can make +things confusing for people trying to help you. Be as precise as +possible when describing your problem. + +If possible, try to reduce what doesn't work to a simple _reproducible +example_. If you can reproduce the problem using a small sample of data +instead of your file with 50000 rows and 10000 columns, provide the +sample with the description of your problem. When appropriate, try to +generalise what you are doing so that even people who are not in your field +can understand the question. For instance, instead of using a subset +of your real dataset, create a small (3 columns, 5 rows) +generic one. For more information on how to write a reproducible example, see [this article by Hadley +Wickham](https://adv-r.had.co.nz/Reproducibility.html). + +To share an object with someone else, if it's relatively small, you +can use the function `dput()`.
It will output R code that can be used +to recreate the exact same object as the one in memory: + +```{r, results="show", purl=TRUE} +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame +dput(head(iris)) +``` + +If the object is larger, provide the raw file (i.e., your CSV +file) together with your script up to the point of the error (and after +removing everything that is not relevant to your +issue). Alternatively, in particular if your question is not related +to a data frame, you can save any R object to a file[^export]: + +```{r, eval=FALSE, purl=FALSE} +saveRDS(iris, file="/tmp/iris.rds") +``` + +The content of this file is, however, not human readable and cannot be +posted directly on Stack Overflow. Instead, it can be emailed to someone, +who can then read it with the `readRDS()` command (here it is +assumed that the downloaded file is in a `Downloads` folder in the +user's home directory): + +```{r, eval=FALSE, purl=FALSE} +some_data <- readRDS(file="~/Downloads/iris.rds") +``` + +Last, but certainly not least, **always include the output of `sessionInfo()`** +as it provides critical information about your platform, the versions of R and +the packages that you are using, and other information that can be very helpful +to understand your problem. + +```{r, results="show", purl=TRUE} +sessionInfo() +``` + +### Where to ask for help? + +- The person sitting next to you during the course. Don't hesitate to + talk to your neighbour during the workshop, compare your answers, + and ask for help. +- Your friendly colleagues: if you know someone with more experience + than you, they might be able and willing to help you. +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if + your question hasn't been answered before and is well crafted, + chances are you will get an answer in less than 5 min.
Remember to + follow their guidelines on [how to ask a good question](https://stackoverflow.com/help/how-to-ask). +- The R-help mailing + list: it is read by a + lot of people (including most of the R core team), a lot of people + post to it, but the tone can be pretty dry, and it is not always + very welcoming to new users. If your question is valid, you are + likely to get an answer very fast, but don't expect it to come + with smiley faces. Also, here more than anywhere else, be sure to + use correct vocabulary (otherwise you might get an answer pointing + to the misuse of your words rather than answering your question). You will also have more success if your question is about + a base R function rather than a specific package. +- If your question is about a specific package, see if there is a + mailing list for it. Usually it's included in the DESCRIPTION file + of the package, which can be accessed using + `packageDescription("name-of-package")`. You may also want to try + emailing the author of the package directly, or opening an issue on the code repository + (e.g., GitHub). +- There are also some topic-specific mailing lists (GIS, + phylogenetics, etc.); the complete list is + [here](https://www.r-project.org/mail.html). + +### More resources + +- The [Posting Guide](https://www.r-project.org/posting-guide.html) for + the R mailing lists. + +- How to ask for R + help: + useful guidelines. + +- This blog post by Jon + Skeet + has quite comprehensive advice on how to ask programming questions. + +- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package + is very helpful to create reproducible examples when asking for + help. The rOpenSci community call "How to ask questions so they get + answered" (GitHub + link and video + recording) includes a presentation of + the reprex package and of its philosophy.
+ +## R packages + +### Loading packages + +As we have seen above, R packages play a fundamental role in R. To +make use of a package's functionality, assuming it is installed, we +first need to load it. This is done with the +`library()` function. Below, we load `ggplot2`. + +```{r loadp, eval=FALSE, purl=TRUE} +library("ggplot2") +``` + +### Installing packages + +The default package repository is the _Comprehensive R Archive +Network_ (CRAN), and any package that is available on CRAN can be +installed with the `install.packages()` function. Below, for example, +we install the `dplyr` package that we will learn about later. + +```{r craninstall, eval=FALSE, purl=TRUE} +install.packages("dplyr") +``` + +This command will install the `dplyr` package as well as all its +dependencies, i.e. all the packages that it relies on to function. + +Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +namely `BiocManager`, that can be installed from CRAN with: + +```{r, eval=FALSE, purl=TRUE} +install.packages("BiocManager") +``` + +Individual packages such as `SummarizedExperiment` (we will use it +later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be +installed with `BiocManager::install()`. + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("SummarizedExperiment") +BiocManager::install("DESeq2") +``` + +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you, ask whether you want to `Update all/some/none? [a/s/n]:`, and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded.
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Start using R and RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 450edc3ece5b71413d9ded80cda047f54c6e92b5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:36 +0900 Subject: [PATCH 011/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 665 ++++++++++++++++++++++++++++ 1 file changed, 665 insertions(+) create mode 100644 locale/zh/episodes/20-r-rstudio.Rmd diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd new file mode 100644 index 000000000..ad0b73472 --- /dev/null +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -0,0 +1,665 @@ +--- +source: Rmd +title: R and RStudio +teaching: 30 +exercises: 0 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. +- Use the built-in RStudio help interface to search for more information on R functions. +- Demonstrate how to provide sufficient information for troubleshooting with the R user community. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are R and RStudio? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## What is R? What is RStudio? + +The term [R](https://www.r-project.org/) is used to refer to the +_programming language_, the _environment for statistical computing_ +and _the software_ that interprets the scripts written using it. + +[RStudio](https://rstudio.com) is currently a very popular way to not +only write your R scripts but also to interact with the R +software[^plainr]. To function correctly, RStudio needs R and +therefore both need to be installed on your computer. 
+ +[^plainr]: As opposed to using R directly from the command line + console. There exist other software that interface and integrate + with R, but RStudio is particularly well suited for beginners + while providing numerous very advanced features. + +The RStudio IDE Cheat +Sheet +provides much more information than will be covered here, but can be +useful to learn keyboard shortcuts and discover new features. + +## Why learn R? + +### R does not involve lots of pointing and clicking, and that's a good thing + +The learning curve might be steeper than with other software, but with +R, the results of your analysis do not rely on remembering a +succession of pointing and clicking, but instead on a series of +written commands, and that's a good thing! So, if you want to redo +your analysis because you collected more data, you don't have to +remember which button you clicked in which order to obtain your +results; you just have to run your script again. + +Working with scripts makes the steps you used in your analysis clear, +and the code you write can be inspected by someone else who can give +you feedback and spot mistakes. + +Working with scripts forces you to have a deeper understanding of what +you are doing, and facilitates your learning and comprehension of the +methods you use. + +### R code is great for reproducibility + +Reproducibility means that someone else (including your future self) can +obtain the same results from the same dataset when using the same +analysis code. + +R integrates with other tools to generate manuscripts or reports from your +code. If you collect more data, or fix a mistake in your dataset, the +figures and the statistical tests in your manuscript or report are updated +automatically. + +An increasing number of journals and funding agencies expect analyses +to be reproducible, so knowing R will give you an edge with these +requirements. 
+ +### R is interdisciplinary and extensible + +With 10000+ packages[^whatarepkgs] that can be installed to extend its +capabilities, R provides a framework that allows you to combine +statistical approaches from many scientific disciplines to best suit +the analytical framework you need to analyse your data. For instance, +R has packages for image analysis, GIS, time series, population +genetics, and a lot more. + +[^whatarepkgs]: i.e. add-ons that confer R with new functionality, + such as bioinformatics data analysis. + +```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/cran.png") +``` + +### R works on data of all shapes and sizes + +The skills you learn with R scale easily with the size of your +dataset. Whether your dataset has hundreds or millions of lines, it +won't make much difference to you. + +R is designed for data analysis. It comes with special data structures +and data types that make handling of missing data and statistical +factors convenient. + +R can connect to spreadsheets, databases, and many other data formats, +on your computer or on the web. + +### R produces high-quality graphics + +The plotting functionalities in R are extensive, and allow you to adjust +any aspect of your graph to convey most effectively the message from +your data. + +### R has a large and welcoming community + +Thousands of people use R daily. Many of them are willing to help you +through mailing lists and websites such as Stack +Overflow, or on the RStudio +community. These broad user communities +extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." 
This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). + +### Not only is R free, but it is also open-source and cross-platform + +Anyone can inspect the source code to see how R works. Because of this +transparency, there is less chance for mistakes, and if you (or +someone else) find some, you can report and fix bugs. + +## Knowing your way around RStudio + +Let's start by learning about [RStudio](https://www.rstudio.com/), +which is an Integrated Development Environment (IDE) for working with +R. + +The RStudio IDE open-source product is free under the Affero General +Public License (AGPL) v3. +The RStudio IDE is also available with a commercial license and +priority email support from Posit, Inc. + +We will use the RStudio IDE to write code, navigate the files on our +computer, inspect the variables we are going to create, and visualise +the plots we will generate. RStudio can also be used for other things +(e.g., version control, developing packages, writing Shiny apps) that +we will not cover during the workshop. + +```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/rstudio-screenshot.png") +``` + +The RStudio window is divided into 4 "Panes": + +- the **Source** for your scripts and documents (top-left, in the + default layout) +- your **Environment/History** (top-right), +- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and +- the R **Console** (bottom-left). + +The placement of these panes and their content can be customised (see +menu, `Tools -> Global Options -> Pane Layout`). 
+ +One of the advantages of using RStudio is that all the information you +need to write code is available in a single window. Additionally, with +many shortcuts, **autocompletion**, and **highlighting** for the major +file types you use while developing in R, RStudio will make typing +easier and less error-prone. + +## Getting set up + +It is good practice to keep a set of related data, analyses, and text +self-contained in a single folder, called the **working +directory**. All of the scripts within this folder can then use +**relative paths** to files that indicate where inside the project a +file is located (as opposed to absolute paths, which point to where a +file is on a specific computer). Working this way makes it a lot +easier to move your project around on your computer and share it with +others without worrying about whether or not the underlying scripts +will still work. + +RStudio provides a helpful set of tools to do this through its "Projects" +interface, which not only creates a working directory for you, but also remembers +its location (allowing you to quickly navigate to it) and optionally preserves +custom settings and open files to make it easier to resume work after a +break. Go through the steps for creating an "R Project" for this +tutorial below. + +1. Start RStudio. +2. Under the `File` menu, click on `New project`. Choose `New directory`, then + `New project`. +3. Enter a name for this new folder (or "directory"), and choose a + convenient location for it. This will be your **working directory** + for this session (or whole course) (e.g., `bioc-intro`). +4. Click on `Create project`. +5. (Optional) Set Preferences to 'Never' save workspace in RStudio. + +RStudio's default preferences generally work well, but saving a workspace to +.RData can be cumbersome, especially if you are working with larger datasets. +To turn that off, go to Tools --> 'Global Options' and select the 'Never' option +for 'Save workspace to .RData' on exit. 
+ +```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudio-preferences.png") +``` + +To avoid character encoding issues between Windows and other operating +systems, we are +going to set UTF-8 by default: + +```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/utf8.png") +``` + +### Organizing your working directory + +Using a consistent folder structure across your projects will help keep things +organised, and will also make it easy to find/file things in the future. This +can be especially helpful when you have multiple projects. In general, you may +create directories (folders) for **scripts**, **data**, and **documents**. + +- **`data/`** Use this folder to store your raw data and intermediate + datasets you may create for the need of a particular analysis. For + the sake of transparency and + [provenance](https://en.wikipedia.org/wiki/Provenance), you should + _always_ keep a copy of your raw data accessible and do as much of + your data cleanup and preprocessing programmatically (i.e., with + scripts, rather than manually) as possible. Separating raw data + from processed data is also a good idea. For example, you could + have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept + separate from a `data/processed/tree.survey.csv` file generated by + the `scripts/01.preprocess.tree_survey.R` script. +- **`documents/`** This would be a place to keep outlines, drafts, + and other text. +- **`scripts/`** (or `src`) This would be the location to keep your R + scripts for different analyses or plotting, and potentially a + separate folder for your functions (more on that later). 
+ +You may want additional directories or subdirectories depending on +your project needs, but these should form the backbone of your working +directory. + +```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/working-directory-structure.png") +``` + +For this course, we will need a `data/` folder to store our raw data, +a `data_output/` folder for when we learn how to export data as +CSV files, and a `fig_output/` folder for the figures that we will save. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: create your project directory structure + +Under the `Files` tab on the right of the screen, click on `New Folder` and +create a folder named `data` within your newly created working directory +(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at +your R console.) Repeat these operations to create the `data_output/` and +`fig_output/` folders. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We are going to keep the script in the root of our working directory +because we are only going to use one file and it will make things +easier. + +Your working directory should now look like this: + +```{r, results="markup", fig.cap="How it should look at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +``` + +**Project management** is also applicable to bioinformatics projects, +of course[^bioindatascience]. William Noble (@Noble:2009) proposes the +following directory structure: + +[^bioindatascience]: In this course, we consider bioinformatics as + data science applied to biological or bio-medical data. + +> Directory names are in large typeface, and filenames are in smaller +> typeface. Only a subset of the files are shown here.
Note that the +> dates are formatted `<year>-<month>-<day>` so that they can be +> sorted in chronological order. The source code `src/ms-analysis.c` +> is compiled to create `bin/ms-analysis` and is documented in +> `doc/ms-analysis.html`. The `README` files in the data directories +> specify who downloaded the data files from what URL on what +> date. The driver script `results/2009-01-15/runall` automatically +> generates the three subdirectories split1, split2, and split3, +> corresponding to three cross-validation splits. The +> `bin/parse-sqt.py` script is called by both of the `runall` driver +> scripts. + +```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} +knitr::include_graphics("fig/noble-bioinfo-project.png") +``` + +The most important aspect of a well defined and well documented +project directory is to enable someone unfamiliar with the +project[^futureself] to + +1. understand what the project is about, what data are available, what + analyses were run, and what results were produced and, most + importantly to + +2. repeat the analysis over again - with new data, or changing some + analysis parameters. + +[^futureself]: That someone could be, and very likely will be your + future self, a couple of months or years after the analyses were + run. + +### The working directory + +The working directory is an important concept to understand. It is the +place from where R will be looking for and saving the files. When you +write code for your project, it should refer to files in relation to +the root of your working directory and only need files within this +structure. + +Using RStudio projects makes this easy and ensures that your working +directory is set properly. If you need to check it, you can use +`getwd()`. 
If for some reason your working directory is not what it +should be, you can change it in the RStudio interface by navigating in +the file browser to where your working directory should be, clicking +on the blue gear icon `More`, and selecting `Set As Working Directory`. +Alternatively, you can use `setwd("/path/to/working/directory")` to +reset your working directory. However, your scripts should not include +this line because it will fail on someone else's computer. + +**Example** + +The schema below represents the working directory `bioc-intro` with the +`data` and `fig_output` sub-directories, and 2 files in the latter: + +``` +bioc-intro/data/ + /fig_output/fig1.pdf + /fig_output/fig2.png +``` + +If we were in the working directory, we could refer to the `fig1.pdf` +file using the relative path `fig_output/fig1.pdf` or the +absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. + +If we were in the `data` directory, we would use the relative path +`../fig_output/fig1.pdf` or the same absolute path +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interacting with R + +The basis of programming is that we write down instructions for the +computer to follow, and then we tell the computer to follow those +instructions. We write, or _code_, instructions in R because it is a +common language that both the computer and we can understand. We call +the instructions _commands_ and we tell the computer to follow the +instructions by _executing_ (also called _running_) those commands. + +There are two main ways of interacting with R: by using the +**console** or by using **scripts** (plain text files that contain +your code). The console pane (in RStudio, the bottom left panel) is +the place where commands written in the R language can be typed and +executed immediately by the computer. It is also where the results +will be shown for commands that have been executed.
You can type +commands directly into the console and press `Enter` to execute those +commands, but they will be forgotten when you close the session. + +Because we want our code and workflow to be reproducible, it is better +to type the commands we want in the script editor, and save the +script. This way, there is a complete record of what we did, and +anyone (including our future selves!) can easily replicate the +results on their computer. Note, however, that merely typing the commands +in the script does not automatically _run_ them - they still need to +be sent to the console for execution. + +RStudio allows you to execute commands directly from the script editor +by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will +work, too). The command on the current line in the script (indicated +by the cursor) or all of the commands in the currently selected text +will be sent to the console and executed when you press `Ctrl` + +`Enter`. You can find other keyboard shortcuts in the RStudio +cheat sheet about the RStudio +IDE. + +At some point in your analysis you may want to check the content of a +variable or the structure of an object, without necessarily keeping a +record of it in your script. You can type these commands and execute +them directly in the console. RStudio provides the `Ctrl` + `1` and +`Ctrl` + `2` shortcuts, which allow you to jump between the script and the +console panes. + +If R is ready to accept commands, the R console shows a `>` prompt. If +it receives a command (by typing, copy-pasting or sending from the script +editor using `Ctrl` + `Enter`), R will try to execute it, and when +ready, will show the results and come back with a new `>` prompt to +wait for new commands. + +If R is still waiting for you to enter more input because the command isn't +complete yet, the console will show a `+` prompt. It means that you +haven't finished entering a complete command. This is because you have +not 'closed' a parenthesis or quotation, i.e.
you don't have the same +number of left-parentheses as right-parentheses, or the same number of +opening and closing quotation marks. When this happens, and you +thought you finished typing your command, click inside the console +window and press `Esc`; this will cancel the incomplete command and +return you to the `>` prompt. + +## How to learn more during and after the course? + +The material we cover during this course will give you an initial +taste of how you can use R to analyse data for your own +research. However, you will need to learn more to do advanced +operations such as cleaning your dataset, using statistical methods, +or creating beautiful graphics[^inthiscoure]. The best way to become +proficient and efficient at R, as with any other tool, is to use it to +address your actual research questions. As a beginner, it can feel +daunting to have to write a script from scratch, and given that many +people make their code available online, modifying existing code to +suit your purpose might make it easier for you to get started. + +[^inthiscoure]: We will introduce most of these (except statistics) + here, but will only manage to scratch the surface of the wealth of + what is possible to do with R. + +```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} +knitr::include_graphics("fig/kitten-try-things.jpg") +``` + +## Seeking help + +### Use the built-in RStudio help interface to search for more information on R functions + +```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} +knitr::include_graphics("fig/rstudiohelp.png") +``` + +One of the fastest ways to get help, is to use the RStudio help +interface. This panel by default can be found at the lower right hand +panel of RStudio. As seen in the screenshot, by typing the word +"Mean", RStudio tries to also give a number of suggestions that you +might be interested in. 
The description is then shown in the display +window. + +### I know the name of the function I want to use, but I'm not sure how to use it + +If you need help with a specific function, let's say `barplot()`, you +can type: + +```{r, eval=FALSE, purl=TRUE} +?barplot +``` + +If you just need to remind yourself of the names of the arguments, you can use: + +```{r, eval=FALSE, purl=TRUE} +args(lm) +``` + +### I want to use a function that does X, there must be a function for it but I don't know which one... + +If you are looking for a function to do a particular task, you can use the +`help.search()` function, which is called by the double question mark `??`. +However, this only looks through the installed packages for help pages with a +match to your search request: + +```{r, eval=FALSE, purl=TRUE} +??kruskal +``` + +If you can't find what you are looking for, you can use +the [rdocumentation.org](https://www.rdocumentation.org) website, which searches +through the help files across all packages available. + +Finally, a generic Google or internet search for "R \<task>" will often send +you either to the appropriate package documentation or to a helpful forum where someone +else has already asked your question. + +### I am stuck... I get an error message that I don't understand + +Start by googling the error message. However, this doesn't always work very well +because package developers often rely on the error catching provided by R. You +end up with general error messages that might not be very helpful to diagnose a +problem (e.g. "subscript out of bounds"). If the message is very generic, you +might also include the name of the function or package you're using in your +query. + +You should also check Stack Overflow. Search using the `[r]` tag.
Most +questions have already been answered, but the challenge is to use the right +words in the search to find the +answers: + +[http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) + +The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can +also be dense for people with little programming experience but it is a good +place to understand the underpinnings of the R language. + +The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical +but it is full of useful information. + +### Asking for help + +The key to receiving help from someone is for them to rapidly grasp +your problem. You should make it as easy as possible to pinpoint where +the issue might be. + +Try to use the correct words to describe your problem. For instance, a +package is not the same thing as a library. Most people will +understand what you meant, but others have really strong feelings +about the difference in meaning. The key point is that it can make +things confusing for people trying to help you. Be as precise as +possible when describing your problem. + +If possible, try to reduce what doesn't work to a simple _reproducible +example_. If you can reproduce the problem using a very small data +frame instead of your 50000 rows and 10000 columns one, provide the +small one with the description of your problem. When appropriate, try +to generalise what you are doing so even people who are not in your +field can understand the question. For instance instead of using a +subset of your real dataset, create a small (3 columns, 5 rows) +generic one. For more information on how to write a reproducible +example see this article by Hadley +Wickham. + +To share an object with someone else, if it's relatively small, you +can use the function `dput()`. 
It will output R code that can be used +to recreate the exact same object as the one in memory: + +```{r, results="show", purl=TRUE} +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame +dput(head(iris)) +``` + +If the object is larger, provide either the raw file (i.e., your CSV +file) with your script up to the point of the error (and after +removing everything that is not relevant to your +issue). Alternatively, in particular if your question is not related +to a data frame, you can save any R object to a file[^export]: + +```{r, eval=FALSE, purl=FALSE} +saveRDS(iris, file="/tmp/iris.rds") +``` + +The content of this file is however not human readable and cannot be +posted directly on Stack Overflow. Instead, it can be sent to someone +by email who can read it with the `readRDS()` command (here it is +assumed that the downloaded file is in a `Downloads` folder in the +user's home directory): + +```{r, eval=FALSE, purl=FALSE} +some_data <- readRDS(file="~/Downloads/iris.rds") +``` + +Last, but certainly not least, **always include the output of `sessionInfo()`** +as it provides critical information about your platform, the versions of R and +the packages that you are using, and other information that can be very helpful +to understand your problem. + +```{r, results="show", purl=TRUE} +sessionInfo() +``` + +### Where to ask for help? + +- The person sitting next to you during the course. Don't hesitate to + talk to your neighbour during the workshop, compare your answers, + and ask for help. +- Your friendly colleagues: if you know someone with more experience + than you, they might be able and willing to help you. +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if + your question hasn't been answered before and is well crafted, + chances are you will get an answer in less than 5 min. Remember to + follow their guidelines on how to ask a good + question. 
+- The R-help mailing + list: it is read by a + lot of people (including most of the R core team), a lot of people + post to it, but the tone can be pretty dry, and it is not always + very welcoming to new users. If your question is valid, you are + likely to get an answer very fast but don't expect that it will come + with smiley faces. Also, here more than anywhere else, be sure to + use correct vocabulary (otherwise you might get an answer pointing + to the misuse of your words rather than answering your + question). You will also have more success if your question is about + a base function rather than a specific package. +- If your question is about a specific package, see if there is a + mailing list for it. Usually it's included in the DESCRIPTION file + of the package that can be accessed using + `packageDescription("name-of-package")`. You may also want to try to + email the author of the package directly, or open an issue on the + code repository (e.g., GitHub). +- There are also some topic-specific mailing lists (GIS, + phylogenetics, etc.), the complete list is + [here](https://www.r-project.org/mail.html). + +### More resources + +- The [Posting Guide](https://www.r-project.org/posting-guide.html) for + the R mailing lists. + +- How to ask for R + help + useful guidelines. + +- This blog post by Jon + Skeet + has quite comprehensive advice on how to ask programming questions. + +- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package + is very helpful to create reproducible examples when asking for + help. The rOpenSci community call "How to ask questions so they get + answered" (GitHub + link and video + recording) includes a presentation of + the reprex package and of its philosophy. + +## R packages + +### Loading packages + +As we have seen above, R packages play a fundamental role in R. To +make use of a package's functionality, assuming it is installed, we +first need to load it to be able to use it.
This is done with the +`library()` function. Below, we load `ggplot2`. + +```{r loadp, eval=FALSE, purl=TRUE} +library("ggplot2") +``` + +### Installing packages + +The default package repository is the _Comprehensive R Archive +Network_ (CRAN), and any package that is available on CRAN can be +installed with the `install.packages()` function. Below, for example, +we install the `dplyr` package that we will learn about later. + +```{r craninstall, eval=FALSE, purl=TRUE} +install.packages("dplyr") +``` + +This command will install the `dplyr` package as well as all its +dependencies, i.e. all the packages that it relies on to function. + +Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +namely `BiocManager`, that can be installed from CRAN with: + +```{r, eval=FALSE, purl=TRUE} +install.packages("BiocManager") +``` + +Individual packages such as `SummarizedExperiment` (we will use it +later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be +installed with `BiocManager::install()`. + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("SummarizedExperiment") +BiocManager::install("DESeq2") +``` + +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded.
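
The installation steps above can be combined into a small check that lists which packages are still missing before anything is installed. This is a minimal sketch, not part of the lesson: the helper name `missing_pkgs` and the vector `wanted` are hypothetical, and the final install call is left commented out because it needs network access and an installed `BiocManager`.

```r
# Hypothetical helper (not part of BiocManager): return the packages
# from `pkgs` whose namespace cannot be loaded from the current library paths.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Illustrative list of packages we would like to have available.
wanted <- c("SummarizedExperiment", "DESeq2")
to_install <- missing_pkgs(wanted)
to_install

# Not run here: install only what is missing and skip the update prompt.
# if (length(to_install) > 0) {
#   BiocManager::install(to_install, update = FALSE)
# }
```

Setting `update = FALSE` skips the check for newer versions of already installed packages, which can be convenient in the middle of a session; updating is then best done separately in a fresh R session, as recommended above.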
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Start using R and RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 971fc84fea5c5c43e6a93c4e1c45bbf0b7fffa87 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:38 +0900 Subject: [PATCH 012/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 921 ++++++++++++++++++++++ 1 file changed, 921 insertions(+) create mode 100644 locale/fr/episodes/23-starting-with-r.Rmd diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd new file mode 100644 index 000000000..47ac62388 --- /dev/null +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -0,0 +1,921 @@ +--- +source: Rmd +title: Introduction to R +teaching: 60 +exercises: 60 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define the following terms as they relate to R: object, assign, call, function, arguments, options. +- Assign values to objects in R. +- Learn how to _name_ objects +- Use comments to inform script. +- Solve simple arithmetic operations in R. +- Call functions and use arguments to change their default options. +- Inspect the content of vectors and manipulate their content. +- Subset and extract values from vectors. +- Analyze vectors with missing data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First commands in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating objects in R + +You can get output from R simply by typing math in the console: + +```{r, purl=TRUE} +3 + 5 +12 / 7 +``` + +However, to do useful and interesting things, we need to assign _values_ to +_objects_. 
To create an object, we need to give it a name followed by the +assignment operator `<-`, and the value we want to give it: + +```{r, purl=TRUE} +weight_kg <- 55 +``` + +`<-` is the assignment operator. It assigns values on the right to +objects on the left. So, after executing `x <- 3`, the value of `x` is +`3`. The arrow can be read as 3 **goes into** `x`. For historical +reasons, you can also use `=` for assignments, but not in every +context. Because of the +[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +in syntax, it is good practice to always use `<-` for assignments. + +In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +same in a Mac. + +### Naming variables + +Objects can be given any name such as `x`, `current_temperature`, or +`subject_id`. You want your object names to be explicit and not too +long. They cannot start with a number (`2x` is not valid, but `x2` +is). R is case sensitive (e.g., `weight_kg` is different from +`Weight_kg`). There are some names that cannot be used because they +are the names of fundamental functions in R (e.g., `if`, `else`, +`for`, see +[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +for a complete list). In general, even if it's allowed, it's best to +not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, +`weights`). If in doubt, check the help to see if the name is already +in use. It's also best to avoid dots (`.`) within an object name as in +`my.dataset`. There are many functions in R with dots in their names +for historical reasons, but because dots have a special meaning in R +(for methods) and other programming languages, it's best to avoid +them. 
It is also recommended to use nouns for object names, and verbs +for function names. It's important to be consistent in the styling of +your code (where you put spaces, how you name objects, etc.). Using a +consistent coding style makes your code clearer to read for your +future self and your collaborators. In R, some popular style guides +are [Google's](https://google.github.io/styleguide/Rguide.xml), the +[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +style +guide. The +tidyverse's is very comprehensive and may seem overwhelming at +first. You can install the +[**`lintr`**](https://github.com/jimhester/lintr) package to +automatically check for issues in the styling of your code. + +> **Objects vs. variables**: What are known as `objects` in `R` are +> known as `variables` in many other programming languages. Depending +> on the context, `object` and `variable` can have drastically +> different meanings. However, in this lesson, the two words are used +> synonymously. For more information +> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) + +When assigning a value to an object, R does not print anything. You +can force R to print the value by using parentheses or by typing the +object name: + +```{r, purl=TRUE} +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object +``` + +Now that R has `weight_kg` in memory, we can do arithmetic with it. 
For +instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): + +```{r, purl=TRUE} +2.2 * weight_kg +``` + +We can also change an object's value by assigning it a new one: + +```{r, purl=TRUE} +weight_kg <- 57.5 +2.2 * weight_kg +``` + +Assigning a value to one object does not change the values of +other objects. For example, let's store the animal's weight in pounds in a new +object, `weight_lb`: + +```{r, purl=TRUE} +weight_lb <- 2.2 * weight_kg +``` + +and then change `weight_kg` to 100. + +```{r} +weight_kg <- 100 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What do you think is the current content of the object `weight_lb`? +126.5 or 220? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Comments + +The comment character in R is `#`; anything to the right of a `#` in a +script will be ignored by R. It is useful to leave notes and +explanations in your scripts. + +RStudio makes it easy to comment or uncomment a paragraph: after +selecting the lines you want to comment, press at the same time on +your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If +you only want to comment out one line, you can put the cursor at any +location of that line (i.e. no need to select the whole line), then +press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +What are the values after each statement in the following? + +```{r, purl=TRUE} +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Functions and their arguments + +Functions are "canned scripts" that automate more complicated sets of commands +including operations, assignments, etc. Many functions are predefined, or can be +made available by importing R _packages_ (more on that later).
A function +usually gets one or more inputs called _arguments_. Functions often (but not +always) return a _value_. A typical example would be the function `sqrt()`. The +input (the argument) must be a number, and the return value (in fact, the +output) is the square root of that number. Executing a function ('running it') +is called _calling_ the function. An example of a function call is: + +```{r, eval=FALSE, purl=FALSE} +b <- sqrt(a) +``` + +Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function +calculates the square root, and returns the value which is then assigned to +the object `b`. This function is very simple, because it takes just one argument. + +The return 'value' of a function need not be numerical (like that of `sqrt()`), +and it also does not need to be a single item: it can be a set of things, or +even a dataset. We'll see that when we read data files into R. + +Arguments can be anything, not only numbers or filenames, but also other +objects. Exactly what each argument means differs per function, and must be +looked up in the documentation (see below). Some functions take arguments which +may either be specified by the user, or, if left out, take on a _default_ value: +these are called _options_. Options are typically used to alter the way the +function operates, such as whether it ignores 'bad values', or what symbol to +use in a plot. However, if you want something specific, you can specify a value +of your choice which will be used instead of the default. + +Let's try a function that can take multiple arguments: `round()`. + +```{r, results="show", purl=TRUE} +round(3.14159) +``` + +Here, we've called `round()` with just one argument, `3.14159`, and it has +returned the value `3`. That's because the default is to round to the nearest +whole number. If we want more digits we can see how to do that by getting +information about the `round` function. 
We can use `args(round)` or look at the +help for this function using `?round`. + +```{r, results="show", purl=TRUE} +args(round) +``` + +```{r, eval=FALSE, purl=TRUE} +?round +``` + +We see that if we want a different number of digits, we can +type `digits = 2` or however many we want. + +```{r, results="show", purl=TRUE} +round(3.14159, digits = 2) +``` + +If you provide the arguments in the exact same order as they are defined, you +don't have to name them: + +```{r, results="show", purl=TRUE} +round(3.14159, 2) +``` + +And if you do name the arguments, you can switch their order: + +```{r, results="show", purl=TRUE} +round(digits = 2, x = 3.14159) +``` + +It's good practice to put the non-optional arguments (like the number you're +rounding) first in your function call, and to specify the names of all optional +arguments. If you don't, someone reading your code might have to look up the +definition of a function with unfamiliar arguments to understand what you're +doing. By specifying the name of the arguments you are also safeguarding +against possible future changes in the function interface, which may +potentially add new arguments in between the existing ones. + +## Vectors and data types + +A vector is the most common and basic data type in R, and is pretty much +the workhorse of R. A vector is composed of a series of values, such as +numbers or characters. We can assign a series of values to a vector using +the `c()` function. For example, we can create a vector of animal weights and assign +it to a new object `weight_g`: + +```{r, purl=TRUE} +weight_g <- c(50, 60, 65, 82) +weight_g +``` + +A vector can also contain characters: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein") +molecules +``` + +The quotes around "dna", "rna", etc. are essential here. Without the +quotes, R will assume there are objects called `dna`, `rna` and +`protein`. As these objects don't exist in R's memory, there will be +an error message.
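
To see the effect of the missing quotes without stopping a script, the error can be captured with `tryCatch()`. The sketch below is illustrative only: it assumes that no object called `dna` exists in your session, and the name `msg` is introduced here just for the example.

```r
# With quotes: a character vector of three strings.
molecules <- c("dna", "rna", "protein")

# Without quotes, R looks for objects named dna, rna and protein.
# tryCatch() captures the resulting error so that the script keeps running.
# Assumption: no object called `dna` exists in the current session.
msg <- tryCatch(
  c(dna, rna, protein),
  error = function(e) conditionMessage(e)
)
msg
```

The captured message names the first object R failed to find, which is usually a good hint that quotes were forgotten.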
+ +There are many functions that allow you to inspect the content of a +vector. `length()` tells you how many elements are in a particular vector: + +```{r, purl=TRUE} +length(weight_g) +length(molecules) +``` + +An important feature of a vector is that all of the elements are the +same type of data. The function `class()` indicates the class (the +type of element) of an object: + +```{r, purl=TRUE} +class(weight_g) +class(molecules) +``` + +The function `str()` provides an overview of the structure of an +object and its elements. It is a useful function when working with +large and complex objects: + +```{r, purl=TRUE} +str(weight_g) +str(molecules) +``` + +You can use the `c()` function to add other elements to your vector: + +```{r} +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g +``` + +In the first line, we take the original vector `weight_g`, add the +value `90` to the end of it, and save the result back into +`weight_g`. Then we add the value `30` to the beginning, again saving +the result back into `weight_g`. + +We can do this over and over again to grow a vector, or assemble a +dataset. As we program, this may be useful to add results that we are +collecting or calculating. + +An **atomic vector** is the simplest R **data type** and is a linear +vector of a single type. Above, we saw 2 of the 6 main **atomic +vector** types that R uses: `"character"` and `"numeric"` (or +`"double"`). These are the basic building blocks that all R objects +are built from.
The other 4 **atomic vector** types are: + +- `"logical"` for `TRUE` and `FALSE` (the boolean data type) +- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R + that it's an integer) +- `"complex"` to represent complex numbers with real and imaginary + parts (e.g., `1 + 4i`) and that's all we're going to say about them +- `"raw"` for bitstreams that we won't discuss further + +You can check the type of your vector using the `typeof()` function +and inputting your vector as the argument. + +Vectors are one of the many **data structures** that R uses. Other +important ones are lists (`list`), matrices (`matrix`), data frames +(`data.frame`), factors (`factor`) and arrays (`array`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We've seen that atomic vectors can be of type character, numeric (or +double), integer, and logical. But what happens if we try to mix +these types in a single vector? + +::::::::::::::: solution + +## Solution + +R implicitly converts them to all be the same type + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What will happen in each of these examples? (hint: use `class()` to +check the data type of your objects and type in their names to see what happens): + +```{r, eval=TRUE} +num_char <- c(1, 2, 3, "a") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") +``` + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +class(num_char) +num_char +class(num_logical) +num_logical +class(char_logical) +char_logical +class(tricky) +tricky +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Why do you think it happens? + +::::::::::::::: solution + +## Solution + +Vectors can be of only one data type. 
R tries to convert (coerce) +the content of this vector to find a _common denominator_ that +doesn't lose any information. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +How many values in `combined_logical` are `"TRUE"` (as a character) +in the following example: + +```{r, eval=TRUE} +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) +``` + +::::::::::::::: solution + +## Solution + +Only one. There is no memory of past data types, and the coercion +happens the first time the vector is evaluated. Therefore, the `TRUE` +in `num_logical` gets converted into a `1` before it gets converted +into `"1"` in `combined_logical`. + +```{r} +combined_logical +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +In R, we call converting objects from one class into another class +_coercion_. These conversions happen according to a hierarchy, +whereby some types get preferentially coerced into other types. Can +you draw a diagram that represents the hierarchy of how these data +types are coerced? + +::::::::::::::: solution + +## Solution + +logical → numeric → character ← logical + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE, eval=FALSE, purl=TRUE} +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. But what happens if we try to mix these types in a single +## vector? + +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) +num_char <- c(1, 2, 3, "a") + +num_logical <- c(1, 2, 3, TRUE) + +char_logical <- c("a", "b", "c", TRUE) + +tricky <- c(1, 2, 3, "4") + +## Why do you think it happens? 
+ +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? +``` + +## Subsetting vectors + +If we want to extract one or several values from a vector, we must +provide one or several indices in square brackets. For instance: + +```{r, results="show", purl=TRUE} +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] +``` + +We can also repeat the indices to create an object with more elements +than the original one: + +```{r, results="show", purl=TRUE} +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules +``` + +R indices start at 1. Programming languages like Fortran, MATLAB, +Julia, and R start counting at 1, because that's what human beings +typically do. Languages in the C family (including C++, Java, Perl, +and Python) count from 0 because that's simpler for computers to do. + +Finally, it is also possible to get all the elements of a vector +except some specified elements using negative indices: + +```{r} +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones +``` + +## Conditional subsetting + +Another common way of subsetting is by using a logical vector. `TRUE` will +select the element with the same index, while `FALSE` will not: + +```{r, purl=TRUE} +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +``` + +Typically, these logical vectors are not typed by hand, but are the +output of other functions or logical tests. 
For instance, if you +wanted to select only the values above 50: + +```{r, purl=TRUE} +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] +``` + +You can combine multiple tests using `&` (both conditions are true, +AND) or `|` (at least one of the conditions is true, OR): + +```{r, results="show", purl=TRUE} +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] +``` + +Here, `<` stands for "less than", `>` for "greater than", `>=` for +"greater than or equal to", and `==` for "equal to". The double equal +sign `==` is a test for numerical equality between the left and right +hand sides, and should not be confused with the single `=` sign, which +performs variable assignment (similar to `<-`). + +A common task is to search for certain strings in a vector. One could +use the "or" operator `|` to test for equality to multiple values, but +this can quickly become tedious. The function `%in%` allows you to +test if any of the elements of a search vector are found: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein", "peptide") +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") +molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you figure out why `"four" > "five"` returns `TRUE`? + +::::::::::::::: solution + +## Solution + +```{r} +"four" > "five" +``` + +When using `>` or `<` on strings, R compares their alphabetical order. +Here `"four"` comes after `"five"`, and therefore is _greater than_ +it. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Names + +It is possible to name each element of a vector. 
The code chunk below +shows an initial vector without any names, how names are set, and +retrieved. + +```{r} +x <- c(1, 5, 3, 5, 10) +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names +``` + +When a vector has names, it is possible to access elements by their +name, in addition to their index. + +```{r} +x[c(1, 3)] +x[c("A", "C")] +``` + +## Missing data + +As R was designed to analyze datasets, it includes the concept of +missing data (which is uncommon in other programming +languages). Missing data are represented in vectors as `NA`. + +When doing operations on numbers, most functions will return `NA` if +the data you are working with include missing values. This feature +makes it harder to overlook the cases where you are dealing with +missing data. You can add the argument `na.rm = TRUE` to calculate +the result while ignoring the missing values. + +```{r} +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) +``` + +If your data include missing values, you may want to become familiar +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See +below for examples. + +```{r} +## Extract those elements which are not missing values. +heights[!is.na(heights)] + +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +na.omit(heights) + +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +1. Using this vector of heights in inches, create a new vector with the NAs removed. + +```{r} +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +``` + +2. Use the function `median()` to calculate the median of the `heights` vector. +3. 
Use R to figure out how many people in the set are taller than 67 inches. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +heights_no_na <- heights[!is.na(heights)] +## or +heights_no_na <- na.omit(heights) +``` + +```{r, purl=TRUE} +median(heights, na.rm = TRUE) +``` + +```{r, purl=TRUE} +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Generating vectors {#sec:genvec} + +```{r, echo=FALSE} +set.seed(1) +``` + +### Constructors + +There exist functions to generate vectors of different types. To +generate a vector of numerics, one can use the `numeric()` +constructor, providing the length of the output vector as a +parameter. The values will be initialised with 0. + +```{r, purl=TRUE} +numeric(3) +numeric(10) +``` + +Note that if we ask for a vector of numerics of length 0, we obtain +exactly that: + +```{r, purl=TRUE} +numeric(0) +``` + +There are similar constructors for characters and logicals, named +`character()` and `logical()` respectively. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What are the defaults for character and logical vectors? + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +character(2) ## the empty character +logical(2) ## FALSE +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Replicate elements + +The `rep` function allows us to repeat a value a certain number of +times. If we want to initialise a vector of numerics of length 5 with +the value -1, for example, we could do the following: + +```{r, purl=TRUE} +rep(-1, 5) +``` + +Similarly, to generate a vector populated with missing values, which +is often a good way to start, without setting assumptions on the data +to be collected: + +```{r, purl=TRUE} +rep(NA, 5) +``` + +`rep` can take vectors of any length as input (above, we used vectors +of length 1) and any type.
For example, if we want to repeat the +values 1, 2 and 3 five times, we would do the following: + +```{r, purl=TRUE} +rep(c(1, 2, 3), 5) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What if we wanted to repeat the values 1, 2 and 3 five times, but +obtain five 1s, five 2s and five 3s in that order? There are two +possibilities - see `?rep` or `?sort` for help. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rep(c(1, 2, 3), each = 5) +sort(rep(c(1, 2, 3), 5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Sequence generation + +Another very useful function is `seq`, to generate a sequence of +numbers. For example, to generate a sequence of integers from 1 to 20 +by steps of 2, one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, by = 2) +``` + +The default value of `by` is 1 and, given that the generation of a +sequence of one value to another with steps of 1 is frequently used, +there's a shortcut: + +```{r, purl=TRUE} +seq(1, 5, 1) +seq(1, 5) ## default by +1:5 +``` + +To generate a sequence of numbers from 1 to 20 of final length of 3, +one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, length.out = 3) +``` + +### Random samples and permutations + +A last group of useful functions are those that generate random +data. The first one, `sample`, generates a random permutation of +another vector. For example, to draw a random order to 10 students +oral exam, I first assign each student a number from 1 to ten (for +instance based on the alphabetic order of their name) and then: + +```{r, purl=TRUE} +sample(1:10) +``` + +Without further arguments, `sample` will return a permutation of all +elements of the vector. If I want a random sample of a certain size, I +would set this value as the second argument. 
Below, I sample 5 random +letters from the alphabet contained in the pre-defined `letters` vector: + +```{r, purl=TRUE} +sample(letters, 5) +``` + +If I wanted an output larger than the input vector, or being able to +draw some elements multiple times, I would need to set the `replace` +argument to `TRUE`: + +```{r, purl=TRUE} +sample(1:5, 10, replace = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +When trying the functions above out, you will have realised that the +samples are indeed random and that one doesn't get the same +permutation twice. To be able to reproduce these random draws, one can +set the random number generation seed manually with `set.seed()` +before drawing the random sample. + +Test this feature with your neighbour. First draw two random +permutations of `1:10` independently and observe that you get +different results. + +Now set the seed with, for example, `set.seed(123)` and repeat the +random draw. Observe that you now get the same random draws. + +Repeat by setting a different seed. + +::::::::::::::: solution + +## Solution + +Different permutations + +```{r, purl=TRUE} +sample(1:10) +sample(1:10) +``` + +Same permutations with seed 123 + +```{r, purl=TRUE} +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) +``` + +A different seed + +```{r, purl=TRUE} +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Drawing samples from a normal distribution + +The last function we are going to see is `rnorm`, that draws a random +sample from a normal distribution. Two normal distributions of means 0 +and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and +_N(100, 5)_, are shown below. 
+ +```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} +par(mfrow = c(1, 2)) +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +``` + +The three arguments, `n`, `mean` and `sd`, define the size of the +sample, and the parameters of the normal distribution, i.e., the mean +and the standard deviation. The defaults of the latter are 0 and 1. + +```{r, purl=TRUE} +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) +``` + +Now that we have learned how to write scripts, and the basics of R's +data structures, we are ready to start working with larger data, and +learn about data frames. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- How to interact with R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 30153c6f3569205f21710a0d316c411a2455e7bb Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:40 +0900 Subject: [PATCH 013/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 921 ++++++++++++++++++++++ 1 file changed, 921 insertions(+) create mode 100644 locale/es/episodes/23-starting-with-r.Rmd diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd new file mode 100644 index 000000000..88f9cfc4d --- /dev/null +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -0,0 +1,921 @@ +--- +source: Rmd +title: Introduction to R +teaching: 60 +exercises: 60 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Define the following terms as they relate to R: object, assign, call, function, arguments, options. +- Assign values to objects in R. +- Learn how to _name_ objects +- Use comments to inform script. +- Solve simple arithmetic operations in R. +- Call functions and use arguments to change their default options. 
+- Inspect the content of vectors and manipulate their content. +- Subset and extract values from vectors. +- Analyze vectors with missing data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First commands in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating objects in R + +You can get output from R simply by typing math in the console: + +```{r, purl=TRUE} +3 + 5 +12 / 7 +``` + +However, to do useful and interesting things, we need to assign _values_ to +_objects_. To create an object, we need to give it a name followed by the +assignment operator `<-`, and the value we want to give it: + +```{r, purl=TRUE} +weight_kg <- 55 +``` + +`<-` is the assignment operator. It assigns values on the right to +objects on the left. So, after executing `x <- 3`, the value of `x` is +`3`. The arrow can be read as 3 **goes into** `x`. For historical +reasons, you can also use `=` for assignments, but not in every +context. Because of the +[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +in syntax, it is good practice to always use `<-` for assignments. + +In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +same in a Mac. + +### Naming variables + +Objects can be given any name such as `x`, `current_temperature`, or +`subject_id`. You want your object names to be explicit and not too +long. They cannot start with a number (`2x` is not valid, but `x2` +is). R is case sensitive (e.g., `weight_kg` is different from +`Weight_kg`). 
There are some names that cannot be used because they +are the names of fundamental functions in R (e.g., `if`, `else`, +`for`, see +[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +for a complete list). In general, even if it's allowed, it's best to +not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, +`weights`). If in doubt, check the help to see if the name is already +in use. It's also best to avoid dots (`.`) within an object name as in +`my.dataset`. There are many functions in R with dots in their names +for historical reasons, but because dots have a special meaning in R +(for methods) and other programming languages, it's best to avoid +them. It is also recommended to use nouns for object names, and verbs +for function names. It's important to be consistent in the styling of +your code (where you put spaces, how you name objects, etc.). Using a +consistent coding style makes your code clearer to read for your +future self and your collaborators. In R, some popular style guides +are [Google's](https://google.github.io/styleguide/Rguide.xml), the +[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +style +guide. The +tidyverse's is very comprehensive and may seem overwhelming at +first. You can install the +[**`lintr`**](https://github.com/jimhester/lintr) package to +automatically check for issues in the styling of your code. + +> **Objects vs. variables**: What are known as `objects` in `R` are +> known as `variables` in many other programming languages. Depending +> on the context, `object` and `variable` can have drastically +> different meanings. However, in this lesson, the two words are used +> synonymously. For more information +> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) + +When assigning a value to an object, R does not print anything. 
You +can force R to print the value by using parentheses or by typing the +object name: + +```{r, purl=TRUE} +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object +``` + +Now that R has `weight_kg` in memory, we can do arithmetic with it. For +instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): + +```{r, purl=TRUE} +2.2 * weight_kg +``` + +We can also change an object's value by assigning it a new one: + +```{r, purl=TRUE} +weight_kg <- 57.5 +2.2 * weight_kg +``` + +This means that assigning a value to one object does not change the values of +other objects. For example, let's store the animal's weight in pounds in a new +object, `weight_lb`: + +```{r, purl=TRUE} +weight_lb <- 2.2 * weight_kg +``` + +and then change `weight_kg` to 100. + +```{r} +weight_kg <- 100 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What do you think is the current content of the object `weight_lb`? +126.5 or 220? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Comments + +The comment character in R is `#`; anything to the right of a `#` in a +script will be ignored by R. It is useful to leave notes and +explanations in your scripts. + +RStudio makes it easy to comment or uncomment a paragraph: after +selecting the lines you want to comment, press at the same time on +your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If +you only want to comment out one line, you can put the cursor at any +location of that line (i.e. no need to select the whole line), then +press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +What are the values after each statement in the following? + +```{r, purl=TRUE} +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? 
+age <- age - 20 # age? +mass_index <- mass/age # mass_index? +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Functions and their arguments + +Functions are "canned scripts" that automate more complicated sets of commands +including operations, assignments, etc. Many functions are predefined, or can be +made available by importing R _packages_ (more on that later). A function +usually gets one or more inputs called _arguments_. Functions often (but not +always) return a _value_. A typical example would be the function `sqrt()`. The +input (the argument) must be a number, and the return value (in fact, the +output) is the square root of that number. Executing a function ('running it') +is called _calling_ the function. An example of a function call is: + +```{r, eval=FALSE, purl=FALSE} +b <- sqrt(a) +``` + +Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function +calculates the square root, and returns the value which is then assigned to +the object `b`. This function is very simple, because it takes just one argument. + +The return 'value' of a function need not be numerical (like that of `sqrt()`), +and it also does not need to be a single item: it can be a set of things, or +even a dataset. We'll see that when we read data files into R. + +Arguments can be anything, not only numbers or filenames, but also other +objects. Exactly what each argument means differs per function, and must be +looked up in the documentation (see below). Some functions take arguments which +may either be specified by the user, or, if left out, take on a _default_ value: +these are called _options_. Options are typically used to alter the way the +function operates, such as whether it ignores 'bad values', or what symbol to +use in a plot. However, if you want something specific, you can specify a value +of your choice which will be used instead of the default. + +Let's try a function that can take multiple arguments: `round()`. 
+ +```{r, results="show", purl=TRUE} +round(3.14159) +``` + +Here, we've called `round()` with just one argument, `3.14159`, and it has +returned the value `3`. That's because the default is to round to the nearest +whole number. If we want more digits we can see how to do that by getting +information about the `round` function. We can use `args(round)` or look at the +help for this function using `?round`. + +```{r, results="show", purl=TRUE} +args(round) +``` + +```{r, eval=FALSE, purl=TRUE} +?round +``` + +We see that if we want a different number of digits, we can +type `digits=2` or however many we want. + +```{r, results="show", purl=TRUE} +round(3.14159, digits = 2) +``` + +If you provide the arguments in the exact same order as they are defined you +don't have to name them: + +```{r, results="show", purl=TRUE} +round(3.14159, 2) +``` + +And if you do name the arguments, you can switch their order: + +```{r, results="show", purl=TRUE} +round(digits = 2, x = 3.14159) +``` + +It's good practice to put the non-optional arguments (like the number you're +rounding) first in your function call, and to specify the names of all optional +arguments. If you don't, someone reading your code might have to look up the +definition of a function with unfamiliar arguments to understand what you're +doing. By specifying the name of the arguments you are also safeguarding +against possible future changes in the function interface, which may +potentially add new arguments in between the existing ones. + +## Vectors and data types + +A vector is the most common and basic data type in R, and is pretty much +the workhorse of R. A vector is composed of a series of values, such as +numbers or characters. We can assign a series of values to a vector using +the `c()` function. 
For example we can create a vector of animal weights and assign +it to a new object `weight_g`: + +```{r, purl=TRUE} +weight_g <- c(50, 60, 65, 82) +weight_g +``` + +A vector can also contain characters: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein") +molecules +``` + +The quotes around "dna", "rna", etc. are essential here. Without the +quotes R will assume there are objects called `dna`, `rna` and +`protein`. As these objects don't exist in R's memory, there will be +an error message. + +There are many functions that allow you to inspect the content of a +vector. `length()` tells you how many elements are in a particular vector: + +```{r, purl=TRUE} +length(weight_g) +length(molecules) +``` + +An important feature of a vector, is that all of the elements are the +same type of data. The function `class()` indicates the class (the +type of element) of an object: + +```{r, purl=TRUE} +class(weight_g) +class(molecules) +``` + +The function `str()` provides an overview of the structure of an +object and its elements. It is a useful function when working with +large and complex objects: + +```{r, purl=TRUE} +str(weight_g) +str(molecules) +``` + +You can use the `c()` function to add other elements to your vector: + +```{r} +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g +``` + +In the first line, we take the original vector `weight_g`, add the +value `90` to the end of it, and save the result back into +`weight_g`. Then we add the value `30` to the beginning, again saving +the result back into `weight_g`. + +We can do this over and over again to grow a vector, or assemble a +dataset. As we program, this may be useful to add results that we are +collecting or calculating. + +An **atomic vector** is the simplest R **data type** and is a linear +vector of a single type. 
Above, we saw 2 of the 6 main **atomic +vector** types that R uses: `"character"` and `"numeric"` (or +`"double"`). These are the basic building blocks that all R objects +are built from. The other 4 **atomic vector** types are: + +- `"logical"` for `TRUE` and `FALSE` (the boolean data type) +- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R + that it's an integer) +- `"complex"` to represent complex numbers with real and imaginary + parts (e.g., `1 + 4i`) and that's all we're going to say about them +- `"raw"` for bitstreams that we won't discuss further + +You can check the type of your vector using the `typeof()` function +and inputting your vector as the argument. + +Vectors are one of the many **data structures** that R uses. Other +important ones are lists (`list`), matrices (`matrix`), data frames +(`data.frame`), factors (`factor`) and arrays (`array`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We've seen that atomic vectors can be of type character, numeric (or +double), integer, and logical. But what happens if we try to mix +these types in a single vector? + +::::::::::::::: solution + +## Solution + +R implicitly converts them to all be the same type + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What will happen in each of these examples? 
(hint: use `class()` to +check the data type of your objects and type in their names to see what happens): + +```{r, eval=TRUE} +num_char <- c(1, 2, 3, "a") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") +``` + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +class(num_char) +num_char +class(num_logical) +num_logical +class(char_logical) +char_logical +class(tricky) +tricky +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Why do you think it happens? + +::::::::::::::: solution + +## Solution + +Vectors can be of only one data type. R tries to convert (coerce) +the content of this vector to find a _common denominator_ that +doesn't lose any information. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +How many values in `combined_logical` are `"TRUE"` (as a character) +in the following example: + +```{r, eval=TRUE} +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) +``` + +::::::::::::::: solution + +## Solution + +Only one. There is no memory of past data types, and the coercion +happens the first time the vector is evaluated. Therefore, the `TRUE` +in `num_logical` gets converted into a `1` before it gets converted +into `"1"` in `combined_logical`. + +```{r} +combined_logical +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +In R, we call converting objects from one class into another class +_coercion_. These conversions happen according to a hierarchy, +whereby some types get preferentially coerced into other types. 
Can +you draw a diagram that represents the hierarchy of how these data +types are coerced? + +::::::::::::::: solution + +## Solution + +logical → numeric → character ← logical + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE, eval=FALSE, purl=TRUE} +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. But what happens if we try to mix these types in a single +## vector? + +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) +num_char <- c(1, 2, 3, "a") + +num_logical <- c(1, 2, 3, TRUE) + +char_logical <- c("a", "b", "c", TRUE) + +tricky <- c(1, 2, 3, "4") + +## Why do you think it happens? + +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? +``` + +## Subsetting vectors + +If we want to extract one or several values from a vector, we must +provide one or several indices in square brackets. For instance: + +```{r, results="show", purl=TRUE} +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] +``` + +We can also repeat the indices to create an object with more elements +than the original one: + +```{r, results="show", purl=TRUE} +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules +``` + +R indices start at 1. Programming languages like Fortran, MATLAB, +Julia, and R start counting at 1, because that's what human beings +typically do. Languages in the C family (including C++, Java, Perl, +and Python) count from 0 because that's simpler for computers to do. 
+ +Finally, it is also possible to get all the elements of a vector +except some specified elements using negative indices: + +```{r} +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones +``` + +## Conditional subsetting + +Another common way of subsetting is by using a logical vector. `TRUE` will +select the element with the same index, while `FALSE` will not: + +```{r, purl=TRUE} +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +``` + +Typically, these logical vectors are not typed by hand, but are the +output of other functions or logical tests. For instance, if you +wanted to select only the values above 50: + +```{r, purl=TRUE} +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] +``` + +You can combine multiple tests using `&` (both conditions are true, +AND) or `|` (at least one of the conditions is true, OR): + +```{r, results="show", purl=TRUE} +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] +``` + +Here, `<` stands for "less than", `>` for "greater than", `>=` for +"greater than or equal to", and `==` for "equal to". The double equal +sign `==` is a test for numerical equality between the left and right +hand sides, and should not be confused with the single `=` sign, which +performs variable assignment (similar to `<-`). + +A common task is to search for certain strings in a vector. One could +use the "or" operator `|` to test for equality to multiple values, but +this can quickly become tedious. 
The function `%in%` allows you to +test if any of the elements of a search vector are found: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein", "peptide") +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") +molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you figure out why `"four" > "five"` returns `TRUE`? + +::::::::::::::: solution + +## Solution + +```{r} +"four" > "five" +``` + +When using `>` or `<` on strings, R compares their alphabetical order. +Here `"four"` comes after `"five"`, and therefore is _greater than_ +it. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Names + +It is possible to name each element of a vector. The code chunk below +shows an initial vector without any names, how names are set, and +retrieved. + +```{r} +x <- c(1, 5, 3, 5, 10) +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names +``` + +When a vector has names, it is possible to access elements by their +name, in addition to their index. + +```{r} +x[c(1, 3)] +x[c("A", "C")] +``` + +## Missing data + +As R was designed to analyze datasets, it includes the concept of +missing data (which is uncommon in other programming +languages). Missing data are represented in vectors as `NA`. + +When doing operations on numbers, most functions will return `NA` if +the data you are working with include missing values. This feature +makes it harder to overlook the cases where you are dealing with +missing data. You can add the argument `na.rm = TRUE` to calculate +the result while ignoring the missing values. 
+ +```{r} +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) +``` + +If your data include missing values, you may want to become familiar +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See +below for examples. + +```{r} +## Extract those elements which are not missing values. +heights[!is.na(heights)] + +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +na.omit(heights) + +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +1. Using this vector of heights in inches, create a new vector with the NAs removed. + +```{r} +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +``` + +2. Use the function `median()` to calculate the median of the `heights` vector. +3. Use R to figure out how many people in the set are taller than 67 inches. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +heights_no_na <- heights[!is.na(heights)] +## or +heights_no_na <- na.omit(heights) +``` + +```{r, purl=TRUE} +median(heights, na.rm = TRUE) +``` + +```{r, purl=TRUE} +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Generating vectors {#sec:genvec} + +```{r, echo=FALSE} +set.seed(1) +``` + +### Constructors + +There exist some functions to generate vectors of different types. To +generate a vector of numerics, one can use the `numeric()` +constructor, providing the length of the output vector as a +parameter. The values will be initialised with 0. 
+ +```{r, purl=TRUE} +numeric(3) +numeric(10) +``` + +Note that if we ask for a vector of numerics of length 0, we obtain +exactly that: + +```{r, purl=TRUE} +numeric(0) +``` + +There are similar constructors for characters and logicals, named +`character()` and `logical()` respectively. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What are the defaults for character and logical vectors? + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +character(2) ## the empty character +logical(2) ## FALSE +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Replicate elements + +The `rep` function allows you to repeat a value a certain number of +times. If we want to initialise a vector of numerics of length 5 with +the value -1, for example, we could do the following: + +```{r, purl=TRUE} +rep(-1, 5) +``` + +Similarly, to generate a vector populated with missing values, which +is often a good way to start, without setting assumptions on the data +to be collected: + +```{r, purl=TRUE} +rep(NA, 5) +``` + +`rep` can take vectors of any length as input (above, we used vectors +of length 1) and any type. For example, if we want to repeat the +values 1, 2 and 3 five times, we would do the following: + +```{r, purl=TRUE} +rep(c(1, 2, 3), 5) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What if we wanted to repeat the values 1, 2 and 3 five times, but +obtain five 1s, five 2s and five 3s in that order? There are two +possibilities - see `?rep` or `?sort` for help. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rep(c(1, 2, 3), each = 5) +sort(rep(c(1, 2, 3), 5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Sequence generation + +Another very useful function is `seq`, to generate a sequence of +numbers. 
For example, to generate a sequence of integers from 1 to 20 +by steps of 2, one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, by = 2) +``` + +The default value of `by` is 1 and, given that the generation of a +sequence of one value to another with steps of 1 is frequently used, +there's a shortcut: + +```{r, purl=TRUE} +seq(1, 5, 1) +seq(1, 5) ## default by +1:5 +``` + +To generate a sequence of numbers from 1 to 20 of final length of 3, +one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, length.out = 3) +``` + +### Random samples and permutations + +A last group of useful functions are those that generate random +data. The first one, `sample`, generates a random permutation of +another vector. For example, to draw a random order to 10 students +oral exam, I first assign each student a number from 1 to ten (for +instance based on the alphabetic order of their name) and then: + +```{r, purl=TRUE} +sample(1:10) +``` + +Without further arguments, `sample` will return a permutation of all +elements of the vector. If I want a random sample of a certain size, I +would set this value as the second argument. Below, I sample 5 random +letters from the alphabet contained in the pre-defined `letters` vector: + +```{r, purl=TRUE} +sample(letters, 5) +``` + +If I wanted an output larger than the input vector, or being able to +draw some elements multiple times, I would need to set the `replace` +argument to `TRUE`: + +```{r, purl=TRUE} +sample(1:5, 10, replace = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +When trying the functions above out, you will have realised that the +samples are indeed random and that one doesn't get the same +permutation twice. To be able to reproduce these random draws, one can +set the random number generation seed manually with `set.seed()` +before drawing the random sample. + +Test this feature with your neighbour. 
First draw two random +permutations of `1:10` independently and observe that you get +different results. + +Now set the seed with, for example, `set.seed(123)` and repeat the +random draw. Observe that you now get the same random draws. + +Repeat by setting a different seed. + +::::::::::::::: solution + +## Solution + +Different permutations + +```{r, purl=TRUE} +sample(1:10) +sample(1:10) +``` + +Same permutations with seed 123 + +```{r, purl=TRUE} +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) +``` + +A different seed + +```{r, purl=TRUE} +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Drawing samples from a normal distribution + +The last function we are going to see is `rnorm`, that draws a random +sample from a normal distribution. Two normal distributions of means 0 +and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and +_N(100, 5)_, are shown below. + +```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} +par(mfrow = c(1, 2)) +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +``` + +The three arguments, `n`, `mean` and `sd`, define the size of the +sample, and the parameters of the normal distribution, i.e the mean +and its standard deviation. The defaults of the latter are 0 and 1. + +```{r, purl=TRUE} +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) +``` + +Now that we have learned how to write scripts, and the basics of R's +data structures, we are ready to start working with larger data, and +learn about data frames. 
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- How to interact with R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 2fef15526053b5bb28ee37f5c8e8da85909515a4 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:42 +0900 Subject: [PATCH 014/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 921 ++++++++++++++++++++++ 1 file changed, 921 insertions(+) create mode 100644 locale/ja/episodes/23-starting-with-r.Rmd diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd new file mode 100644 index 000000000..981b7aa59 --- /dev/null +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -0,0 +1,921 @@ +--- +source: Rmd +title: Introduction to R +teaching: 60 +exercises: 60 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define the following terms as they relate to R: object, assign, call, function, arguments, options. +- Assign values to objects in R. +- Learn how to _name_ objects. +- Use comments to annotate scripts. +- Solve simple arithmetic operations in R. +- Call functions and use arguments to change their default options. +- Inspect the content of vectors and manipulate their content. +- Subset and extract values from vectors. +- Analyze vectors with missing data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First commands in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating objects in R + +You can get output from R simply by typing math in the console: + +```{r, purl=TRUE} +3 + 5 +12 / 7 +``` + +However, to do useful and interesting things, we need to assign _values_ to +_objects_. To create an object, we need to give it a name followed by the +assignment operator `<-`, and the value we want to give it: + +```{r, purl=TRUE} +weight_kg <- 55 +``` + +`<-` is the assignment operator.
It assigns values on the right to +objects on the left. So, after executing `x <- 3`, the value of `x` is +`3`. The arrow can be read as 3 **goes into** `x`. For historical +reasons, you can also use `=` for assignments, but not in every +context. Because of the +[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +in syntax, it is good practice to always use `<-` for assignments. + +In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +same in a Mac. + +### Naming variables + +Objects can be given any name such as `x`, `current_temperature`, or +`subject_id`. You want your object names to be explicit and not too +long. They cannot start with a number (`2x` is not valid, but `x2` +is). R is case sensitive (e.g., `weight_kg` is different from +`Weight_kg`). There are some names that cannot be used because they +are the names of fundamental functions in R (e.g., `if`, `else`, +`for`, see +[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +for a complete list). In general, even if it's allowed, it's best to +not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, +`weights`). If in doubt, check the help to see if the name is already +in use. It's also best to avoid dots (`.`) within an object name as in +`my.dataset`. There are many functions in R with dots in their names +for historical reasons, but because dots have a special meaning in R +(for methods) and other programming languages, it's best to avoid +them. It is also recommended to use nouns for object names, and verbs +for function names. It's important to be consistent in the styling of +your code (where you put spaces, how you name objects, etc.). 
Using a +consistent coding style makes your code easier to read for your +future self and your collaborators. In R, some popular style guides +are [Google's](https://google.github.io/styleguide/Rguide.xml), the +[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +style +guide. The +tidyverse's is very comprehensive and may seem overwhelming at +first. You can install the +[**`lintr`**](https://github.com/jimhester/lintr) package to +automatically check for issues in the styling of your code. + +> **Objects vs. variables**: What are known as `objects` in `R` are +> known as `variables` in many other programming languages. Depending +> on the context, `object` and `variable` can have drastically +> different meanings. However, in this lesson, the two words are used +> synonymously. For more information +> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) + +When assigning a value to an object, R does not print anything. You +can force R to print the value by using parentheses or by typing the +object name: + +```{r, purl=TRUE} +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parentheses around the assignment prints the value of `weight_kg` +weight_kg # and so does typing the name of the object +``` + +Now that R has `weight_kg` in memory, we can do arithmetic with it. For +instance, we may want to convert this weight into pounds (weight in pounds is approximately 2.2 times the weight in kg): + +```{r, purl=TRUE} +2.2 * weight_kg +``` + +We can also change an object's value by assigning it a new one: + +```{r, purl=TRUE} +weight_kg <- 57.5 +2.2 * weight_kg +``` + +Assigning a value to one object does not change the values of +other objects. For example, let's store the animal's weight in pounds in a new +object, `weight_lb`: + +```{r, purl=TRUE} +weight_lb <- 2.2 * weight_kg +``` + +and then change `weight_kg` to 100.
+ +```{r} +weight_kg <- 100 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What do you think is the current content of the object `weight_lb`? +126.5 or 220? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Comments + +The comment character in R is `#`. Anything to the right of a `#` in a +script will be ignored by R. Comments are useful for leaving notes and +explanations in your scripts. + +RStudio makes it easy to comment or uncomment a paragraph: after +selecting the lines you want to comment, press +<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> at the same time. If +you only want to comment out one line, you can put the cursor at any +location of that line (i.e. no need to select the whole line), then +press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +What are the values after each statement in the following? + +```{r, purl=TRUE} +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Functions and their arguments + +Functions are "canned scripts" that automate more complicated sets of commands +including operations, assignments, etc. Many functions are predefined, or can be +made available by importing R _packages_ (more on that later). A function +usually gets one or more inputs called _arguments_. Functions often (but not +always) return a _value_. A typical example would be the function `sqrt()`. The +input (the argument) must be a number, and the return value (in fact, the +output) is the square root of that number. Executing a function ('running it') +is called _calling_ the function.
An example of a function call is: + +```{r, eval=FALSE, purl=FALSE} +b <- sqrt(a) +``` + +Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function +calculates the square root, and returns the value which is then assigned to +the object `b`. This function is very simple, because it takes just one argument. + +The return 'value' of a function need not be numerical (like that of `sqrt()`), +and it also does not need to be a single item: it can be a set of things, or +even a dataset. We'll see that when we read data files into R. + +Arguments can be anything, not only numbers or filenames, but also other +objects. Exactly what each argument means differs per function, and must be +looked up in the documentation (see below). Some functions take arguments which +may either be specified by the user, or, if left out, take on a _default_ value: +these are called _options_. Options are typically used to alter the way the +function operates, such as whether it ignores 'bad values', or what symbol to +use in a plot. However, if you want something specific, you can specify a value +of your choice which will be used instead of the default. + +Let's try a function that can take multiple arguments: `round()`. + +```{r, results="show", purl=TRUE} +round(3.14159) +``` + +Here, we've called `round()` with just one argument, `3.14159`, and it has +returned the value `3`. That's because the default is to round to the nearest +whole number. If we want more digits we can see how to do that by getting +information about the `round` function. We can use `args(round)` or look at the +help for this function using `?round`. + +```{r, results="show", purl=TRUE} +args(round) +``` + +```{r, eval=FALSE, purl=TRUE} +?round +``` + +We see that if we want a different number of digits, we can +type `digits=2` or however many we want. 
+ +```{r, results="show", purl=TRUE} +round(3.14159, digits = 2) +``` + +If you provide the arguments in the exact same order as they are defined, you +don't have to name them: + +```{r, results="show", purl=TRUE} +round(3.14159, 2) +``` + +And if you do name the arguments, you can switch their order: + +```{r, results="show", purl=TRUE} +round(digits = 2, x = 3.14159) +``` + +It's good practice to put the non-optional arguments (like the number you're +rounding) first in your function call, and to specify the names of all optional +arguments. If you don't, someone reading your code might have to look up the +definition of a function with unfamiliar arguments to understand what you're +doing. By specifying the names of the arguments, you are also safeguarding +against possible future changes in the function interface, which may +potentially add new arguments in between the existing ones. + +## Vectors and data types + +A vector is the most common and basic data type in R, and is pretty much +the workhorse of R. A vector is composed of a series of values, such as +numbers or characters. We can assign a series of values to a vector using +the `c()` function. For example, we can create a vector of animal weights and assign +it to a new object `weight_g`: + +```{r, purl=TRUE} +weight_g <- c(50, 60, 65, 82) +weight_g +``` + +A vector can also contain characters: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein") +molecules +``` + +The quotes around "dna", "rna", etc. are essential here. Without the +quotes R will assume there are objects called `dna`, `rna` and +`protein`. As these objects don't exist in R's memory, there will be +an error message. + +There are many functions that allow you to inspect the content of a +vector. `length()` tells you how many elements are in a particular vector: + +```{r, purl=TRUE} +length(weight_g) +length(molecules) +``` + +An important feature of a vector is that all of the elements are the +same type of data.
The function `class()` indicates the class (the +type of element) of an object: + +```{r, purl=TRUE} +class(weight_g) +class(molecules) +``` + +The function `str()` provides an overview of the structure of an +object and its elements. It is a useful function when working with +large and complex objects: + +```{r, purl=TRUE} +str(weight_g) +str(molecules) +``` + +You can use the `c()` function to add other elements to your vector: + +```{r} +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g +``` + +In the first line, we take the original vector `weight_g`, add the +value `90` to the end of it, and save the result back into +`weight_g`. Then we add the value `30` to the beginning, again saving +the result back into `weight_g`. + +We can do this over and over again to grow a vector, or assemble a +dataset. As we program, this may be useful to add results that we are +collecting or calculating. + +An **atomic vector** is the simplest R **data type** and is a linear +vector of a single type. Above, we saw 2 of the 6 main **atomic +vector** types that R uses: `"character"` and `"numeric"` (or +`"double"`). These are the basic building blocks that all R objects +are built from. The other 4 **atomic vector** types are: + +- `"logical"` for `TRUE` and `FALSE` (the boolean data type) +- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R + that it's an integer) +- `"complex"` to represent complex numbers with real and imaginary + parts (e.g., `1 + 4i`) and that's all we're going to say about them +- `"raw"` for bitstreams that we won't discuss further + +You can check the type of your vector using the `typeof()` function +and inputting your vector as the argument. + +Vectors are one of the many **data structures** that R uses. Other +important ones are lists (`list`), matrices (`matrix`), data frames +(`data.frame`), factors (`factor`) and arrays (`array`). 
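As a minimal sketch of the `typeof()` check described above, the chunk below builds one small vector for four of the atomic types and asks R for the type of each. The object names (`animal_weights`, `litter_sizes`, `tissue_names`, `is_heavy`) and their values are hypothetical and only serve as illustration:

```{r}
## Hypothetical example vectors, chosen only for illustration.
animal_weights <- c(50, 60, 65, 82)   # numbers without the L suffix are doubles
litter_sizes <- c(2L, 4L, 6L)         # the L suffix creates integers
tissue_names <- c("liver", "brain")   # quoted values are characters
is_heavy <- animal_weights > 60       # a comparison returns logicals

typeof(animal_weights)
typeof(litter_sizes)
typeof(tissue_names)
typeof(is_heavy)

## class() reports "numeric" for a double vector,
## whereas typeof() reports "double".
class(animal_weights)
```

The distinction in the last call is why the text mentions `"numeric"` (or `"double"`) together: both refer to the same kind of vector, seen through `class()` and `typeof()` respectively.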
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We've seen that atomic vectors can be of type character, numeric (or +double), integer, and logical. But what happens if we try to mix +these types in a single vector? + +::::::::::::::: solution + +## Solution + +R implicitly converts them to all be the same type + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What will happen in each of these examples? (hint: use `class()` to +check the data type of your objects and type in their names to see what happens): + +```{r, eval=TRUE} +num_char <- c(1, 2, 3, "a") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") +``` + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +class(num_char) +num_char +class(num_logical) +num_logical +class(char_logical) +char_logical +class(tricky) +tricky +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Why do you think it happens? + +::::::::::::::: solution + +## Solution + +Vectors can be of only one data type. R tries to convert (coerce) +the content of this vector to find a _common denominator_ that +doesn't lose any information. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +How many values in `combined_logical` are `"TRUE"` (as a character) +in the following example: + +```{r, eval=TRUE} +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) +``` + +::::::::::::::: solution + +## Solution + +Only one. There is no memory of past data types, and the coercion +happens the first time the vector is evaluated. 
Therefore, the `TRUE` +in `num_logical` gets converted into a `1` before it gets converted +into `"1"` in `combined_logical`. + +```{r} +combined_logical +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +In R, we call converting objects from one class into another class +_coercion_. These conversions happen according to a hierarchy, +whereby some types get preferentially coerced into other types. Can +you draw a diagram that represents the hierarchy of how these data +types are coerced? + +::::::::::::::: solution + +## Solution + +logical → numeric → character ← logical + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE, eval=FALSE, purl=TRUE} +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. But what happens if we try to mix these types in a single +## vector? + +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) +num_char <- c(1, 2, 3, "a") + +num_logical <- c(1, 2, 3, TRUE) + +char_logical <- c("a", "b", "c", TRUE) + +tricky <- c(1, 2, 3, "4") + +## Why do you think it happens? + +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? +``` + +## Subsetting vectors + +If we want to extract one or several values from a vector, we must +provide one or several indices in square brackets. 
For instance: + +```{r, results="show", purl=TRUE} +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] +``` + +We can also repeat the indices to create an object with more elements +than the original one: + +```{r, results="show", purl=TRUE} +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules +``` + +R indices start at 1. Programming languages like Fortran, MATLAB, +Julia, and R start counting at 1, because that's what human beings +typically do. Languages in the C family (including C++, Java, Perl, +and Python) count from 0 because that's simpler for computers to do. + +Finally, it is also possible to get all the elements of a vector +except some specified elements using negative indices: + +```{r} +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones +``` + +## Conditional subsetting + +Another common way of subsetting is by using a logical vector. `TRUE` will +select the element with the same index, while `FALSE` will not: + +```{r, purl=TRUE} +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +``` + +Typically, these logical vectors are not typed by hand, but are the +output of other functions or logical tests. For instance, if you +wanted to select only the values above 50: + +```{r, purl=TRUE} +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] +``` + +You can combine multiple tests using `&` (both conditions are true, +AND) or `|` (at least one of the conditions is true, OR): + +```{r, results="show", purl=TRUE} +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] +``` + +Here, `<` stands for "less than", `>` for "greater than", `>=` for +"greater than or equal to", and `==` for "equal to". 
The double equal +sign `==` is a test for numerical equality between the left and right +hand sides, and should not be confused with the single `=` sign, which +performs variable assignment (similar to `<-`). + +A common task is to search for certain strings in a vector. One could +use the "or" operator `|` to test for equality to multiple values, but +this can quickly become tedious. The function `%in%` allows you to +test if any of the elements of a search vector are found: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein", "peptide") +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") +molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you figure out why `"four" > "five"` returns `TRUE`? + +::::::::::::::: solution + +## Solution + +```{r} +"four" > "five" +``` + +When using `>` or `<` on strings, R compares their alphabetical order. +Here `"four"` comes after `"five"`, and therefore is _greater than_ +it. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Names + +It is possible to name each element of a vector. The code chunk below +shows an initial vector without any names, how names are set, and +retrieved. + +```{r} +x <- c(1, 5, 3, 5, 10) +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names +``` + +When a vector has names, it is possible to access elements by their +name, in addition to their index. + +```{r} +x[c(1, 3)] +x[c("A", "C")] +``` + +## Missing data + +As R was designed to analyze datasets, it includes the concept of +missing data (which is uncommon in other programming +languages). Missing data are represented in vectors as `NA`. + +When doing operations on numbers, most functions will return `NA` if +the data you are working with include missing values. 
This feature +makes it harder to overlook the cases where you are dealing with +missing data. For many functions, such as `mean()` and `max()`, you can add the argument `na.rm = TRUE` to calculate +the result while ignoring the missing values. + +```{r} +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) +``` + +If your data include missing values, you may want to become familiar +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See +below for examples. + +```{r} +## Extract those elements which are not missing values. +heights[!is.na(heights)] + +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +na.omit(heights) + +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +1. Using this vector of heights in inches, create a new vector with the NAs removed. + +```{r} +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +``` + +2. Use the function `median()` to calculate the median of the `heights` vector. +3. Use R to figure out how many people in the set are taller than 67 inches. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +heights_no_na <- heights[!is.na(heights)] +## or +heights_no_na <- na.omit(heights) +``` + +```{r, purl=TRUE} +median(heights, na.rm = TRUE) +``` + +```{r, purl=TRUE} +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Generating vectors {#sec:genvec} + +```{r, echo=FALSE} +set.seed(1) +``` + +### Constructors + +There exist some functions to generate vectors of different types.
To +generate a vector of numerics, one can use the `numeric()` +constructor, providing the length of the output vector as its +argument. The values will be initialised with 0. + +```{r, purl=TRUE} +numeric(3) +numeric(10) +``` + +Note that if we ask for a vector of numerics of length 0, we obtain +exactly that: + +```{r, purl=TRUE} +numeric(0) +``` + +There are similar constructors for characters and logicals, named +`character()` and `logical()` respectively. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What are the defaults for character and logical vectors? + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +character(2) ## the empty string "" +logical(2) ## FALSE +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Replicate elements + +The `rep` function allows you to repeat a value a certain number of +times. If we want to initialise a vector of numerics of length 5 with +the value -1, for example, we could do the following: + +```{r, purl=TRUE} +rep(-1, 5) +``` + +Similarly, we can generate a vector populated with missing values, which +is often a good way to start without making assumptions about the data +to be collected: + +```{r, purl=TRUE} +rep(NA, 5) +``` + +`rep` can take vectors of any length as input (above, we used vectors +of length 1) and any type. For example, if we want to repeat the +values 1, 2 and 3 five times, we would do the following: + +```{r, purl=TRUE} +rep(c(1, 2, 3), 5) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What if we wanted to repeat the values 1, 2 and 3 five times, but +obtain five 1s, five 2s and five 3s in that order? There are two +possibilities - see `?rep` or `?sort` for help.
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rep(c(1, 2, 3), each = 5) +sort(rep(c(1, 2, 3), 5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Sequence generation + +Another very useful function is `seq`, which generates a sequence of +numbers. For example, to generate a sequence of integers from 1 to 20 +in steps of 2, one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, by = 2) +``` + +The default value of `by` is 1 and, given that generating a +sequence from one value to another in steps of 1 is so common, +there's a shortcut: + +```{r, purl=TRUE} +seq(1, 5, 1) +seq(1, 5) ## default by +1:5 +``` + +To generate a sequence of numbers from 1 to 20 with a final length of 3, +one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, length.out = 3) +``` + +### Random samples and permutations + +A last group of useful functions are those that generate random +data. The first one, `sample`, generates a random permutation of +another vector. For example, to draw a random order for the oral exams +of 10 students, I first assign each student a number from 1 to 10 (for +instance based on the alphabetical order of their names) and then: + +```{r, purl=TRUE} +sample(1:10) +``` + +Without further arguments, `sample` will return a permutation of all +elements of the vector. If I want a random sample of a certain size, I +would set this value as the second argument.
Below, I sample 5 random +letters from the alphabet contained in the pre-defined `letters` vector: + +```{r, purl=TRUE} +sample(letters, 5) +``` + +If I wanted an output larger than the input vector, or wanted to be able to +draw some elements multiple times, I would need to set the `replace` +argument to `TRUE`: + +```{r, purl=TRUE} +sample(1:5, 10, replace = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +When trying out the functions above, you will have realised that the +samples are indeed random and that one generally doesn't get the same +permutation twice. To be able to reproduce these random draws, one can +set the random number generation seed manually with `set.seed()` +before drawing the random sample. + +Test this feature with your neighbour. First draw two random +permutations of `1:10` independently and observe that you get +different results. + +Now set the seed with, for example, `set.seed(123)` and repeat the +random draw. Observe that you now get the same random draws. + +Repeat by setting a different seed. + +::::::::::::::: solution + +## Solution + +Different permutations + +```{r, purl=TRUE} +sample(1:10) +sample(1:10) +``` + +Same permutations with seed 123 + +```{r, purl=TRUE} +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) +``` + +A different seed + +```{r, purl=TRUE} +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Drawing samples from a normal distribution + +The last function we are going to see is `rnorm`, which draws a random +sample from a normal distribution. Two normal distributions with means 0 +and 100 and standard deviations 1 and 5, denoted _N(0, 1)_ and +_N(100, 5)_, are shown below.
+ +```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} +par(mfrow = c(1, 2)) +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +``` + +The three arguments, `n`, `mean` and `sd`, define the size of the +sample and the parameters of the normal distribution, i.e. its mean +and standard deviation. The defaults for `mean` and `sd` are 0 and 1, respectively. + +```{r, purl=TRUE} +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) +``` + +Now that we have learned how to write scripts, and the basics of R's +data structures, we are ready to start working with larger data, and +learn about data frames. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- How to interact with R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 50a554f0bf91c038a647e4155ea19c2d39254777 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:43 +0900 Subject: [PATCH 015/334] New translations 23-starting-with-r.md (Portuguese) --- locale/pt/episodes/23-starting-with-r.Rmd | 921 ++++++++++++++++++++++ 1 file changed, 921 insertions(+) create mode 100644 locale/pt/episodes/23-starting-with-r.Rmd diff --git a/locale/pt/episodes/23-starting-with-r.Rmd b/locale/pt/episodes/23-starting-with-r.Rmd new file mode 100644 index 000000000..47ac62388 --- /dev/null +++ b/locale/pt/episodes/23-starting-with-r.Rmd @@ -0,0 +1,921 @@ +--- +source: Rmd +title: Introduction to R +teaching: 60 +exercises: 60 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define the following terms as they relate to R: object, assign, call, function, arguments, options. +- Assign values to objects in R. +- Learn how to _name_ objects. +- Use comments to annotate scripts. +- Solve simple arithmetic operations in R. +- Call functions and use arguments to change their default options.
+- Inspect the content of vectors and manipulate their content. +- Subset and extract values from vectors. +- Analyze vectors with missing data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First commands in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating objects in R + +You can get output from R simply by typing math in the console: + +```{r, purl=TRUE} +3 + 5 +12 / 7 +``` + +However, to do useful and interesting things, we need to assign _values_ to +_objects_. To create an object, we need to give it a name followed by the +assignment operator `<-`, and the value we want to give it: + +```{r, purl=TRUE} +weight_kg <- 55 +``` + +`<-` is the assignment operator. It assigns values on the right to +objects on the left. So, after executing `x <- 3`, the value of `x` is +`3`. The arrow can be read as 3 **goes into** `x`. For historical +reasons, you can also use `=` for assignments, but not in every +context. Because of the +[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +in syntax, it is good practice to always use `<-` for assignments. + +In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +same in a Mac. + +### Naming variables + +Objects can be given any name such as `x`, `current_temperature`, or +`subject_id`. You want your object names to be explicit and not too +long. They cannot start with a number (`2x` is not valid, but `x2` +is). R is case sensitive (e.g., `weight_kg` is different from +`Weight_kg`). 
There are some names that cannot be used because they +are the names of fundamental functions in R (e.g., `if`, `else`, +`for`, see +[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +for a complete list). In general, even if it's allowed, it's best to +not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, +`weights`). If in doubt, check the help to see if the name is already +in use. It's also best to avoid dots (`.`) within an object name as in +`my.dataset`. There are many functions in R with dots in their names +for historical reasons, but because dots have a special meaning in R +(for methods) and other programming languages, it's best to avoid +them. It is also recommended to use nouns for object names, and verbs +for function names. It's important to be consistent in the styling of +your code (where you put spaces, how you name objects, etc.). Using a +consistent coding style makes your code clearer to read for your +future self and your collaborators. In R, some popular style guides +are [Google's](https://google.github.io/styleguide/Rguide.xml), the +[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +style +guide. The +tidyverse's is very comprehensive and may seem overwhelming at +first. You can install the +[**`lintr`**](https://github.com/jimhester/lintr) package to +automatically check for issues in the styling of your code. + +> **Objects vs. variables**: What are known as `objects` in `R` are +> known as `variables` in many other programming languages. Depending +> on the context, `object` and `variable` can have drastically +> different meanings. However, in this lesson, the two words are used +> synonymously. For more information +> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) + +When assigning a value to an object, R does not print anything. 
You +can force R to print the value by using parentheses or by typing the +object name: + +```{r, purl=TRUE} +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object +``` + +Now that R has `weight_kg` in memory, we can do arithmetic with it. For +instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): + +```{r, purl=TRUE} +2.2 * weight_kg +``` + +We can also change an object's value by assigning it a new one: + +```{r, purl=TRUE} +weight_kg <- 57.5 +2.2 * weight_kg +``` + +Note that assigning a value to one object does not change the values of +other objects. For example, let's store the animal's weight in pounds in a new +object, `weight_lb`: + +```{r, purl=TRUE} +weight_lb <- 2.2 * weight_kg +``` + +and then change `weight_kg` to 100. + +```{r} +weight_kg <- 100 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What do you think is the current content of the object `weight_lb`? +126.5 or 220? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Comments + +The comment character in R is `#`: anything to the right of a `#` in a +script will be ignored by R. It is useful to leave notes and +explanations in your scripts. + +RStudio makes it easy to comment or uncomment a paragraph: after +selecting the lines you want to comment, press +<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> at the same time. If +you only want to comment out one line, you can put the cursor at any +location of that line (i.e. no need to select the whole line), then +press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +What are the values after each statement in the following? + +```{r, purl=TRUE} +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass?
+age <- age - 20 # age? +mass_index <- mass/age # mass_index? +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Functions and their arguments + +Functions are "canned scripts" that automate more complicated sets of commands +including operations, assignments, etc. Many functions are predefined, or can be +made available by importing R _packages_ (more on that later). A function +usually gets one or more inputs called _arguments_. Functions often (but not +always) return a _value_. A typical example would be the function `sqrt()`. The +input (the argument) must be a number, and the return value (in fact, the +output) is the square root of that number. Executing a function ('running it') +is called _calling_ the function. An example of a function call is: + +```{r, eval=FALSE, purl=FALSE} +b <- sqrt(a) +``` + +Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function +calculates the square root, and returns the value which is then assigned to +the object `b`. This function is very simple, because it takes just one argument. + +The return 'value' of a function need not be numerical (like that of `sqrt()`), +and it also does not need to be a single item: it can be a set of things, or +even a dataset. We'll see that when we read data files into R. + +Arguments can be anything, not only numbers or filenames, but also other +objects. Exactly what each argument means differs per function, and must be +looked up in the documentation (see below). Some functions take arguments which +may either be specified by the user, or, if left out, take on a _default_ value: +these are called _options_. Options are typically used to alter the way the +function operates, such as whether it ignores 'bad values', or what symbol to +use in a plot. However, if you want something specific, you can specify a value +of your choice which will be used instead of the default. + +Let's try a function that can take multiple arguments: `round()`.
+ +```{r, results="show", purl=TRUE} +round(3.14159) +``` + +Here, we've called `round()` with just one argument, `3.14159`, and it has +returned the value `3`. That's because the default is to round to the nearest +whole number. If we want more digits we can see how to do that by getting +information about the `round` function. We can use `args(round)` or look at the +help for this function using `?round`. + +```{r, results="show", purl=TRUE} +args(round) +``` + +```{r, eval=FALSE, purl=TRUE} +?round +``` + +We see that if we want a different number of digits, we can +type `digits=2` or however many we want. + +```{r, results="show", purl=TRUE} +round(3.14159, digits = 2) +``` + +If you provide the arguments in the exact same order as they are defined you +don't have to name them: + +```{r, results="show", purl=TRUE} +round(3.14159, 2) +``` + +And if you do name the arguments, you can switch their order: + +```{r, results="show", purl=TRUE} +round(digits = 2, x = 3.14159) +``` + +It's good practice to put the non-optional arguments (like the number you're +rounding) first in your function call, and to specify the names of all optional +arguments. If you don't, someone reading your code might have to look up the +definition of a function with unfamiliar arguments to understand what you're +doing. By specifying the name of the arguments you are also safeguarding +against possible future changes in the function interface, which may +potentially add new arguments in between the existing ones. + +## Vectors and data types + +A vector is the most common and basic data type in R, and is pretty much +the workhorse of R. A vector is composed of a series of values, such as +numbers or characters. We can assign a series of values to a vector using +the `c()` function.
For example, we can create a vector of animal weights and assign +it to a new object `weight_g`: + +```{r, purl=TRUE} +weight_g <- c(50, 60, 65, 82) +weight_g +``` + +A vector can also contain characters: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein") +molecules +``` + +The quotes around "dna", "rna", etc. are essential here. Without the +quotes, R will assume there are objects called `dna`, `rna` and +`protein`. As these objects don't exist in R's memory, there will be +an error message. + +There are many functions that allow you to inspect the content of a +vector. `length()` tells you how many elements are in a particular vector: + +```{r, purl=TRUE} +length(weight_g) +length(molecules) +``` + +An important feature of a vector is that all of the elements are the +same type of data. The function `class()` indicates the class (the +type of element) of an object: + +```{r, purl=TRUE} +class(weight_g) +class(molecules) +``` + +The function `str()` provides an overview of the structure of an +object and its elements. It is a useful function when working with +large and complex objects: + +```{r, purl=TRUE} +str(weight_g) +str(molecules) +``` + +You can use the `c()` function to add other elements to your vector: + +```{r} +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g +``` + +In the first line, we take the original vector `weight_g`, add the +value `90` to the end of it, and save the result back into +`weight_g`. Then we add the value `30` to the beginning, again saving +the result back into `weight_g`. + +We can do this over and over again to grow a vector, or assemble a +dataset. As we program, this may be useful to add results that we are +collecting or calculating. + +An **atomic vector** is the simplest R **data type** and is a linear +vector of a single type.
Above, we saw 2 of the 6 main **atomic +vector** types that R uses: `"character"` and `"numeric"` (or +`"double"`). These are the basic building blocks that all R objects +are built from. The other 4 **atomic vector** types are: + +- `"logical"` for `TRUE` and `FALSE` (the boolean data type) +- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R + that it's an integer) +- `"complex"` to represent complex numbers with real and imaginary + parts (e.g., `1 + 4i`) and that's all we're going to say about them +- `"raw"` for bitstreams that we won't discuss further + +You can check the type of your vector using the `typeof()` function +and inputting your vector as the argument. + +Vectors are one of the many **data structures** that R uses. Other +important ones are lists (`list`), matrices (`matrix`), data frames +(`data.frame`), factors (`factor`) and arrays (`array`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We've seen that atomic vectors can be of type character, numeric (or +double), integer, and logical. But what happens if we try to mix +these types in a single vector? + +::::::::::::::: solution + +## Solution + +R implicitly converts them to all be the same type + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What will happen in each of these examples? 
(hint: use `class()` to +check the data type of your objects and type in their names to see what happens): + +```{r, eval=TRUE} +num_char <- c(1, 2, 3, "a") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") +``` + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +class(num_char) +num_char +class(num_logical) +num_logical +class(char_logical) +char_logical +class(tricky) +tricky +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Why do you think it happens? + +::::::::::::::: solution + +## Solution + +Vectors can be of only one data type. R tries to convert (coerce) +the content of this vector to find a _common denominator_ that +doesn't lose any information. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +How many values in `combined_logical` are `"TRUE"` (as a character) +in the following example: + +```{r, eval=TRUE} +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) +``` + +::::::::::::::: solution + +## Solution + +Only one. There is no memory of past data types, and the coercion +happens the first time the vector is evaluated. Therefore, the `TRUE` +in `num_logical` gets converted into a `1` before it gets converted +into `"1"` in `combined_logical`. + +```{r} +combined_logical +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +In R, we call converting objects from one class into another class +_coercion_. These conversions happen according to a hierarchy, +whereby some types get preferentially coerced into other types. 
Can +you draw a diagram that represents the hierarchy of how these data +types are coerced? + +::::::::::::::: solution + +## Solution + +logical → numeric → character ← logical + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE, eval=FALSE, purl=TRUE} +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. But what happens if we try to mix these types in a single +## vector? + +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) +num_char <- c(1, 2, 3, "a") + +num_logical <- c(1, 2, 3, TRUE) + +char_logical <- c("a", "b", "c", TRUE) + +tricky <- c(1, 2, 3, "4") + +## Why do you think it happens? + +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? +``` + +## Subsetting vectors + +If we want to extract one or several values from a vector, we must +provide one or several indices in square brackets. For instance: + +```{r, results="show", purl=TRUE} +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] +``` + +We can also repeat the indices to create an object with more elements +than the original one: + +```{r, results="show", purl=TRUE} +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules +``` + +R indices start at 1. Programming languages like Fortran, MATLAB, +Julia, and R start counting at 1, because that's what human beings +typically do. Languages in the C family (including C++, Java, Perl, +and Python) count from 0 because that's simpler for computers to do. 
+ +Finally, it is also possible to get all the elements of a vector +except some specified elements using negative indices: + +```{r} +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones +``` + +## Conditional subsetting + +Another common way of subsetting is by using a logical vector. `TRUE` will +select the element with the same index, while `FALSE` will not: + +```{r, purl=TRUE} +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +``` + +Typically, these logical vectors are not typed by hand, but are the +output of other functions or logical tests. For instance, if you +wanted to select only the values above 50: + +```{r, purl=TRUE} +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] +``` + +You can combine multiple tests using `&` (both conditions are true, +AND) or `|` (at least one of the conditions is true, OR): + +```{r, results="show", purl=TRUE} +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] +``` + +Here, `<` stands for "less than", `>` for "greater than", `>=` for +"greater than or equal to", and `==` for "equal to". The double equal +sign `==` is a test for numerical equality between the left and right +hand sides, and should not be confused with the single `=` sign, which +performs variable assignment (similar to `<-`). + +A common task is to search for certain strings in a vector. One could +use the "or" operator `|` to test for equality to multiple values, but +this can quickly become tedious. 
The function `%in%` allows you to +test if any of the elements of a search vector are found: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein", "peptide") +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") +molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you figure out why `"four" > "five"` returns `TRUE`? + +::::::::::::::: solution + +## Solution + +```{r} +"four" > "five" +``` + +When using `>` or `<` on strings, R compares their alphabetical order. +Here `"four"` comes after `"five"`, and therefore is _greater than_ +it. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Names + +It is possible to name each element of a vector. The code chunk below +shows an initial vector without any names, how names are set, and +retrieved. + +```{r} +x <- c(1, 5, 3, 5, 10) +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names +``` + +When a vector has names, it is possible to access elements by their +name, in addition to their index. + +```{r} +x[c(1, 3)] +x[c("A", "C")] +``` + +## Missing data + +As R was designed to analyze datasets, it includes the concept of +missing data (which is uncommon in other programming +languages). Missing data are represented in vectors as `NA`. + +When doing operations on numbers, most functions will return `NA` if +the data you are working with include missing values. This feature +makes it harder to overlook the cases where you are dealing with +missing data. You can add the argument `na.rm = TRUE` to calculate +the result while ignoring the missing values. 
+ +```{r} +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) +``` + +If your data include missing values, you may want to become familiar +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See +below for examples. + +```{r} +## Extract those elements which are not missing values. +heights[!is.na(heights)] + +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +na.omit(heights) + +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +1. Using this vector of heights in inches, create a new vector with the NAs removed. + +```{r} +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +``` + +2. Use the function `median()` to calculate the median of the `heights` vector. +3. Use R to figure out how many people in the set are taller than 67 inches. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +heights_no_na <- heights[!is.na(heights)] +## or +heights_no_na <- na.omit(heights) +``` + +```{r, purl=TRUE} +median(heights, na.rm = TRUE) +``` + +```{r, purl=TRUE} +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Generating vectors {#sec:genvec} + +```{r, echo=FALSE} +set.seed(1) +``` + +### Constructors + +There exist functions to generate vectors of different types. To +generate a vector of numerics, one can use the `numeric()` +constructor, providing the length of the output vector as a +parameter. The values will be initialised with 0.
+ +```{r, purl=TRUE} +numeric(3) +numeric(10) +``` + +Note that if we ask for a vector of numerics of length 0, we obtain +exactly that: + +```{r, purl=TRUE} +numeric(0) +``` + +There are similar constructors for characters and logicals, named +`character()` and `logical()` respectively. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What are the defaults for character and logical vectors? + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +character(2) ## the empty character +logical(2) ## FALSE +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Replicate elements + +The `rep` function allows you to repeat a value a certain number of +times. If we want to initialise a vector of numerics of length 5 with +the value -1, for example, we could do the following: + +```{r, purl=TRUE} +rep(-1, 5) +``` + +Similarly, to generate a vector populated with missing values, which +is often a good way to start, without making assumptions about the data +to be collected: + +```{r, purl=TRUE} +rep(NA, 5) +``` + +`rep` can take vectors of any length as input (above, we used vectors +of length 1) and any type. For example, if we want to repeat the +values 1, 2 and 3 five times, we would do the following: + +```{r, purl=TRUE} +rep(c(1, 2, 3), 5) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What if we wanted to repeat the values 1, 2 and 3 five times, but +obtain five 1s, five 2s and five 3s in that order? There are two +possibilities - see `?rep` or `?sort` for help. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rep(c(1, 2, 3), each = 5) +sort(rep(c(1, 2, 3), 5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Sequence generation + +Another very useful function is `seq`, to generate a sequence of +numbers.
For example, to generate a sequence of integers from 1 to 20 +by steps of 2, one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, by = 2) +``` + +The default value of `by` is 1 and, given that the generation of a +sequence from one value to another with steps of 1 is frequently used, +there's a shortcut: + +```{r, purl=TRUE} +seq(1, 5, 1) +seq(1, 5) ## default by +1:5 +``` + +To generate a sequence of numbers from 1 to 20 with a final length of 3, +one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, length.out = 3) +``` + +### Random samples and permutations + +A last group of useful functions are those that generate random +data. The first one, `sample`, generates a random permutation of +another vector. For example, to draw a random order for the oral exams of 10 +students, I first assign each student a number from 1 to 10 (for +instance based on the alphabetical order of their names) and then: + +```{r, purl=TRUE} +sample(1:10) +``` + +Without further arguments, `sample` will return a permutation of all +elements of the vector. If I want a random sample of a certain size, I +would set this value as the second argument. Below, I sample 5 random +letters from the alphabet contained in the pre-defined `letters` vector: + +```{r, purl=TRUE} +sample(letters, 5) +``` + +If I wanted an output larger than the input vector, or wanted to be able to +draw some elements multiple times, I would need to set the `replace` +argument to `TRUE`: + +```{r, purl=TRUE} +sample(1:5, 10, replace = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +When trying the functions above out, you will have realised that the +samples are indeed random and that one doesn't get the same +permutation twice. To be able to reproduce these random draws, one can +set the random number generation seed manually with `set.seed()` +before drawing the random sample. + +Test this feature with your neighbour.
First draw two random +permutations of `1:10` independently and observe that you get +different results. + +Now set the seed with, for example, `set.seed(123)` and repeat the +random draw. Observe that you now get the same random draws. + +Repeat by setting a different seed. + +::::::::::::::: solution + +## Solution + +Different permutations + +```{r, purl=TRUE} +sample(1:10) +sample(1:10) +``` + +Same permutations with seed 123 + +```{r, purl=TRUE} +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) +``` + +A different seed + +```{r, purl=TRUE} +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Drawing samples from a normal distribution + +The last function we are going to see is `rnorm`, which draws a random +sample from a normal distribution. Two normal distributions of means 0 +and 100 and standard deviations 1 and 5, denoted _N(0, 1)_ and +_N(100, 5)_, are shown below. + +```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} +par(mfrow = c(1, 2)) +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +``` + +The three arguments, `n`, `mean` and `sd`, define the size of the +sample, and the parameters of the normal distribution, i.e., the mean +and standard deviation. The defaults of the latter two are 0 and 1. + +```{r, purl=TRUE} +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) +``` + +Now that we have learned how to write scripts, and the basics of R's +data structures, we are ready to start working with larger data, and +learn about data frames.
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- How to interact with R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 5b56574f60d13c07dc8da8a4cfa43fdf44314c2c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:45 +0900 Subject: [PATCH 016/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 921 ++++++++++++++++++++++ 1 file changed, 921 insertions(+) create mode 100644 locale/zh/episodes/23-starting-with-r.Rmd diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd new file mode 100644 index 000000000..47ac62388 --- /dev/null +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -0,0 +1,921 @@ +--- +source: Rmd +title: Introduction to R +teaching: 60 +exercises: 60 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define the following terms as they relate to R: object, assign, call, function, arguments, options. +- Assign values to objects in R. +- Learn how to _name_ objects +- Use comments to inform script. +- Solve simple arithmetic operations in R. +- Call functions and use arguments to change their default options. +- Inspect the content of vectors and manipulate their content. +- Subset and extract values from vectors. +- Analyze vectors with missing data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First commands in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating objects in R + +You can get output from R simply by typing math in the console: + +```{r, purl=TRUE} +3 + 5 +12 / 7 +``` + +However, to do useful and interesting things, we need to assign _values_ to +_objects_. 
To create an object, we need to give it a name followed by the +assignment operator `<-`, and the value we want to give it: + +```{r, purl=TRUE} +weight_kg <- 55 +``` + +`<-` is the assignment operator. It assigns values on the right to +objects on the left. So, after executing `x <- 3`, the value of `x` is +`3`. The arrow can be read as 3 **goes into** `x`. For historical +reasons, you can also use `=` for assignments, but not in every +context. Because of the +[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +in syntax, it is good practice to always use `<-` for assignments. + +In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +same in a Mac. + +### Naming variables + +Objects can be given any name such as `x`, `current_temperature`, or +`subject_id`. You want your object names to be explicit and not too +long. They cannot start with a number (`2x` is not valid, but `x2` +is). R is case sensitive (e.g., `weight_kg` is different from +`Weight_kg`). There are some names that cannot be used because they +are the names of fundamental functions in R (e.g., `if`, `else`, +`for`, see +[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +for a complete list). In general, even if it's allowed, it's best to +not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, +`weights`). If in doubt, check the help to see if the name is already +in use. It's also best to avoid dots (`.`) within an object name as in +`my.dataset`. There are many functions in R with dots in their names +for historical reasons, but because dots have a special meaning in R +(for methods) and other programming languages, it's best to avoid +them. 
It is also recommended to use nouns for object names, and verbs +for function names. It's important to be consistent in the styling of +your code (where you put spaces, how you name objects, etc.). Using a +consistent coding style makes your code clearer to read for your +future self and your collaborators. In R, some popular style guides +are [Google's](https://google.github.io/styleguide/Rguide.xml), the +[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +style +guide. The +tidyverse's is very comprehensive and may seem overwhelming at +first. You can install the +[**`lintr`**](https://github.com/jimhester/lintr) package to +automatically check for issues in the styling of your code. + +> **Objects vs. variables**: What are known as `objects` in `R` are +> known as `variables` in many other programming languages. Depending +> on the context, `object` and `variable` can have drastically +> different meanings. However, in this lesson, the two words are used +> synonymously. For more information +> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) + +When assigning a value to an object, R does not print anything. You +can force R to print the value by using parentheses or by typing the +object name: + +```{r, purl=TRUE} +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object +``` + +Now that R has `weight_kg` in memory, we can do arithmetic with it.
For +instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): + +```{r, purl=TRUE} +2.2 * weight_kg +``` + +We can also change an object's value by assigning it a new one: + +```{r, purl=TRUE} +weight_kg <- 57.5 +2.2 * weight_kg +``` + +Note that assigning a value to one object does not change the values of +other objects. For example, let's store the animal's weight in pounds in a new +object, `weight_lb`: + +```{r, purl=TRUE} +weight_lb <- 2.2 * weight_kg +``` + +and then change `weight_kg` to 100. + +```{r} +weight_kg <- 100 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What do you think is the current content of the object `weight_lb`? +126.5 or 220? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Comments + +The comment character in R is `#`: anything to the right of a `#` in a +script will be ignored by R. It is useful to leave notes and +explanations in your scripts. + +RStudio makes it easy to comment or uncomment a paragraph: after +selecting the lines you want to comment, press +<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> at the same time. If +you only want to comment out one line, you can put the cursor at any +location of that line (i.e. no need to select the whole line), then +press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +What are the values after each statement in the following? + +```{r, purl=TRUE} +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Functions and their arguments + +Functions are "canned scripts" that automate more complicated sets of commands +including operations, assignments, etc. Many functions are predefined, or can be +made available by importing R _packages_ (more on that later).
A function +usually gets one or more inputs called _arguments_. Functions often (but not +always) return a _value_. A typical example would be the function `sqrt()`. The +input (the argument) must be a number, and the return value (in fact, the +output) is the square root of that number. Executing a function ('running it') +is called _calling_ the function. An example of a function call is: + +```{r, eval=FALSE, purl=FALSE} +b <- sqrt(a) +``` + +Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function +calculates the square root, and returns the value which is then assigned to +the object `b`. This function is very simple, because it takes just one argument. + +The return 'value' of a function need not be numerical (like that of `sqrt()`), +and it also does not need to be a single item: it can be a set of things, or +even a dataset. We'll see that when we read data files into R. + +Arguments can be anything, not only numbers or filenames, but also other +objects. Exactly what each argument means differs per function, and must be +looked up in the documentation (see below). Some functions take arguments which +may either be specified by the user, or, if left out, take on a _default_ value: +these are called _options_. Options are typically used to alter the way the +function operates, such as whether it ignores 'bad values', or what symbol to +use in a plot. However, if you want something specific, you can specify a value +of your choice which will be used instead of the default. + +Let's try a function that can take multiple arguments: `round()`. + +```{r, results="show", purl=TRUE} +round(3.14159) +``` + +Here, we've called `round()` with just one argument, `3.14159`, and it has +returned the value `3`. That's because the default is to round to the nearest +whole number. If we want more digits we can see how to do that by getting +information about the `round` function. 
We can use `args(round)` or look at the +help for this function using `?round`. + +```{r, results="show", purl=TRUE} +args(round) +``` + +```{r, eval=FALSE, purl=TRUE} +?round +``` + +We see that if we want a different number of digits, we can +type `digits=2` or however many we want. + +```{r, results="show", purl=TRUE} +round(3.14159, digits = 2) +``` + +If you provide the arguments in the exact same order as they are defined, you +don't have to name them: + +```{r, results="show", purl=TRUE} +round(3.14159, 2) +``` + +And if you do name the arguments, you can switch their order: + +```{r, results="show", purl=TRUE} +round(digits = 2, x = 3.14159) +``` + +It's good practice to put the non-optional arguments (like the number you're +rounding) first in your function call, and to specify the names of all optional +arguments. If you don't, someone reading your code might have to look up the +definition of a function with unfamiliar arguments to understand what you're +doing. By specifying the name of the arguments you are also safeguarding +against possible future changes in the function interface, which may +add new arguments in between the existing ones. + +## Vectors and data types + +A vector is the most common and basic data type in R, and is pretty much +the workhorse of R. A vector is composed of a series of values, such as +numbers or characters. We can assign a series of values to a vector using +the `c()` function. For example, we can create a vector of animal weights and assign +it to a new object `weight_g`: + +```{r, purl=TRUE} +weight_g <- c(50, 60, 65, 82) +weight_g +``` + +A vector can also contain characters: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein") +molecules +``` + +The quotes around "dna", "rna", etc. are essential here. Without the +quotes, R will assume there are objects called `dna`, `rna` and +`protein`. As these objects don't exist in R's memory, there will be +an error message.
+ +There are many functions that allow you to inspect the content of a +vector. `length()` tells you how many elements are in a particular vector: + +```{r, purl=TRUE} +length(weight_g) +length(molecules) +``` + +An important feature of a vector is that all of the elements are the +same type of data. The function `class()` indicates the class (the +type of element) of an object: + +```{r, purl=TRUE} +class(weight_g) +class(molecules) +``` + +The function `str()` provides an overview of the structure of an +object and its elements. It is a useful function when working with +large and complex objects: + +```{r, purl=TRUE} +str(weight_g) +str(molecules) +``` + +You can use the `c()` function to add other elements to your vector: + +```{r} +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g +``` + +In the first line, we take the original vector `weight_g`, add the +value `90` to the end of it, and save the result back into +`weight_g`. Then we add the value `30` to the beginning, again saving +the result back into `weight_g`. + +We can do this over and over again to grow a vector, or assemble a +dataset. As we program, this may be useful to add results that we are +collecting or calculating. + +An **atomic vector** is the simplest R **data type** and is a linear +vector of a single type. Above, we saw 2 of the 6 main **atomic +vector** types that R uses: `"character"` and `"numeric"` (or +`"double"`). These are the basic building blocks that all R objects +are built from.
The other 4 **atomic vector** types are: + +- `"logical"` for `TRUE` and `FALSE` (the boolean data type) +- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R + that it's an integer) +- `"complex"` to represent complex numbers with real and imaginary + parts (e.g., `1 + 4i`) and that's all we're going to say about them +- `"raw"` for bitstreams that we won't discuss further + +You can check the type of your vector using the `typeof()` function +and inputting your vector as the argument. + +Vectors are one of the many **data structures** that R uses. Other +important ones are lists (`list`), matrices (`matrix`), data frames +(`data.frame`), factors (`factor`) and arrays (`array`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We've seen that atomic vectors can be of type character, numeric (or +double), integer, and logical. But what happens if we try to mix +these types in a single vector? + +::::::::::::::: solution + +## Solution + +R implicitly converts them to all be the same type + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What will happen in each of these examples? (hint: use `class()` to +check the data type of your objects and type in their names to see what happens): + +```{r, eval=TRUE} +num_char <- c(1, 2, 3, "a") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") +``` + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +class(num_char) +num_char +class(num_logical) +num_logical +class(char_logical) +char_logical +class(tricky) +tricky +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Why do you think it happens? + +::::::::::::::: solution + +## Solution + +Vectors can be of only one data type. 
R tries to convert (coerce) +the content of this vector to find a _common denominator_ that +doesn't lose any information. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +How many values in `combined_logical` are `"TRUE"` (as a character) +in the following example: + +```{r, eval=TRUE} +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) +``` + +::::::::::::::: solution + +## Solution + +Only one. There is no memory of past data types, and the coercion +happens the first time the vector is evaluated. Therefore, the `TRUE` +in `num_logical` gets converted into a `1` before it gets converted +into `"1"` in `combined_logical`. + +```{r} +combined_logical +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +In R, we call converting objects from one class into another class +_coercion_. These conversions happen according to a hierarchy, +whereby some types get preferentially coerced into other types. Can +you draw a diagram that represents the hierarchy of how these data +types are coerced? + +::::::::::::::: solution + +## Solution + +logical → numeric → character ← logical + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE, eval=FALSE, purl=TRUE} +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. But what happens if we try to mix these types in a single +## vector? + +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) +num_char <- c(1, 2, 3, "a") + +num_logical <- c(1, 2, 3, TRUE) + +char_logical <- c("a", "b", "c", TRUE) + +tricky <- c(1, 2, 3, "4") + +## Why do you think it happens? 
+ +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? +``` + +## Subsetting vectors + +If we want to extract one or several values from a vector, we must +provide one or several indices in square brackets. For instance: + +```{r, results="show", purl=TRUE} +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] +``` + +We can also repeat the indices to create an object with more elements +than the original one: + +```{r, results="show", purl=TRUE} +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules +``` + +R indices start at 1. Programming languages like Fortran, MATLAB, +Julia, and R start counting at 1, because that's what human beings +typically do. Languages in the C family (including C++, Java, Perl, +and Python) count from 0 because that's simpler for computers to do. + +Finally, it is also possible to get all the elements of a vector +except some specified elements using negative indices: + +```{r} +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones +``` + +## Conditional subsetting + +Another common way of subsetting is by using a logical vector. `TRUE` will +select the element with the same index, while `FALSE` will not: + +```{r, purl=TRUE} +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +``` + +Typically, these logical vectors are not typed by hand, but are the +output of other functions or logical tests. 
For instance, if you +wanted to select only the values above 50: + +```{r, purl=TRUE} +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] +``` + +You can combine multiple tests using `&` (both conditions are true, +AND) or `|` (at least one of the conditions is true, OR): + +```{r, results="show", purl=TRUE} +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] +``` + +Here, `<` stands for "less than", `>` for "greater than", `>=` for +"greater than or equal to", and `==` for "equal to". The double equal +sign `==` is a test for equality between the left and right +hand sides, and should not be confused with the single `=` sign, which +performs variable assignment (similar to `<-`). + +A common task is to search for certain strings in a vector. One could +use the "or" operator `|` to test for equality to multiple values, but +this can quickly become tedious. The operator `%in%` tests, for each +element of a vector, whether it is found in a search vector: + +```{r, purl=TRUE} +molecules <- c("dna", "rna", "protein", "peptide") +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") +molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you figure out why `"four" > "five"` returns `TRUE`? + +::::::::::::::: solution + +## Solution + +```{r} +"four" > "five" +``` + +When using `>` or `<` on strings, R compares their alphabetical order. +Here `"four"` comes after `"five"`, and therefore is _greater than_ +it. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Names + +It is possible to name each element of a vector.
The code chunk below +shows an initial vector without any names, how names are set, and +retrieved. + +```{r} +x <- c(1, 5, 3, 5, 10) +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names +``` + +When a vector has names, it is possible to access elements by their +name, in addition to their index. + +```{r} +x[c(1, 3)] +x[c("A", "C")] +``` + +## Missing data + +As R was designed to analyze datasets, it includes the concept of +missing data (which is uncommon in other programming +languages). Missing data are represented in vectors as `NA`. + +When doing operations on numbers, most functions will return `NA` if +the data you are working with include missing values. This feature +makes it harder to overlook the cases where you are dealing with +missing data. You can add the argument `na.rm = TRUE` to calculate +the result while ignoring the missing values. + +```{r} +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) +``` + +If your data include missing values, you may want to become familiar +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See +below for examples. + +```{r} +## Extract those elements which are not missing values. +heights[!is.na(heights)] + +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +na.omit(heights) + +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +1. Using this vector of heights in inches, create a new vector with the NAs removed. + +```{r} +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +``` + +2. Use the function `median()` to calculate the median of the `heights` vector. +3. 
Use R to figure out how many people in the set are taller than 67 inches. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +heights_no_na <- heights[!is.na(heights)] +## or +heights_no_na <- na.omit(heights) +``` + +```{r, purl=TRUE} +median(heights, na.rm = TRUE) +``` + +```{r, purl=TRUE} +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Generating vectors {#sec:genvec} + +```{r, echo=FALSE} +set.seed(1) +``` + +### Constructors + +There exist some functions to generate vectors of different types. To +generate a vector of numerics, one can use the `numeric()` +constructor, providing the length of the output vector as a +parameter. The values will be initialised with 0. + +```{r, purl=TRUE} +numeric(3) +numeric(10) +``` + +Note that if we ask for a vector of numerics of length 0, we obtain +exactly that: + +```{r, purl=TRUE} +numeric(0) +``` + +There are similar constructors for characters and logicals, named +`character()` and `logical()` respectively. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What are the defaults for character and logical vectors? + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +character(2) ## the empty character +logical(2) ## FALSE +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Replicate elements + +The `rep` function allows repeating a value a certain number of +times. If we want to initialise a vector of numerics of length 5 with +the value -1, for example, we could do the following: + +```{r, purl=TRUE} +rep(-1, 5) +``` + +Similarly, to generate a vector populated with missing values, which +is often a good way to start, without setting assumptions on the data +to be collected: + +```{r, purl=TRUE} +rep(NA, 5) +``` + +`rep` can take vectors of any length as input (above, we used vectors +of length 1) and any type.
For example, if we want to repeat the +values 1, 2 and 3 five times, we would do the following: + +```{r, purl=TRUE} +rep(c(1, 2, 3), 5) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +What if we wanted to repeat the values 1, 2 and 3 five times, but +obtain five 1s, five 2s and five 3s in that order? There are two +possibilities - see `?rep` or `?sort` for help. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rep(c(1, 2, 3), each = 5) +sort(rep(c(1, 2, 3), 5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Sequence generation + +Another very useful function is `seq`, which generates a sequence of +numbers. For example, to generate a sequence of integers from 1 to 20 +by steps of 2, one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, by = 2) +``` + +The default value of `by` is 1 and, given that the generation of a +sequence from one value to another with steps of 1 is frequently used, +there's a shortcut: + +```{r, purl=TRUE} +seq(1, 5, 1) +seq(1, 5) ## default by +1:5 +``` + +To generate a sequence of numbers from 1 to 20 with a final length of 3, +one would use: + +```{r, purl=TRUE} +seq(from = 1, to = 20, length.out = 3) +``` + +### Random samples and permutations + +A last group of useful functions are those that generate random +data. The first one, `sample`, generates a random permutation of +another vector. For example, to draw a random order for the oral +exams of 10 students, I first assign each student a number from 1 to 10 (for +instance, based on the alphabetical order of their names) and then: + +```{r, purl=TRUE} +sample(1:10) +``` + +Without further arguments, `sample` will return a permutation of all +elements of the vector. If I want a random sample of a certain size, I +would set this value as the second argument.
Below, I sample 5 random +letters from the alphabet contained in the pre-defined `letters` vector: + +```{r, purl=TRUE} +sample(letters, 5) +``` + +If I wanted an output larger than the input vector, or wanted to be able to +draw some elements multiple times, I would need to set the `replace` +argument to `TRUE`: + +```{r, purl=TRUE} +sample(1:5, 10, replace = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +When trying out the functions above, you will have realised that the +samples are indeed random and that one generally doesn't get the same +permutation twice. To be able to reproduce these random draws, one can +set the random number generation seed manually with `set.seed()` +before drawing the random sample. + +Test this feature with your neighbour. First draw two random +permutations of `1:10` independently and observe that you get +different results. + +Now set the seed with, for example, `set.seed(123)` and repeat the +random draw. Observe that you now get the same random draws. + +Repeat by setting a different seed. + +::::::::::::::: solution + +## Solution + +Different permutations + +```{r, purl=TRUE} +sample(1:10) +sample(1:10) +``` + +Same permutations with seed 123 + +```{r, purl=TRUE} +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) +``` + +A different seed + +```{r, purl=TRUE} +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Drawing samples from a normal distribution + +The last function we are going to see is `rnorm`, which draws a random +sample from a normal distribution. Two normal distributions of means 0 +and 100 and standard deviations 1 and 5, denoted _N(0, 1)_ and +_N(100, 5)_, are shown below.
+ +```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} +par(mfrow = c(1, 2)) +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +``` + +The three arguments, `n`, `mean` and `sd`, define the size of the +sample, and the parameters of the normal distribution, i.e the mean +and its standard deviation. The defaults of the latter are 0 and 1. + +```{r, purl=TRUE} +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) +``` + +Now that we have learned how to write scripts, and the basics of R's +data structures, we are ready to start working with larger data, and +learn about data frames. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- How to interact with R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 772dc7d794551c0ee7886a37b13d7af25582ccf9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:47 +0900 Subject: [PATCH 017/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 781 +++++++++++++++++++ 1 file changed, 781 insertions(+) create mode 100644 locale/fr/episodes/25-starting-with-data.Rmd diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd new file mode 100644 index 000000000..bc29da0cd --- /dev/null +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -0,0 +1,781 @@ +--- +source: Rmd +title: Starting with data +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe what a `data.frame` is. +- Load external data from a .csv file into a data frame. +- Summarize the contents of a data frame. +- Describe what a factor is. +- Convert between strings and factors. +- Reorder and rename factors. +- Format dates. +- Export and save data. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First data analysis in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Presentation of the gene expression data + +We are going to use part of the data published by Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_. The goal of the study was to determine the effect of an +upper-respiratory infection on changes in RNA transcription occurring +in the cerebellum and spinal cord post infection. Gender matched eight +week old C57BL/6 mice were inoculated with saline or with Influenza A by +intranasal route and transcriptomic changes in the cerebellum and +spinal cord tissues were evaluated by RNA-seq at days 0 +(non-infected), 4 and 8. + +The dataset is stored as a comma-separated values (CSV) file. Each row +holds information for a single RNA expression measurement, and the first eleven +columns represent: + +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | +| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | +| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | +| mouse | The mouse unique identifier. 
| + +We are going to use the R function `download.file()` to download the +CSV file that contains the gene expression data, and we will use +`read.csv()` to load into memory the content of the CSV file as an +object of class `data.frame`. Inside the `download.file` command, the +first entry is a character string with the source URL. This source URL +downloads a CSV file from a GitHub repository. The text after the +comma (`"data/rnaseq.csv"`) is the destination of the file on your +local machine. You'll need to have a folder on your machine called +`"data"` where you'll download the file. So this command downloads the +remote file, names it `"rnaseq.csv"` and adds it to a preexisting +folder named `"data"`. + +```{r, eval=TRUE} +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +You are now ready to load the data: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +This statement doesn't produce any output because, as you might +recall, assignments don't display anything. If we want to check that +our data has been loaded, we can see the contents of the data frame by +typing its name: + +```{r, eval=FALSE} +rna +``` + +Wow\... that was a lot of output. At least it means the data loaded +properly. Let's check the top (the first 6 lines) of this data frame +using the function `head()`: + +```{r, purl=TRUE} +head(rna) +## Try also +## View(rna) +``` + +**Note** + +`read.csv()` assumes that fields are delineated by commas, however, in +several countries, the comma is used as a decimal separator and the +semicolon (;) is used as a field delineator. If you want to read in +this type of files in R, you can use the `read.csv2()` function. It +behaves exactly like `read.csv()` but uses different parameters for +the decimal and the field separators. If you are working with another +format, they can be both specified by the user. 
Check out the help for +`read.csv()` by typing `?read.csv` to learn more. There is also the +`read.delim()` function for reading tab separated data files. It is important to +note that all of these functions are actually wrapper functions for +the main `read.table()` function with different arguments. As such, +the data above could have also been loaded by using `read.table()` +with the separation argument as `,`. The code is as follows: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) +``` + +The header argument has to be set to TRUE to be able to read the +headers as by default `read.table()` has the header argument set to +FALSE. + +## What are data frames? + +Data frames are the _de facto_ data structure for most tabular data, +and what we use for statistics and plotting. + +A data frame can be created by hand, but most commonly they are +generated by the functions `read.csv()` or `read.table()`; in other +words, when importing spreadsheets from your hard drive (or the web). + +A data frame is the representation of data in the format of a table +where the columns are vectors that all have the same length. Because +columns are vectors, each column must contain a single type of data +(e.g., characters, integers, factors). For example, here is a figure +depicting a data frame comprising a numeric, a character, and a +logical vector. + +![](./fig/data-frame.svg) + +We can see this when inspecting the <b>str</b>ucture of a data frame +with the function `str()`: + +```{r} +str(rna) +``` + +## Inspecting `data.frame` Objects + +We already saw how the functions `head()` and `str()` can be useful to +check the content and the structure of a data frame. Here is a +non-exhaustive list of functions to get a sense of the +content/structure of the data. Let's try them out! 
+ +**Size**: + +- `dim(rna)` - returns a vector with the number of rows as the first + element, and the number of columns as the second element (the + **dim**ensions of the object). +- `nrow(rna)` - returns the number of rows. +- `ncol(rna)` - returns the number of columns. + +**Content**: + +- `head(rna)` - shows the first 6 rows. +- `tail(rna)` - shows the last 6 rows. + +**Names**: + +- `names(rna)` - returns the column names (synonym of `colnames()` for + `data.frame` objects). +- `rownames(rna)` - returns the row names. + +**Summary**: + +- `str(rna)` - structure of the object and information about the + class, length and content of each column. +- `summary(rna)` - summary statistics for each column. + +Note: most of these functions are "generic", they can be used on other types of +objects besides `data.frame`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Based on the output of `str(rna)`, can you answer the following +questions? + +- What is the class of the object `rna`? +- How many rows and how many columns are in this object? + +::::::::::::::: solution + +## Solution + +- class: data frame +- how many rows: 66465, how many columns: 11 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Indexing and subsetting data frames + +Our `rna` data frame has rows and columns (it has 2 dimensions); if we +want to extract some specific data from it, we need to specify the +"coordinates" we want. Row numbers come first, followed by +column numbers. However, note that different ways of specifying these +coordinates lead to results with different classes. 
+ +```{r, eval=FALSE, purl=TRUE} +# first element in the first column of the data frame (as a vector) +rna[1, 1] +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) +rna[, 1] +# first column of the data frame (as a data.frame) +rna[1] +# first three elements in the 7th column (as a vector) +rna[1:3, 7] +# the 3rd row of the data frame (as a data.frame) +rna[3, ] +# equivalent to head_rna <- head(rna) +head_rna <- rna[1:6, ] +head_rna +``` + +`:` is a special function that creates numeric vectors of integers in +increasing or decreasing order, test `1:10` and `10:1` for +instance. See section @ref(sec:genvec) for details. + +You can also exclude certain indices of a data frame using the "`-`" sign: + +```{r, eval=FALSE, purl=TRUE} +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) +``` + +Data frames can be subsetted by calling indices (as shown previously), +but also by calling their column names directly: + +```{r, eval=FALSE, purl=TRUE} +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector +``` + +In RStudio, you can use the autocompletion feature to get the full and +correct names of the columns. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. Create a `data.frame` (`rna_200`) containing only the data in + row 200 of the `rna` dataset. + +2. Notice how `nrow()` gave you the number of rows in a `data.frame`? + +- Use that number to pull out just that last row in the initial + `rna` data frame. + +- Compare that with what you see as the last row using `tail()` to + make sure it's meeting expectations. + +- Pull out that last row using `nrow()` instead of the row number. + +- Create a new data frame (`rna_last`) from that last row. + +3. Use `nrow()` to extract the row that is in the middle of the + `rna` dataframe. 
Store the content of this row in an object + named `rna_middle`. + +4. Combine `nrow()` with the `-` notation above to reproduce the + behavior of `head(rna)`, keeping just the first through 6th + rows of the rna dataset. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +## 1. +rna_200 <- rna[200, ] +## 2. +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) +rna_last <- rna[n_rows, ] +## 3. +rna_middle <- rna[n_rows / 2, ] +## 4. +rna_head <- rna[-(7:n_rows), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Factors + +Factors represent **categorical data**. They are stored as integers +associated with labels and they can be ordered or unordered. While +factors look (and often behave) like character vectors, they are +actually treated as integer vectors by R. So you need to be very +careful when treating them as strings. + +Once created, factors can only contain a pre-defined set of values, +known as _levels_. By default, R always sorts levels in alphabetical +order. For instance, if you have a factor with 2 levels: + +```{r, purl=TRUE} +sex <- factor(c("male", "female", "female", "male", "female")) +``` + +R will assign `1` to the level `"female"` and `2` to the level +`"male"` (because `f` comes before `m`, even though the first element +in this vector is `"male"`). You can see this by using the function +`levels()` and you can find the number of levels using `nlevels()`: + +```{r, purl=TRUE} +levels(sex) +nlevels(sex) +``` + +Sometimes, the order of the factors does not matter, other times you +might want to specify the order because it is meaningful (e.g., "low", +"medium", "high"), it improves your visualization, or it is required +by a particular type of analysis. 
Here, one way to reorder our levels +in the `sex` vector would be: + +```{r, purl=TRUE} +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering +``` + +In R's memory, these factors are represented by integers (1, 2, 3), +but are more informative than integers because factors are self +describing: `"female"`, `"male"` is more descriptive than `1`, +`2`. Which one is "male"? You wouldn't be able to tell just from the +integer data. Factors, on the other hand, have this information built-in. +It is particularly helpful when there are many levels (like the +gene biotype in our example dataset). + +When your data is stored as a factor, you can use the `plot()` +function to get a quick glance at the number of observations +represented by each factor level. Let's look at the number of males +and females in our data. + +```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} +plot(sex) +``` + +### Converting to character + +If you need to convert a factor to a character vector, you use +`as.character(x)`. + +```{r, purl=TRUE} +as.character(sex) +``` + +<!-- ### Numeric factors --> + +<!-- Converting factors where the levels appear as numbers (such as --> + +<!-- concentration levels, or years) to a numeric vector is a little --> + +<!-- trickier. The `as.numeric()` function returns the index values of the --> + +<!-- factor, not its levels, so it will result in an entirely new (and --> + +<!-- unwanted in this case) set of numbers. One method to avoid this is to --> + +<!-- convert factors to characters, and then to numbers. Another method is --> + +<!-- to use the `levels()` function. Compare: --> + +<!-- ```{r} --> + +<!-- year_fct <- factor(c(1990, 1983, 1977, 1998, 1990)) --> + +<!-- as.numeric(year_fct) ## Wrong! And there is no warning... --> + +<!-- as.numeric(as.character(year_fct)) ## Works... --> + +<!-- as.numeric(levels(year_fct))[year_fct] ## The recommended way. 
--> + +<!-- ``` --> + +<!-- Notice that in the `levels()` approach, three important steps occur: --> + +<!-- * We obtain all the factor levels using `levels(year_fct)` --> + +<!-- * We convert these levels to numeric values using `as.numeric(levels(year_fct))` --> + +<!-- * We then access these numeric values using the underlying integers of the --> + +<!-- vector `year_fct` inside the square brackets --> + +### Renaming factors + +If we want to rename the levels of a factor, it is sufficient to assign new +values to its levels: + +```{r, purl=TRUE} +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) +``` + +:::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +- Rename "F" and "M" to "Female" and "Male" respectively. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +levels(sex) +levels(sex) <- c("Male", "Female") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We have seen how data frames are created when using `read.csv()`, but +they can also be created by hand with the `data.frame()` function. +There are a few mistakes in this hand-crafted `data.frame`. Can you +spot and fix them? Don't hesitate to experiment! + +```{r, eval=FALSE} +animal_data <- data.frame( + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), + weight = c(45, 8 1.1, 0.8)) +``` + +::::::::::::::: solution + +## Solution + +- missing quotation marks around the names of the animals +- missing one entry in the "feel" column (probably for one of the furry animals) +- missing one comma in the weight column + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you predict the class for each of the columns in the following +example? + +Check your guesses using `str(country_climate)`: + +- Are they what you expected? Why? Why not?
+ +- Try again by adding `stringsAsFactors = TRUE` after the last + variable when creating the data frame. What is happening now? + `stringsAsFactors` can also be set when reading text-based + spreadsheets into R using `read.csv()`. + +```{r, eval=FALSE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +``` + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +str(country_climate) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The automatic conversion of data type is sometimes a blessing, sometimes an +annoyance. Be aware that it exists, learn the rules, and double check that data +you import in R are of the correct type within your data frame. If not, use it +to your advantage to detect mistakes that might have been introduced during data +entry (a letter in a column that should only contain numbers for instance). + +Learn more in this RStudio +tutorial + +## Matrices + +Before proceeding, now that we have learnt about data frames, let's +recap package installation and learn about a new data type, namely the +`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and +columns. But the major difference is that all cells in a `matrix` must +be of the same type: `numeric`, `character`, `logical`, ... In that +respect, matrices are closer to a `vector` than a `data.frame`. + +The default constructor for a matrix is `matrix`. 
It takes a vector of +values to populate the matrix and the number of rows and/or +columns[^ncol]. The matrix is filled column by column, as illustrated +below. + +```{r mat1, purl=TRUE} +m <- matrix(1:9, ncol = 3, nrow = 3) +m +``` + +[^ncol]: Either the number of rows or the number of columns is enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using the function `installed.packages()`, create a `character` matrix +containing the information about all packages currently installed on +your computer. Explore it. + +::::::::::::::: solution + +## Solution: + +```{r pkg_sln, eval=FALSE, purl=TRUE} +## create the matrix +ip <- installed.packages() +head(ip) +## try also View(ip) +## number of packages +nrow(ip) +## names of all installed packages +rownames(ip) +## type of information we have about each package +colnames(ip) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +It is often useful to create large random data matrices as test +data. The exercise below asks you to create such a matrix with random +data drawn from a normal distribution of mean 0 and standard deviation +1, which can be done with the `rnorm()` function. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Construct a matrix of dimension 1000 by 3 of normally distributed data +(mean 0, standard deviation 1). + +::::::::::::::: solution + +## Solution + +```{r rnormmat_sln, purl=TRUE} +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formatting Dates + +One of the most common issues that new (and experienced!) R users have +is converting date and time information into a variable that is +appropriate and usable during analyses.
+ +### Note on dates in spreadsheet programs + +Dates in spreadsheets are generally stored in a single column. While +this seems the most natural way to record dates, it actually is not +best practice. A spreadsheet application will display the dates in a +seemingly correct way (to a human observer) but how it actually +handles and stores the dates may be problematic. It is often much +safer to store dates with YEAR, MONTH and DAY in separate columns or +as YEAR and DAY-OF-YEAR in separate columns. + +Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... have different (and often incompatible) ways of encoding +dates (even for the same program between versions and operating +systems). Additionally, Excel can turn things that aren't dates into +dates (@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, +OCT4. So if you're avoiding the date format overall, it's easier to +identify these issues. + +The _Dates as data_ section of the Data Carpentry lesson provides additional insights +about pitfalls of dates with spreadsheets. + +We are going to use the `ymd()` function from the package +**`lubridate`** (which belongs to the **`tidyverse`**; learn more +[here](https://www.tidyverse.org/)). **`lubridate`** gets installed +as part of the **`tidyverse`** installation. When you load the +**`tidyverse`** (`library(tidyverse)`), the core packages (the +packages used in most data analyses) get loaded. **`lubridate`** +has only been part of the core tidyverse since tidyverse 2.0.0, so to be safe with older installations, load it +explicitly with `library(lubridate)`. + +Start by loading the required package: + +```{r loadlibridate, message=FALSE, purl=TRUE} +library("lubridate") +``` + +`ymd()` takes a vector representing year, month, and day, and converts +it to a `Date` vector. `Date` is a class of data recognized by R as +being a date and can be manipulated as such.
The function is flexible about its input but, as a best practice, pass it a character +vector formatted as "YYYY-MM-DD". + +Let's create a date object and inspect the structure: + +```{r, purl=TRUE} +my_date <- ymd("2015-01-01") +str(my_date) +``` + +Now let's paste the year, month, and day separately - we get the same result: + +```{r, purl=TRUE} +# sep indicates the character to use to separate each component +my_date <- ymd(paste("2015", "1", "1", sep = "-")) +str(my_date) +``` + +Let's now familiarise ourselves with a typical date manipulation +pipeline. The small data frame below stores dates in separate `year`, +`month` and `day` columns. + +```{r, purl=TRUE} +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x +``` + +Now we apply `ymd()` to the `x` dataset. We first create a +character vector from the `year`, `month`, and `day` columns of `x` +using `paste()`: + +```{r, purl=TRUE} +paste(x$year, x$month, x$day, sep = "-") +``` + +This character vector can be used as the argument for `ymd()`: + +```{r, purl=TRUE} +ymd(paste(x$year, x$month, x$day, sep = "-")) +``` + +The resulting `Date` vector can be added to `x` as a new column called `date`: + +```{r, purl=TRUE} +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'Date' as the class +``` + +Let's make sure everything worked correctly. One way to inspect the +new column is to use `summary()`: + +```{r, purl=TRUE} +summary(x$date) +``` + +Note that `ymd()` expects to have the year, month and day, in that +order. If you have for instance day, month and year, you would need +`dmy()`. + +```{r, purl=TRUE} +dmy(paste(x$day, x$month, x$year, sep = "-")) +``` + +`lubridate` has similar functions for the other orderings of year, month and day.
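
As a brief illustration of those other parsing functions, the sketch below uses `mdy()` and `ydm()`, which read the components in month-day-year and year-day-month order respectively. The date strings and the names `d1`, `d2` and `d3` are made up for this example and are not part of the lesson data.

```r
library("lubridate")

## The same calendar day written in three different orders
d1 <- ymd("1996-02-24")   # year, month, day
d2 <- mdy("02/24/1996")   # month, day, year
d3 <- ydm("1996-24-02")   # year, day, month

## Compare the parsed values: each comparison checks whether two
## spellings refer to the same day
d1 == d2
d1 == d3

## The result is a Date object whichever parser is used
class(d2)
```

Picking the parser that matches the order in which the components were recorded is usually easier than rearranging the strings by hand.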
+ +## Summary of R objects + +So far, we have seen several types of R object varying in the number +of dimensions and whether they could store a single or multiple data +types: + +- **`vector`**: one dimension (they have a length), single type of data. +- **`matrix`**: two dimensions, single type of data. +- **`data.frame`**: two dimensions, one type per column. + +## Lists + +A data type that we haven't seen yet, but that is useful to know and +follows from the summary above, is the list: + +- **`list`**: one dimension, every item can be of a different data + type. + +Below, let's create a list containing a vector of numbers, a vector of characters, +a matrix, a data frame and another list: + +```{r list0, purl=TRUE} +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) +str(l) +``` + +List subsetting is done using `[]` to subset a new sub-list or `[[]]` +to extract a single element of that list (using indices or names, if +the list is named). + +```{r, purl=TRUE} +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 +``` + +## Exporting and saving tabular data {#sec:exportandsave} + +We have seen how to read a text-based spreadsheet into R using the +`read.table` family of functions. To export a `data.frame` to a +text-based spreadsheet, we can use the `write.table` set of functions +(`write.csv`, `write.csv2`, ...). They all take the variable to be +exported and the file to be exported to. For example, to export the +`rna` data to the `my_rna.csv` file in the `data_output` +directory, we would execute: + +```{r, eval=FALSE, purl=TRUE} +write.csv(rna, file = "data_output/my_rna.csv") +``` + +This new CSV file can now be shared with other collaborators who +aren't familiar with R.
Note that even though there are commas in some of +the fields in the `data.frame` (see for example the "product" column), R will +by default surround each field with quotes, and thus we will be able to +read it back into R correctly, despite also using commas as column +separators. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From e7b763daac9f0b07a8961bb55f192b8475d9ad87 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:49 +0900 Subject: [PATCH 018/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 781 +++++++++++++++++++ 1 file changed, 781 insertions(+) create mode 100644 locale/es/episodes/25-starting-with-data.Rmd diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd new file mode 100644 index 000000000..ea2a353ca --- /dev/null +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -0,0 +1,781 @@ +--- +source: Rmd +title: Starting with data +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Describe what a `data.frame` is. +- Load external data from a .csv file into a data frame. +- Summarize the contents of a data frame. +- Describe what a factor is. +- Convert between strings and factors. +- Reorder and rename factors. +- Format dates. +- Export and save data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First data analysis in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Presentation of the gene expression data + +We are going to use part of the data published by Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_. 
The goal of the study was to determine the effect of an +upper-respiratory infection on changes in RNA transcription occurring +in the cerebellum and spinal cord post infection. Gender matched eight +week old C57BL/6 mice were inoculated with saline or with Influenza A by +intranasal route and transcriptomic changes in the cerebellum and +spinal cord tissues were evaluated by RNA-seq at days 0 +(non-infected), 4 and 8. + +The dataset is stored as a comma-separated values (CSV) file. Each row +holds information for a single RNA expression measurement, and the first eleven +columns represent: + +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | +| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | +| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | +| mouse | The mouse unique identifier. | + +We are going to use the R function `download.file()` to download the +CSV file that contains the gene expression data, and we will use +`read.csv()` to load into memory the content of the CSV file as an +object of class `data.frame`. Inside the `download.file` command, the +first entry is a character string with the source URL. This source URL +downloads a CSV file from a GitHub repository. The text after the +comma (`"data/rnaseq.csv"`) is the destination of the file on your +local machine. 
You'll need to have a folder on your machine called +`"data"` where you'll download the file. So this command downloads the +remote file, names it `"rnaseq.csv"` and adds it to a preexisting +folder named `"data"`. + +```{r, eval=TRUE} +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +You are now ready to load the data: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +This statement doesn't produce any output because, as you might +recall, assignments don't display anything. If we want to check that +our data has been loaded, we can see the contents of the data frame by +typing its name: + +```{r, eval=FALSE} +rna +``` + +Wow\... that was a lot of output. At least it means the data loaded +properly. Let's check the top (the first 6 lines) of this data frame +using the function `head()`: + +```{r, purl=TRUE} +head(rna) +## Try also +## View(rna) +``` + +**Note** + +`read.csv()` assumes that fields are delineated by commas, however, in +several countries, the comma is used as a decimal separator and the +semicolon (;) is used as a field delineator. If you want to read in +this type of files in R, you can use the `read.csv2()` function. It +behaves exactly like `read.csv()` but uses different parameters for +the decimal and the field separators. If you are working with another +format, they can be both specified by the user. Check out the help for +`read.csv()` by typing `?read.csv` to learn more. There is also the +`read.delim()` function for reading tab separated data files. It is important to +note that all of these functions are actually wrapper functions for +the main `read.table()` function with different arguments. As such, +the data above could have also been loaded by using `read.table()` +with the separation argument as `,`. 
The code is as follows: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) +``` + +The header argument has to be set to TRUE to be able to read the +headers as by default `read.table()` has the header argument set to +FALSE. + +## What are data frames? + +Data frames are the _de facto_ data structure for most tabular data, +and what we use for statistics and plotting. + +A data frame can be created by hand, but most commonly they are +generated by the functions `read.csv()` or `read.table()`; in other +words, when importing spreadsheets from your hard drive (or the web). + +A data frame is the representation of data in the format of a table +where the columns are vectors that all have the same length. Because +columns are vectors, each column must contain a single type of data +(e.g., characters, integers, factors). For example, here is a figure +depicting a data frame comprising a numeric, a character, and a +logical vector. + +![](./fig/data-frame.svg) + +We can see this when inspecting the <b>str</b>ucture of a data frame +with the function `str()`: + +```{r} +str(rna) +``` + +## Inspecting `data.frame` Objects + +We already saw how the functions `head()` and `str()` can be useful to +check the content and the structure of a data frame. Here is a +non-exhaustive list of functions to get a sense of the +content/structure of the data. Let's try them out! + +**Size**: + +- `dim(rna)` - returns a vector with the number of rows as the first + element, and the number of columns as the second element (the + **dim**ensions of the object). +- `nrow(rna)` - returns the number of rows. +- `ncol(rna)` - returns the number of columns. + +**Content**: + +- `head(rna)` - shows the first 6 rows. +- `tail(rna)` - shows the last 6 rows. + +**Names**: + +- `names(rna)` - returns the column names (synonym of `colnames()` for + `data.frame` objects). +- `rownames(rna)` - returns the row names. 
+ +**Summary**: + +- `str(rna)` - structure of the object and information about the + class, length and content of each column. +- `summary(rna)` - summary statistics for each column. + +Note: most of these functions are "generic", they can be used on other types of +objects besides `data.frame`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Based on the output of `str(rna)`, can you answer the following +questions? + +- What is the class of the object `rna`? +- How many rows and how many columns are in this object? + +::::::::::::::: solution + +## Solution + +- class: data frame +- how many rows: 66465, how many columns: 11 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Indexing and subsetting data frames + +Our `rna` data frame has rows and columns (it has 2 dimensions); if we +want to extract some specific data from it, we need to specify the +"coordinates" we want. Row numbers come first, followed by +column numbers. However, note that different ways of specifying these +coordinates lead to results with different classes. + +```{r, eval=FALSE, purl=TRUE} +# first element in the first column of the data frame (as a vector) +rna[1, 1] +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) +rna[, 1] +# first column of the data frame (as a data.frame) +rna[1] +# first three elements in the 7th column (as a vector) +rna[1:3, 7] +# the 3rd row of the data frame (as a data.frame) +rna[3, ] +# equivalent to head_rna <- head(rna) +head_rna <- rna[1:6, ] +head_rna +``` + +`:` is a special function that creates numeric vectors of integers in +increasing or decreasing order, test `1:10` and `10:1` for +instance. See section @ref(sec:genvec) for details. 
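
To make the behaviour of `:` concrete, here is a small sketch using only base R. The objects `up`, `down` and `df` are made up for illustration and are not part of the lesson data.

```r
up <- 1:10     # increasing sequence
down <- 10:1   # decreasing sequence
up
down

## `:` returns an integer vector, so it can be used directly as an index
class(up)

## A decreasing sequence is the reverse of the increasing one
identical(down, rev(up))

## Indexing a small, made-up data frame with a sequence of row numbers
df <- data.frame(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 0.9))
df[2:4, ]
```

The same pattern is what `rna[1:6, ]` relies on: the sequence `1:6` is built first and then used as the row index.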
+ +You can also exclude certain indices of a data frame using the "`-`" sign: + +```{r, eval=FALSE, purl=TRUE} +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) +``` + +Data frames can be subsetted by calling indices (as shown previously), +but also by calling their column names directly: + +```{r, eval=FALSE, purl=TRUE} +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector +``` + +In RStudio, you can use the autocompletion feature to get the full and +correct names of the columns. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. Create a `data.frame` (`rna_200`) containing only the data in + row 200 of the `rna` dataset. + +2. Notice how `nrow()` gave you the number of rows in a `data.frame`? + +- Use that number to pull out just that last row in the initial + `rna` data frame. + +- Compare that with what you see as the last row using `tail()` to + make sure it's meeting expectations. + +- Pull out that last row using `nrow()` instead of the row number. + +- Create a new data frame (`rna_last`) from that last row. + +3. Use `nrow()` to extract the row that is in the middle of the + `rna` dataframe. Store the content of this row in an object + named `rna_middle`. + +4. Combine `nrow()` with the `-` notation above to reproduce the + behavior of `head(rna)`, keeping just the first through 6th + rows of the rna dataset. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +## 1. +rna_200 <- rna[200, ] +## 2. +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) +rna_last <- rna[n_rows, ] +## 3. +rna_middle <- rna[n_rows / 2, ] +## 4. +rna_head <- rna[-(7:n_rows), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Factors + +Factors represent **categorical data**. 
They are stored as integers +associated with labels and they can be ordered or unordered. While +factors look (and often behave) like character vectors, they are +actually treated as integer vectors by R. So you need to be very +careful when treating them as strings. + +Once created, factors can only contain a pre-defined set of values, +known as _levels_. By default, R always sorts levels in alphabetical +order. For instance, if you have a factor with 2 levels: + +```{r, purl=TRUE} +sex <- factor(c("male", "female", "female", "male", "female")) +``` + +R will assign `1` to the level `"female"` and `2` to the level +`"male"` (because `f` comes before `m`, even though the first element +in this vector is `"male"`). You can see this by using the function +`levels()` and you can find the number of levels using `nlevels()`: + +```{r, purl=TRUE} +levels(sex) +nlevels(sex) +``` + +Sometimes, the order of the factors does not matter, other times you +might want to specify the order because it is meaningful (e.g., "low", +"medium", "high"), it improves your visualization, or it is required +by a particular type of analysis. Here, one way to reorder our levels +in the `sex` vector would be: + +```{r, purl=TRUE} +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering +``` + +In R's memory, these factors are represented by integers (1, 2, 3), +but are more informative than integers because factors are self +describing: `"female"`, `"male"` is more descriptive than `1`, +`2`. Which one is "male"? You wouldn't be able to tell just from the +integer data. Factors, on the other hand, have this information built-in. +It is particularly helpful when there are many levels (like the +gene biotype in our example dataset). + +When your data is stored as a factor, you can use the `plot()` +function to get a quick glance at the number of observations +represented by each factor level. Let's look at the number of males +and females in our data. 
+ +```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} +plot(sex) +``` + +### Converting to character + +If you need to convert a factor to a character vector, use +`as.character(x)`. + +```{r, purl=TRUE} +as.character(sex) +``` + +<!-- ### Numeric factors --> + +<!-- Converting factors where the levels appear as numbers (such as --> + +<!-- concentration levels, or years) to a numeric vector is a little --> + +<!-- trickier. The `as.numeric()` function returns the index values of the --> + +<!-- factor, not its levels, so it will result in an entirely new (and --> + +<!-- unwanted in this case) set of numbers. One method to avoid this is to --> + +<!-- convert factors to characters, and then to numbers. Another method is --> + +<!-- to use the `levels()` function. Compare: --> + +<!-- ```{r} --> + +<!-- year_fct <- factor(c(1990, 1983, 1977, 1998, 1990)) --> + +<!-- as.numeric(year_fct) ## Wrong! And there is no warning... --> + +<!-- as.numeric(as.character(year_fct)) ## Works... --> + +<!-- as.numeric(levels(year_fct))[year_fct] ## The recommended way. --> + +<!-- ``` --> + +<!-- Notice that in the `levels()` approach, three important steps occur: --> + +<!-- * We obtain all the factor levels using `levels(year_fct)` --> + +<!-- * We convert these levels to numeric values using `as.numeric(levels(year_fct))` --> + +<!-- * We then access these numeric values using the underlying integers of the --> + +<!-- vector `year_fct` inside the square brackets --> + +### Renaming factors + +If we want to rename the levels of a factor, it is sufficient to assign new +values to its levels: + +```{r, purl=TRUE} +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) +``` + +:::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +- Rename "F" and "M" to "Female" and "Male" respectively.
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +levels(sex) +levels(sex) <- c("Male", "Female") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We have seen how data frames are created when using `read.csv()`, but +they can also be created by hand with the `data.frame()` function. +There are a few mistakes in this hand-crafted `data.frame`. Can you +spot and fix them? Don't hesitate to experiment! + +```{r, eval=FALSE} +animal_data <- data.frame( + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), + weight = c(45, 8 1.1, 0.8)) +``` + +::::::::::::::: solution + +## Solution + +- missing quotations around the names of the animals +- missing one entry in the "feel" column (probably for one of the furry animals) +- missing one comma in the weight column + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you predict the class for each of the columns in the following +example? + +Check your guesses using `str(country_climate)`: + +- Are they what you expected? Why? Why not? + +- Try again by adding `stringsAsFactors = TRUE` after the last + variable when creating the data frame. What is happening now? + `stringsAsFactors` can also be set when reading text-based + spreadsheets into R using `read.csv()`. 
+ +```{r, eval=FALSE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +``` + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +str(country_climate) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The automatic conversion of data type is sometimes a blessing, sometimes an +annoyance. Be aware that it exists, learn the rules, and double check that data +you import into R are of the correct type within your data frame. If not, use it +to your advantage to detect mistakes that might have been introduced during data +entry (a letter in a column that should only contain numbers for instance). + +Learn more in this RStudio tutorial + +## Matrices + +Before proceeding, now that we have learnt about data frames, let's +recap package installation and learn about a new data type, namely the +`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and +columns. But the major difference is that all cells in a `matrix` must +be of the same type: `numeric`, `character`, `logical`, ... In that +respect, matrices are closer to a `vector` than a `data.frame`. + +The default constructor for a matrix is `matrix`. It takes a vector of +values to populate the matrix and the number of rows and/or +columns[^ncol]. The matrix is filled column by column, as illustrated +below.
+ +```{r mat1, purl=TRUE} +m <- matrix(1:9, ncol = 3, nrow = 3) +m +``` + +[^ncol]: Either the number of rows or the number of columns is enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using the function `installed.packages()`, create a `character` matrix +containing the information about all packages currently installed on +your computer. Explore it. + +::::::::::::::: solution + +## Solution: + +```{r pkg_sln, eval=FALSE, purl=TRUE} +## create the matrix +ip <- installed.packages() +head(ip) +## try also View(ip) +## number of packages +nrow(ip) +## names of all installed packages +rownames(ip) +## type of information we have about each package +colnames(ip) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +It is often useful to create large random data matrices as test +data. The exercise below asks you to create such a matrix with random +data drawn from a normal distribution of mean 0 and standard deviation +1, which can be done with the `rnorm()` function. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Construct a matrix of dimension 1000 by 3 of normally distributed data +(mean 0, standard deviation 1). + +::::::::::::::: solution + +## Solution + +```{r rnormmat_sln, purl=TRUE} +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formatting Dates + +One of the most common issues that new (and experienced!) R users have +is converting date and time information into a variable that is +appropriate and usable during analyses. + +### Note on dates in spreadsheet programs + +Dates in spreadsheets are generally stored in a single column. While +this seems the most natural way to record dates, it actually is not +best practice.
A spreadsheet application will display the dates in a +seemingly correct way (to a human observer) but how it actually +handles and stores the dates may be problematic. It is often much +safer to store dates with YEAR, MONTH and DAY in separate columns or +as YEAR and DAY-OF-YEAR in separate columns. + +Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... have different (and often incompatible) ways of encoding +dates (even for the same program between versions and operating +systems). Additionally, Excel can turn things that aren't dates into +dates +(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, +OCT4. So if you're avoiding the date format overall, it's easier to +identify these issues. + +The Dates as +data +section of the Data Carpentry lesson provides additional insights +about pitfalls of dates with spreadsheets. + +We are going to use the `ymd()` function from the package +**`lubridate`** (which belongs to the **`tidyverse`**; learn more +[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed +as part of the **`tidyverse`** installation. When you load the +**`tidyverse`** (`library(tidyverse)`), the core packages (the +packages used in most data analyses) get loaded. **`lubridate`** +however does not belong to the core tidyverse, so you have to load it +explicitly with `library(lubridate)`. + +Start by loading the required package: + +```{r loadlibridate, message=FALSE, purl=TRUE} +library("lubridate") +``` + +`ymd()` takes a vector representing year, month, and day, and converts +it to a `Date` vector. `Date` is a class of data recognized by R as +being a date and can be manipulated as such. The argument that the +function requires is flexible, but, as a best practice, is a character +vector formatted as "YYYY-MM-DD". 
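As a hedged illustration of that flexibility (a minimal sketch, assuming **`lubridate`** is installed; the input strings and the object names `dates_chr`, `parsed` and `bad` are made up for this example), `ymd()` accepts several common year-month-day spellings and maps them to the same `Date`:

```r
library("lubridate")

## Illustrative inputs: the same day written with different separators
dates_chr <- c("2015-01-01", "2015/01/01", "20150101")
parsed <- ymd(dates_chr)

class(parsed)            ## a Date vector
length(unique(parsed))   ## all three strings describe the same day

## A string that is not a valid calendar date is returned as NA
## (lubridate also emits a warning, silenced here)
bad <- suppressWarnings(ymd("2015-02-30"))
is.na(bad)
```

Sticking to "YYYY-MM-DD" remains the safer habit, since it does not depend on the parser guessing the layout.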
+ +Let's create a date object and inspect the structure: + +```{r, purl=TRUE} +my_date <- ymd("2015-01-01") +str(my_date) +``` + +Now let's paste the year, month, and day separately - we get the same result: + +```{r, purl=TRUE} +# sep indicates the character to use to separate each component +my_date <- ymd(paste("2015", "1", "1", sep = "-")) +str(my_date) +``` + +Let's now familiarise ourselves with a typical date manipulation +pipeline. The small dataset below stores dates in separate `year`, +`month` and `day` columns. + +```{r, purl=TRUE} +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x +``` + +Now we apply `ymd()` to the `x` dataset. We first create a +character vector from the `year`, `month`, and `day` columns of `x` +using `paste()`: + +```{r, purl=TRUE} +paste(x$year, x$month, x$day, sep = "-") +``` + +This character vector can be used as the argument for `ymd()`: + +```{r, purl=TRUE} +ymd(paste(x$year, x$month, x$day, sep = "-")) +``` + +The resulting `Date` vector can be added to `x` as a new column called `date`: + +```{r, purl=TRUE} +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'date' as the class +``` + +Let's make sure everything worked correctly. One way to inspect the +new column is to use `summary()`: + +```{r, purl=TRUE} +summary(x$date) +``` + +Note that `ymd()` expects to have the year, month and day, in that +order. If you have for instance day, month and year, you would need +`dmy()`. + +```{r, purl=TRUE} +dmy(paste(x$day, x$month, x$year, sep = "-")) +``` + +`lubridate` provides a parser for each ordering of year, month and day (`ymd()`, `dmy()`, `mdy()`, ...).
+ +## Summary of R objects + +So far, we have seen several types of R objects that vary in the number +of dimensions and in whether they can store a single or multiple data +types: + +- **`vector`**: one dimension (they have a length), single type of data. +- **`matrix`**: two dimensions, single type of data. +- **`data.frame`**: two dimensions, one type per column. + +## Lists + +A data type that we haven't seen yet, but that is useful to know and +follows from the summary above, is the list: + +- **`list`**: one dimension, every item can be of a different data + type. + +Below, let's create a list containing a vector of numbers, a vector of characters, +a matrix, a data frame and another list: + +```{r list0, purl=TRUE} +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) +str(l) +``` + +List subsetting is done using `[]` to subset a new sub-list or `[[]]` +to extract a single element of that list (using indices or names, if +the list is named). + +```{r, purl=TRUE} +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 +``` + +## Exporting and saving tabular data {#sec:exportandsave} + +We have seen how to read a text-based spreadsheet into R using the +`read.table` family of functions. To export a `data.frame` to a +text-based spreadsheet, we can use the `write.table` family of functions +(`write.csv`, `write.csv2`, ...). They all take the variable to be +exported and the file to be exported to. For example, to export the +`rna` data to the `my_rna.csv` file in the `data_output` +directory, we would execute: + +```{r, eval=FALSE, purl=TRUE} +write.csv(rna, file = "data_output/my_rna.csv") +``` + +This new CSV file can now be shared with other collaborators who +aren't familiar with R.
Note that even though there are commas in some of +the fields in the `data.frame` (see for example the "product" column), R will +by default surround each character field with quotes, and thus we will be able to +read it back into R correctly, despite also using commas as column +separators. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 073954dd0c6f983f8108dcfde7f92a02f0d7563d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:51 +0900 Subject: [PATCH 019/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 781 +++++++++++++++++++ 1 file changed, 781 insertions(+) create mode 100644 locale/ja/episodes/25-starting-with-data.Rmd diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd new file mode 100644 index 000000000..b473a51d9 --- /dev/null +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -0,0 +1,781 @@ +--- +source: Rmd +title: Starting with data +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe what a `data.frame` is. +- Load external data from a .csv file into a data frame. +- Summarize the contents of a data frame. +- Describe what a factor is. +- Convert between strings and factors. +- Reorder and rename factors. +- Format dates. +- Export and save data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First data analysis in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Presentation of the gene expression data + +We are going to use part of the data published by Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_.
The goal of the study was to determine the effect of an +upper-respiratory infection on changes in RNA transcription occurring +in the cerebellum and spinal cord post infection. Gender matched eight +week old C57BL/6 mice were inoculated with saline or with Influenza A by +intranasal route and transcriptomic changes in the cerebellum and +spinal cord tissues were evaluated by RNA-seq at days 0 +(non-infected), 4 and 8. + +The dataset is stored as a comma-separated values (CSV) file. Each row +holds information for a single RNA expression measurement, and the first eleven +columns represent: + +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | +| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | +| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | +| mouse | The mouse unique identifier. | + +We are going to use the R function `download.file()` to download the +CSV file that contains the gene expression data, and we will use +`read.csv()` to load into memory the content of the CSV file as an +object of class `data.frame`. Inside the `download.file` command, the +first entry is a character string with the source URL. This source URL +downloads a CSV file from a GitHub repository. The text after the +comma (`"data/rnaseq.csv"`) is the destination of the file on your +local machine. 
You'll need to have a folder on your machine called +`"data"` where you'll download the file. So this command downloads the +remote file, names it `"rnaseq.csv"` and adds it to a preexisting +folder named `"data"`. + +```{r, eval=TRUE} +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +You are now ready to load the data: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +This statement doesn't produce any output because, as you might +recall, assignments don't display anything. If we want to check that +our data has been loaded, we can see the contents of the data frame by +typing its name: + +```{r, eval=FALSE} +rna +``` + +Wow\... that was a lot of output. At least it means the data loaded +properly. Let's check the top (the first 6 lines) of this data frame +using the function `head()`: + +```{r, purl=TRUE} +head(rna) +## Try also +## View(rna) +``` + +**Note** + +`read.csv()` assumes that fields are delineated by commas, however, in +several countries, the comma is used as a decimal separator and the +semicolon (;) is used as a field delineator. If you want to read in +this type of files in R, you can use the `read.csv2()` function. It +behaves exactly like `read.csv()` but uses different parameters for +the decimal and the field separators. If you are working with another +format, they can be both specified by the user. Check out the help for +`read.csv()` by typing `?read.csv` to learn more. There is also the +`read.delim()` function for reading tab separated data files. It is important to +note that all of these functions are actually wrapper functions for +the main `read.table()` function with different arguments. As such, +the data above could have also been loaded by using `read.table()` +with the separation argument as `,`. 
The code is as follows: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) +``` + +The header argument has to be set to TRUE to be able to read the +headers as by default `read.table()` has the header argument set to +FALSE. + +## What are data frames? + +Data frames are the _de facto_ data structure for most tabular data, +and what we use for statistics and plotting. + +A data frame can be created by hand, but most commonly they are +generated by the functions `read.csv()` or `read.table()`; in other +words, when importing spreadsheets from your hard drive (or the web). + +A data frame is the representation of data in the format of a table +where the columns are vectors that all have the same length. Because +columns are vectors, each column must contain a single type of data +(e.g., characters, integers, factors). For example, here is a figure +depicting a data frame comprising a numeric, a character, and a +logical vector. + +![](./fig/data-frame.svg) + +We can see this when inspecting the <b>str</b>ucture of a data frame +with the function `str()`: + +```{r} +str(rna) +``` + +## Inspecting `data.frame` Objects + +We already saw how the functions `head()` and `str()` can be useful to +check the content and the structure of a data frame. Here is a +non-exhaustive list of functions to get a sense of the +content/structure of the data. Let's try them out! + +**Size**: + +- `dim(rna)` - returns a vector with the number of rows as the first + element, and the number of columns as the second element (the + **dim**ensions of the object). +- `nrow(rna)` - returns the number of rows. +- `ncol(rna)` - returns the number of columns. + +**Content**: + +- `head(rna)` - shows the first 6 rows. +- `tail(rna)` - shows the last 6 rows. + +**Names**: + +- `names(rna)` - returns the column names (synonym of `colnames()` for + `data.frame` objects). +- `rownames(rna)` - returns the row names. 
+ +**Summary**: + +- `str(rna)` - structure of the object and information about the + class, length and content of each column. +- `summary(rna)` - summary statistics for each column. + +Note: most of these functions are "generic", they can be used on other types of +objects besides `data.frame`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Based on the output of `str(rna)`, can you answer the following +questions? + +- What is the class of the object `rna`? +- How many rows and how many columns are in this object? + +::::::::::::::: solution + +## Solution + +- class: data frame +- how many rows: 66465, how many columns: 11 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Indexing and subsetting data frames + +Our `rna` data frame has rows and columns (it has 2 dimensions); if we +want to extract some specific data from it, we need to specify the +"coordinates" we want. Row numbers come first, followed by +column numbers. However, note that different ways of specifying these +coordinates lead to results with different classes. + +```{r, eval=FALSE, purl=TRUE} +# first element in the first column of the data frame (as a vector) +rna[1, 1] +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) +rna[, 1] +# first column of the data frame (as a data.frame) +rna[1] +# first three elements in the 7th column (as a vector) +rna[1:3, 7] +# the 3rd row of the data frame (as a data.frame) +rna[3, ] +# equivalent to head_rna <- head(rna) +head_rna <- rna[1:6, ] +head_rna +``` + +`:` is a special function that creates numeric vectors of integers in +increasing or decreasing order, test `1:10` and `10:1` for +instance. See section @ref(sec:genvec) for details. 
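A short sketch of `:` in base R may help here (the object names `up`, `down` and `n` are chosen for illustration only). It also shows one pitfall: when the upper bound can be 0, `1:n` counts down instead of returning an empty vector, and `seq_len()` is a common alternative.

```r
up <- 1:10      ## increasing integers from 1 to 10
down <- 10:1    ## decreasing integers from 10 to 1
class(up)       ## ":" returns an integer vector here
identical(rev(up), down)

## Pitfall: with n equal to 0, 1:n is c(1, 0), not an empty vector
n <- 0
length(1:n)          ## two elements
length(seq_len(n))   ## zero elements
```

This matters mostly when the bound comes from `nrow()` of a data frame that might be empty.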
+ +You can also exclude certain indices of a data frame using the "`-`" sign: + +```{r, eval=FALSE, purl=TRUE} +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) +``` + +Data frames can be subsetted by calling indices (as shown previously), +but also by calling their column names directly: + +```{r, eval=FALSE, purl=TRUE} +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector +``` + +In RStudio, you can use the autocompletion feature to get the full and +correct names of the columns. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. Create a `data.frame` (`rna_200`) containing only the data in + row 200 of the `rna` dataset. + +2. Notice how `nrow()` gave you the number of rows in a `data.frame`? + +- Use that number to pull out just that last row in the initial + `rna` data frame. + +- Compare that with what you see as the last row using `tail()` to + make sure it's meeting expectations. + +- Pull out that last row using `nrow()` instead of the row number. + +- Create a new data frame (`rna_last`) from that last row. + +3. Use `nrow()` to extract the row that is in the middle of the + `rna` dataframe. Store the content of this row in an object + named `rna_middle`. + +4. Combine `nrow()` with the `-` notation above to reproduce the + behavior of `head(rna)`, keeping just the first through 6th + rows of the rna dataset. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +## 1. +rna_200 <- rna[200, ] +## 2. +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) +rna_last <- rna[n_rows, ] +## 3. +rna_middle <- rna[n_rows / 2, ] +## 4. +rna_head <- rna[-(7:n_rows), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Factors + +Factors represent **categorical data**. 
They are stored as integers +associated with labels and they can be ordered or unordered. While +factors look (and often behave) like character vectors, they are +actually treated as integer vectors by R. So you need to be very +careful when treating them as strings. + +Once created, factors can only contain a pre-defined set of values, +known as _levels_. By default, R always sorts levels in alphabetical +order. For instance, if you have a factor with 2 levels: + +```{r, purl=TRUE} +sex <- factor(c("male", "female", "female", "male", "female")) +``` + +R will assign `1` to the level `"female"` and `2` to the level +`"male"` (because `f` comes before `m`, even though the first element +in this vector is `"male"`). You can see this by using the function +`levels()` and you can find the number of levels using `nlevels()`: + +```{r, purl=TRUE} +levels(sex) +nlevels(sex) +``` + +Sometimes, the order of the factors does not matter, other times you +might want to specify the order because it is meaningful (e.g., "low", +"medium", "high"), it improves your visualization, or it is required +by a particular type of analysis. Here, one way to reorder our levels +in the `sex` vector would be: + +```{r, purl=TRUE} +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering +``` + +In R's memory, these factors are represented by integers (1, 2, 3), +but are more informative than integers because factors are self +describing: `"female"`, `"male"` is more descriptive than `1`, +`2`. Which one is "male"? You wouldn't be able to tell just from the +integer data. Factors, on the other hand, have this information built-in. +It is particularly helpful when there are many levels (like the +gene biotype in our example dataset). + +When your data is stored as a factor, you can use the `plot()` +function to get a quick glance at the number of observations +represented by each factor level. Let's look at the number of males +and females in our data. 
+ +```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} +plot(sex) +``` + +### Converting to character + +If you need to convert a factor to a character vector, you use +`as.character(x)`. + +```{r, purl=TRUE} +as.character(sex) +``` + +<!-- ### Numeric factors --> + +<!-- Converting factors where the levels appear as numbers (such as --> + +<!-- concentration levels, or years) to a numeric vector is a little --> + +<!-- trickier. The `as.numeric()` function returns the index values of the --> + +<!-- factor, not its levels, so it will result in an entirely new (and --> + +<!-- unwanted in this case) set of numbers. One method to avoid this is to --> + +<!-- convert factors to characters, and then to numbers. Another method is --> + +<!-- to use the `levels()` function. Compare: --> + +<!-- ```{r} --> + +<!-- year_fct <- factor(c(1990, 1983, 1977, 1998, 1990)) --> + +<!-- as.numeric(year_fct) ## Wrong! And there is no warning... --> + +<!-- as.numeric(as.character(year_fct)) ## Works... --> + +<!-- as.numeric(levels(year_fct))[year_fct] ## The recommended way. --> + +<!-- ``` + +<!-- Notice that in the `levels()` approach, three important steps occur: --> + +<!-- * We obtain all the factor levels using `levels(year_fct)` --> + +<!-- * We convert these levels to numeric values using `as.numeric(levels(year_fct))` --> + +<!-- * We then access these numeric values using the underlying integers of the --> + +<!-- vector `year_fct` inside the square brackets --> + +### Renaming factors + +To rename the levels of a factor, it is sufficient to assign new +levels: + +```{r, purl=TRUE} +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) +``` + +:::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +- Rename "F" and "M" to "Female" and "Male" respectively.
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +levels(sex) +levels(sex) <- c("Male", "Female") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We have seen how data frames are created when using `read.csv()`, but +they can also be created by hand with the `data.frame()` function. +There are a few mistakes in this hand-crafted `data.frame`. Can you +spot and fix them? Don't hesitate to experiment! + +```{r, eval=FALSE} +animal_data <- data.frame( + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), + weight = c(45, 8 1.1, 0.8)) +``` + +::::::::::::::: solution + +## Solution + +- missing quotations around the names of the animals +- missing one entry in the "feel" column (probably for one of the furry animals) +- missing one comma in the weight column + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you predict the class for each of the columns in the following +example? + +Check your guesses using `str(country_climate)`: + +- Are they what you expected? Why? Why not? + +- Try again by adding `stringsAsFactors = TRUE` after the last + variable when creating the data frame. What is happening now? + `stringsAsFactors` can also be set when reading text-based + spreadsheets into R using `read.csv()`. 
+ +```{r, eval=FALSE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +``` + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +str(country_climate) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The automatic conversion of data type is sometimes a blessing, sometimes an +annoyance. Be aware that it exists, learn the rules, and double check that data +you import in R are of the correct type within your data frame. If not, use it +to your advantage to detect mistakes that might have been introduced during data +entry (a letter in a column that should only contain numbers for instance). + +Learn more in this RStudio +tutorial + +## Matrices + +Before proceeding, now that we have learnt about data frames, let's +recap package installation and learn about a new data type, namely the +`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and +columns. But the major difference is that all cells in a `matrix` must +be of the same type: `numeric`, `character`, `logical`, ... In that +respect, matrices are closer to a `vector` than a `data.frame`. + +The default constructor for a matrix is `matrix`. It takes a vector of +values to populate the matrix and the number of row and/or +columns[^ncol]. The values are sorted along the columns, as illustrated +below. 
+ +```{r mat1, purl=TRUE} +m <- matrix(1:9, ncol = 3, nrow = 3) +m +``` + +[^ncol]: Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using the function `installed.packages()`, create a `character` matrix +containing the information about all packages currently installed on +your computer. Explore it. + +::::::::::::::: solution + +## Solution: + +```{r pkg_sln, eval=FALSE, purl=TRUE} +## create the matrix +ip <- installed.packages() +head(ip) +## try also View(ip) +## number of package +nrow(ip) +## names of all installed packages +rownames(ip) +## type of information we have about each package +colnames(ip) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +It is often useful to create large random data matrices as test +data. The exercise below asks you to create such a matrix with random +data drawn from a normal distribution of mean 0 and standard deviation +1, which can be done with the `rnorm()` function. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Construct a matrix of dimension 1000 by 3 of normally distributed data +(mean 0, standard deviation 1) + +::::::::::::::: solution + +## Solution + +```{r rnormmat_sln, purl=TRUE} +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formatting Dates + +One of the most common issues that new (and experienced!) R users have +is converting date and time information into a variable that is +appropriate and usable during analyses. + +### Note on dates in spreadsheet programs + +Dates in spreadsheets are generally stored in a single column. While +this seems the most natural way to record dates, it actually is not +best practice. 
A spreadsheet application will display the dates in a +seemingly correct way (to a human observer) but how it actually +handles and stores the dates may be problematic. It is often much +safer to store dates with YEAR, MONTH and DAY in separate columns or +as YEAR and DAY-OF-YEAR in separate columns. + +Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... have different (and often incompatible) ways of encoding +dates (even for the same program between versions and operating +systems). Additionally, Excel can turn things that aren't dates into +dates +(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, +OCT4. So if you're avoiding the date format overall, it's easier to +identify these issues. + +The Dates as +data +section of the Data Carpentry lesson provides additional insights +about pitfalls of dates with spreadsheets. + +We are going to use the `ymd()` function from the package +**`lubridate`** (which belongs to the **`tidyverse`**; learn more +[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed +as part of the **`tidyverse`** installation. When you load the +**`tidyverse`** (`library(tidyverse)`), the core packages (the +packages used in most data analyses) get loaded. **`lubridate`** +however does not belong to the core tidyverse, so you have to load it +explicitly with `library(lubridate)`. + +Start by loading the required package: + +```{r loadlibridate, message=FALSE, purl=TRUE} +library("lubridate") +``` + +`ymd()` takes a vector representing year, month, and day, and converts +it to a `Date` vector. `Date` is a class of data recognized by R as +being a date and can be manipulated as such. The argument that the +function requires is flexible, but, as a best practice, is a character +vector formatted as "YYYY-MM-DD". 
+ +Let's create a date object and inspect the structure: + +```{r, purl=TRUE} +my_date <- ymd("2015-01-01") +str(my_date) +``` + +Now let's paste the year, month, and day separately - we get the same result: + +```{r, purl=TRUE} +# sep indicates the character to use to separate each component +my_date <- ymd(paste("2015", "1", "1", sep = "-")) +str(my_date) +``` + +Let's now familiarise ourselves with a typical date manipulation +pipeline. The small dataset below stores dates in separate `year`, +`month` and `day` columns. + +```{r, purl=TRUE} +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x +``` + +Now we apply `ymd()` to the `x` dataset. We first create a +character vector from the `year`, `month`, and `day` columns of `x` +using `paste()`: + +```{r, purl=TRUE} +paste(x$year, x$month, x$day, sep = "-") +``` + +This character vector can be used as the argument for `ymd()`: + +```{r, purl=TRUE} +ymd(paste(x$year, x$month, x$day, sep = "-")) +``` + +The resulting `Date` vector can be added to `x` as a new column called `date`: + +```{r, purl=TRUE} +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'date' as the class +``` + +Let's make sure everything worked correctly. One way to inspect the +new column is to use `summary()`: + +```{r, purl=TRUE} +summary(x$date) +``` + +Note that `ymd()` expects to have the year, month and day, in that +order. If you have for instance day, month and year, you would need +`dmy()`. + +```{r, purl=TRUE} +dmy(paste(x$day, x$month, x$year, sep = "-")) +``` + +`lubridate` provides a parser for each ordering of year, month and day (`ymd()`, `dmy()`, `mdy()`, ...).
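To illustrate, here is a minimal sketch (assuming **`lubridate`** is installed; the input strings and the names `d1`, `d2` and `dt` are invented for this example) of two other parsers from the same family: `mdy()` for month-day-year input and `ymd_hms()` for a date with a time of day.

```r
library("lubridate")

## Month-day-year input, as often found in US-style records
d1 <- mdy("01/24/1996")
d2 <- ymd("1996-01-24")
d1 == d2               ## both describe the same day

## Date plus time of day; the result is a date-time (POSIXct), not a Date
dt <- ymd_hms("1996-01-24 13:45:00")
class(dt)

## Individual components can be extracted again
year(dt)
month(dt)
day(dt)
```

Choosing the parser whose name matches the order of the components in the data avoids silently swapping day and month.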
+ +## Summary of R objects + +So far, we have seen several types of R objects that vary in the number +of dimensions and in whether they can store a single or multiple data +types: + +- **`vector`**: one dimension (they have a length), single type of data. +- **`matrix`**: two dimensions, single type of data. +- **`data.frame`**: two dimensions, one type per column. + +## Lists + +A data type that we haven't seen yet, but that is useful to know and +follows from the summary above, is the list: + +- **`list`**: one dimension, every item can be of a different data + type. + +Below, let's create a list containing a vector of numbers, a vector of characters, +a matrix, a data frame and another list: + +```{r list0, purl=TRUE} +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) +str(l) +``` + +List subsetting is done using `[]` to subset a new sub-list or `[[]]` +to extract a single element of that list (using indices or names, if +the list is named). + +```{r, purl=TRUE} +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 +``` + +## Exporting and saving tabular data {#sec:exportandsave} + +We have seen how to read a text-based spreadsheet into R using the +`read.table` family of functions. To export a `data.frame` to a +text-based spreadsheet, we can use the `write.table` family of functions +(`write.csv`, `write.csv2`, ...). They all take the variable to be +exported and the file to be exported to. For example, to export the +`rna` data to the `my_rna.csv` file in the `data_output` +directory, we would execute: + +```{r, eval=FALSE, purl=TRUE} +write.csv(rna, file = "data_output/my_rna.csv") +``` + +This new CSV file can now be shared with other collaborators who +aren't familiar with R.
Note that even though there are commas in some of +the fields in the `data.frame` (see for example the "product" column), R will +by default surround each field with quotes, and thus we will be able to +read it back into R correctly, despite also using commas as column +separators. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From af011325ccdce291ae0221cca60f99938bf27e90 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:52 +0900 Subject: [PATCH 020/334] New translations 25-starting-with-data.md (Portuguese) --- locale/pt/episodes/25-starting-with-data.Rmd | 781 +++++++++++++++++++ 1 file changed, 781 insertions(+) create mode 100644 locale/pt/episodes/25-starting-with-data.Rmd diff --git a/locale/pt/episodes/25-starting-with-data.Rmd b/locale/pt/episodes/25-starting-with-data.Rmd new file mode 100644 index 000000000..bc29da0cd --- /dev/null +++ b/locale/pt/episodes/25-starting-with-data.Rmd @@ -0,0 +1,781 @@ +--- +source: Rmd +title: Starting with data +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe what a `data.frame` is. +- Load external data from a .csv file into a data frame. +- Summarize the contents of a data frame. +- Describe what a factor is. +- Convert between strings and factors. +- Reorder and rename factors. +- Format dates. +- Export and save data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First data analysis in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Presentation of the gene expression data + +We are going to use part of the data published by Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_. 
The goal of the study was to determine the effect of an +upper-respiratory infection on changes in RNA transcription occurring +in the cerebellum and spinal cord post infection. Gender matched eight +week old C57BL/6 mice were inoculated with saline or with Influenza A by +intranasal route and transcriptomic changes in the cerebellum and +spinal cord tissues were evaluated by RNA-seq at days 0 +(non-infected), 4 and 8. + +The dataset is stored as a comma-separated values (CSV) file. Each row +holds information for a single RNA expression measurement, and the first eleven +columns represent: + +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | +| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | +| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | +| mouse | The mouse unique identifier. | + +We are going to use the R function `download.file()` to download the +CSV file that contains the gene expression data, and we will use +`read.csv()` to load into memory the content of the CSV file as an +object of class `data.frame`. Inside the `download.file` command, the +first entry is a character string with the source URL. This source URL +downloads a CSV file from a GitHub repository. The text after the +comma (`"data/rnaseq.csv"`) is the destination of the file on your +local machine. 
You'll need to have a folder on your machine called +`"data"` where you'll download the file. So this command downloads the +remote file, names it `"rnaseq.csv"` and adds it to a preexisting +folder named `"data"`. + +```{r, eval=TRUE} +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +You are now ready to load the data: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +This statement doesn't produce any output because, as you might +recall, assignments don't display anything. If we want to check that +our data has been loaded, we can see the contents of the data frame by +typing its name: + +```{r, eval=FALSE} +rna +``` + +Wow\... that was a lot of output. At least it means the data loaded +properly. Let's check the top (the first 6 lines) of this data frame +using the function `head()`: + +```{r, purl=TRUE} +head(rna) +## Try also +## View(rna) +``` + +**Note** + +`read.csv()` assumes that fields are delimited by commas. However, in +several countries, the comma is used as a decimal separator and the +semicolon (;) is used as a field delimiter. If you want to read +this type of file into R, you can use the `read.csv2()` function. It +behaves exactly like `read.csv()` but uses different defaults for +the decimal and the field separators. If you are working with another +format, both can be specified by the user. Check out the help for +`read.csv()` by typing `?read.csv` to learn more. There is also the +`read.delim()` function for reading tab-separated data files. It is important to +note that all of these functions are actually wrapper functions for +the main `read.table()` function with different arguments. As such, +the data above could have also been loaded by using `read.table()` +with the `sep` argument set to `","`.
The code is as follows: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) +``` + +The header argument has to be set to TRUE to be able to read the +headers as by default `read.table()` has the header argument set to +FALSE. + +## What are data frames? + +Data frames are the _de facto_ data structure for most tabular data, +and what we use for statistics and plotting. + +A data frame can be created by hand, but most commonly they are +generated by the functions `read.csv()` or `read.table()`; in other +words, when importing spreadsheets from your hard drive (or the web). + +A data frame is the representation of data in the format of a table +where the columns are vectors that all have the same length. Because +columns are vectors, each column must contain a single type of data +(e.g., characters, integers, factors). For example, here is a figure +depicting a data frame comprising a numeric, a character, and a +logical vector. + +![](./fig/data-frame.svg) + +We can see this when inspecting the <b>str</b>ucture of a data frame +with the function `str()`: + +```{r} +str(rna) +``` + +## Inspecting `data.frame` Objects + +We already saw how the functions `head()` and `str()` can be useful to +check the content and the structure of a data frame. Here is a +non-exhaustive list of functions to get a sense of the +content/structure of the data. Let's try them out! + +**Size**: + +- `dim(rna)` - returns a vector with the number of rows as the first + element, and the number of columns as the second element (the + **dim**ensions of the object). +- `nrow(rna)` - returns the number of rows. +- `ncol(rna)` - returns the number of columns. + +**Content**: + +- `head(rna)` - shows the first 6 rows. +- `tail(rna)` - shows the last 6 rows. + +**Names**: + +- `names(rna)` - returns the column names (synonym of `colnames()` for + `data.frame` objects). +- `rownames(rna)` - returns the row names. 
+ +**Summary**: + +- `str(rna)` - structure of the object and information about the + class, length and content of each column. +- `summary(rna)` - summary statistics for each column. + +Note: most of these functions are "generic", they can be used on other types of +objects besides `data.frame`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Based on the output of `str(rna)`, can you answer the following +questions? + +- What is the class of the object `rna`? +- How many rows and how many columns are in this object? + +::::::::::::::: solution + +## Solution + +- class: data frame +- how many rows: 66465, how many columns: 11 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Indexing and subsetting data frames + +Our `rna` data frame has rows and columns (it has 2 dimensions); if we +want to extract some specific data from it, we need to specify the +"coordinates" we want. Row numbers come first, followed by +column numbers. However, note that different ways of specifying these +coordinates lead to results with different classes. + +```{r, eval=FALSE, purl=TRUE} +# first element in the first column of the data frame (as a vector) +rna[1, 1] +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) +rna[, 1] +# first column of the data frame (as a data.frame) +rna[1] +# first three elements in the 7th column (as a vector) +rna[1:3, 7] +# the 3rd row of the data frame (as a data.frame) +rna[3, ] +# equivalent to head_rna <- head(rna) +head_rna <- rna[1:6, ] +head_rna +``` + +`:` is a special function that creates numeric vectors of integers in +increasing or decreasing order, test `1:10` and `10:1` for +instance. See section @ref(sec:genvec) for details. 
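
The way the result class depends on the indexing style can be checked directly with `class()`. The sketch below is only an illustration: it uses a small made-up data frame called `toy` (not part of the lesson data) so that it runs on its own, and the same pattern should carry over to `rna`.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(gene = c("g1", "g2", "g3"),
                  expression = c(10.5, 3.2, 8.1),
                  stringsAsFactors = FALSE)

1:3                 # increasing integer sequence
3:1                 # decreasing integer sequence

class(toy[1, 2])    # a single cell, returned as a vector
class(toy[, 1])     # one column, returned as a vector
class(toy[1])       # one column, returned as a data.frame
class(toy[2, ])     # one row, returned as a data.frame
```

Calling `class()` on a subset before using it is a cheap way to find out whether a vector or a `data.frame` came back.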
+ +You can also exclude certain indices of a data frame using the "`-`" sign: + +```{r, eval=FALSE, purl=TRUE} +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) +``` + +Data frames can be subsetted by calling indices (as shown previously), +but also by calling their column names directly: + +```{r, eval=FALSE, purl=TRUE} +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector +``` + +In RStudio, you can use the autocompletion feature to get the full and +correct names of the columns. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. Create a `data.frame` (`rna_200`) containing only the data in + row 200 of the `rna` dataset. + +2. Notice how `nrow()` gave you the number of rows in a `data.frame`? + +- Use that number to pull out just that last row in the initial + `rna` data frame. + +- Compare that with what you see as the last row using `tail()` to + make sure it's meeting expectations. + +- Pull out that last row using `nrow()` instead of the row number. + +- Create a new data frame (`rna_last`) from that last row. + +3. Use `nrow()` to extract the row that is in the middle of the + `rna` dataframe. Store the content of this row in an object + named `rna_middle`. + +4. Combine `nrow()` with the `-` notation above to reproduce the + behavior of `head(rna)`, keeping just the first through 6th + rows of the rna dataset. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +## 1. +rna_200 <- rna[200, ] +## 2. +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) +rna_last <- rna[n_rows, ] +## 3. +rna_middle <- rna[n_rows / 2, ] +## 4. +rna_head <- rna[-(7:n_rows), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Factors + +Factors represent **categorical data**. 
They are stored as integers +associated with labels and they can be ordered or unordered. While +factors look (and often behave) like character vectors, they are +actually treated as integer vectors by R. So you need to be very +careful when treating them as strings. + +Once created, factors can only contain a pre-defined set of values, +known as _levels_. By default, R always sorts levels in alphabetical +order. For instance, if you have a factor with 2 levels: + +```{r, purl=TRUE} +sex <- factor(c("male", "female", "female", "male", "female")) +``` + +R will assign `1` to the level `"female"` and `2` to the level +`"male"` (because `f` comes before `m`, even though the first element +in this vector is `"male"`). You can see this by using the function +`levels()` and you can find the number of levels using `nlevels()`: + +```{r, purl=TRUE} +levels(sex) +nlevels(sex) +``` + +Sometimes, the order of the factors does not matter, other times you +might want to specify the order because it is meaningful (e.g., "low", +"medium", "high"), it improves your visualization, or it is required +by a particular type of analysis. Here, one way to reorder our levels +in the `sex` vector would be: + +```{r, purl=TRUE} +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering +``` + +In R's memory, these factors are represented by integers (1, 2, 3), +but are more informative than integers because factors are self +describing: `"female"`, `"male"` is more descriptive than `1`, +`2`. Which one is "male"? You wouldn't be able to tell just from the +integer data. Factors, on the other hand, have this information built-in. +It is particularly helpful when there are many levels (like the +gene biotype in our example dataset). + +When your data is stored as a factor, you can use the `plot()` +function to get a quick glance at the number of observations +represented by each factor level. Let's look at the number of males +and females in our data. 
+ +```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} +plot(sex) +``` + +### Converting to character + +If you need to convert a factor to a character vector, you use +`as.character(x)`. + +```{r, purl=TRUE} +as.character(sex) +``` + +<!-- ### Numeric factors --> + +<!-- Converting factors where the levels appear as numbers (such as --> + +<!-- concentration levels, or years) to a numeric vector is a little --> + +<!-- trickier. The `as.numeric()` function returns the index values of the --> + +<!-- factor, not its levels, so it will result in an entirely new (and --> + +<!-- unwanted in this case) set of numbers. One method to avoid this is to --> + +<!-- convert factors to characters, and then to numbers. Another method is --> + +<!-- to use the `levels()` function. Compare: --> + +<!-- ```{r} --> + +<!-- year_fct <- factor(c(1990, 1983, 1977, 1998, 1990)) --> + +<!-- as.numeric(year_fct) ## Wrong! And there is no warning... --> + +<!-- as.numeric(as.character(year_fct)) ## Works... --> + +<!-- as.numeric(levels(year_fct))[year_fct] ## The recommended way. --> + +<!-- ``` --> + +<!-- Notice that in the `levels()` approach, three important steps occur: --> + +<!-- * We obtain all the factor levels using `levels(year_fct)` --> + +<!-- * We convert these levels to numeric values using `as.numeric(levels(year_fct))` --> + +<!-- * We then access these numeric values using the underlying integers of the --> + +<!-- vector `year_fct` inside the square brackets --> + +### Renaming factors + +If we want to rename the levels of this factor, it is sufficient to reassign its +levels: + +```{r, purl=TRUE} +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) +``` + +:::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +- Rename "F" and "M" to "Female" and "Male" respectively.
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +levels(sex) +levels(sex) <- c("Male", "Female") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We have seen how data frames are created when using `read.csv()`, but +they can also be created by hand with the `data.frame()` function. +There are a few mistakes in this hand-crafted `data.frame`. Can you +spot and fix them? Don't hesitate to experiment! + +```{r, eval=FALSE} +animal_data <- data.frame( + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), + weight = c(45, 8 1.1, 0.8)) +``` + +::::::::::::::: solution + +## Solution + +- missing quotations around the names of the animals +- missing one entry in the "feel" column (probably for one of the furry animals) +- missing one comma in the weight column + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you predict the class for each of the columns in the following +example? + +Check your guesses using `str(country_climate)`: + +- Are they what you expected? Why? Why not? + +- Try again by adding `stringsAsFactors = TRUE` after the last + variable when creating the data frame. What is happening now? + `stringsAsFactors` can also be set when reading text-based + spreadsheets into R using `read.csv()`. 
+ +```{r, eval=FALSE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +``` + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +str(country_climate) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The automatic conversion of data type is sometimes a blessing, sometimes an +annoyance. Be aware that it exists, learn the rules, and double check that data +you import in R are of the correct type within your data frame. If not, use it +to your advantage to detect mistakes that might have been introduced during data +entry (a letter in a column that should only contain numbers for instance). + +Learn more in this RStudio +tutorial + +## Matrices + +Before proceeding, now that we have learnt about data frames, let's +recap package installation and learn about a new data type, namely the +`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and +columns. But the major difference is that all cells in a `matrix` must +be of the same type: `numeric`, `character`, `logical`, ... In that +respect, matrices are closer to a `vector` than a `data.frame`. + +The default constructor for a matrix is `matrix`. It takes a vector of +values to populate the matrix and the number of row and/or +columns[^ncol]. The values are sorted along the columns, as illustrated +below. 
+ +```{r mat1, purl=TRUE} +m <- matrix(1:9, ncol = 3, nrow = 3) +m +``` + +[^ncol]: Either the number of rows or the number of columns is enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using the function `installed.packages()`, create a `character` matrix +containing the information about all packages currently installed on +your computer. Explore it. + +::::::::::::::: solution + +## Solution: + +```{r pkg_sln, eval=FALSE, purl=TRUE} +## create the matrix +ip <- installed.packages() +head(ip) +## try also View(ip) +## number of packages +nrow(ip) +## names of all installed packages +rownames(ip) +## type of information we have about each package +colnames(ip) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +It is often useful to create large random data matrices as test +data. The exercise below asks you to create such a matrix with random +data drawn from a normal distribution of mean 0 and standard deviation +1, which can be done with the `rnorm()` function. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Construct a matrix of dimension 1000 by 3 of normally distributed data +(mean 0, standard deviation 1) + +::::::::::::::: solution + +## Solution + +```{r rnormmat_sln, purl=TRUE} +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formatting Dates + +One of the most common issues that new (and experienced!) R users have +is converting date and time information into a variable that is +appropriate and usable during analyses. + +### Note on dates in spreadsheet programs + +Dates in spreadsheets are generally stored in a single column. While +this seems the most natural way to record dates, it actually is not +best practice.
A spreadsheet application will display the dates in a +seemingly correct way (to a human observer) but how it actually +handles and stores the dates may be problematic. It is often much +safer to store dates with YEAR, MONTH and DAY in separate columns or +as YEAR and DAY-OF-YEAR in separate columns. + +Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... have different (and often incompatible) ways of encoding +dates (even for the same program between versions and operating +systems). Additionally, Excel can turn things that aren't dates into +dates +(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, +OCT4. So if you're avoiding the date format overall, it's easier to +identify these issues. + +The Dates as +data +section of the Data Carpentry lesson provides additional insights +about pitfalls of dates with spreadsheets. + +We are going to use the `ymd()` function from the package +**`lubridate`** (which belongs to the **`tidyverse`**; learn more +[here](https://www.tidyverse.org/)). **`lubridate`** gets installed +as part of the **`tidyverse`** installation. When you load the +**`tidyverse`** (`library(tidyverse)`), the core packages (the +packages used in most data analyses) get loaded. In **`tidyverse`** versions before 2.0.0, **`lubridate`** +does not belong to the core tidyverse, so the safest option is to load it +explicitly with `library(lubridate)`. + +Start by loading the required package: + +```{r loadlibridate, message=FALSE, purl=TRUE} +library("lubridate") +``` + +`ymd()` takes a vector representing year, month, and day, and converts +it to a `Date` vector. `Date` is a class of data recognized by R as +being a date and can be manipulated as such. The argument that the +function requires is flexible, but, as a best practice, is a character +vector formatted as "YYYY-MM-DD".
+ +Let's create a date object and inspect the structure: + +```{r, purl=TRUE} +my_date <- ymd("2015-01-01") +str(my_date) +``` + +Now let's paste the year, month, and day separately - we get the same result: + +```{r, purl=TRUE} +# sep indicates the character to use to separate each component +my_date <- ymd(paste("2015", "1", "1", sep = "-")) +str(my_date) +``` + +Let's now familiarise ourselves with a typical date manipulation +pipeline. The small data frame below stores dates in separate `year`, +`month` and `day` columns. + +```{r, purl=TRUE} +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x +``` + +Now we apply `ymd()` to the `x` dataset. We first create a +character vector from the `year`, `month`, and `day` columns of `x` +using `paste()`: + +```{r, purl=TRUE} +paste(x$year, x$month, x$day, sep = "-") +``` + +This character vector can be used as the argument for `ymd()`: + +```{r, purl=TRUE} +ymd(paste(x$year, x$month, x$day, sep = "-")) +``` + +The resulting `Date` vector can be added to `x` as a new column called `date`: + +```{r, purl=TRUE} +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'date' as the class +``` + +Let's make sure everything worked correctly. One way to inspect the +new column is to use `summary()`: + +```{r, purl=TRUE} +summary(x$date) +``` + +Note that `ymd()` expects to have the year, month and day, in that +order. If you have for instance day, month and year, you would need +`dmy()`. + +```{r, purl=TRUE} +dmy(paste(x$day, x$month, x$year, sep = "-")) +``` + +`lubridate` provides many more functions to handle other date formats.
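
As a minimal sketch of a few of these parsers, assuming the **`lubridate`** package is installed, the same calendar day written in three different orders is parsed to the same `Date`. The date strings and the names `d1`, `d2` and `d3` are made up for illustration.

```r
library("lubridate")

# The same day written in three different orders (made-up examples)
d1 <- ymd("2015-01-31")   # year, month, day
d2 <- dmy("31/01/2015")   # day, month, year
d3 <- mdy("01-31-2015")   # month, day, year

d1 == d2                  # compare the parsed dates
d2 == d3
class(d1)                 # all three are Date objects

# The components can be extracted again from a Date
year(d1)
month(d1)
day(d1)
```

Choosing the parser whose name matches the order of the components in the input is usually all that is needed; the separator between the components is detected automatically.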
+ +## Summary of R objects + +So far, we have seen several types of R objects that vary in the number +of dimensions and in whether they can store a single or multiple data +types: + +- **`vector`**: one dimension (they have a length), single type of data. +- **`matrix`**: two dimensions, single type of data. +- **`data.frame`**: two dimensions, one type per column. + +## Lists + +A data type that we haven't seen yet, but that is useful to know and +follows from the summary that we have just seen, is the list: + +- **`list`**: one dimension, every item can be of a different data + type. + +Below, let's create a list containing a vector of numbers, a vector of characters, +a matrix, a data frame and another list: + +```{r list0, purl=TRUE} +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) +str(l) +``` + +List subsetting is done using `[]` to subset a new sub-list or `[[]]` +to extract a single element of that list (using indices or names, if +the list is named). + +```{r, purl=TRUE} +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 +``` + +## Exporting and saving tabular data {#sec:exportandsave} + +We have seen how to read a text-based spreadsheet into R using the +`read.table` family of functions. To export a `data.frame` to a +text-based spreadsheet, we can use the `write.table` set of functions +(`write.csv`, `write.csv2`, ...). They all take the variable to be +exported and the file to be exported to. For example, to export the +`rna` data to the `my_rna.csv` file in the `data_output` +directory, we would execute: + +```{r, eval=FALSE, purl=TRUE} +write.csv(rna, file = "data_output/my_rna.csv") +``` + +This new CSV file can now be shared with other collaborators who +aren't familiar with R.
Note that even though there are commas in some of +the fields in the `data.frame` (see for example the "product" column), R will +by default surround each field with quotes, and thus we will be able to +read it back into R correctly, despite also using commas as column +separators. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From a75f272ec14e16e75a07aa788689ff9ad5200f70 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:54 +0900 Subject: [PATCH 021/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 781 +++++++++++++++++++ 1 file changed, 781 insertions(+) create mode 100644 locale/zh/episodes/25-starting-with-data.Rmd diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd new file mode 100644 index 000000000..bc29da0cd --- /dev/null +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -0,0 +1,781 @@ +--- +source: Rmd +title: Starting with data +teaching: 30 +exercises: 30 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe what a `data.frame` is. +- Load external data from a .csv file into a data frame. +- Summarize the contents of a data frame. +- Describe what a factor is. +- Convert between strings and factors. +- Reorder and rename factors. +- Format dates. +- Export and save data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- First data analysis in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Presentation of the gene expression data + +We are going to use part of the data published by Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_. 
The goal of the study was to determine the effect of an +upper-respiratory infection on changes in RNA transcription occurring +in the cerebellum and spinal cord post infection. Gender matched eight +week old C57BL/6 mice were inoculated with saline or with Influenza A by +intranasal route and transcriptomic changes in the cerebellum and +spinal cord tissues were evaluated by RNA-seq at days 0 +(non-infected), 4 and 8. + +The dataset is stored as a comma-separated values (CSV) file. Each row +holds information for a single RNA expression measurement, and the first eleven +columns represent: + +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | +| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | +| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | +| mouse | The mouse unique identifier. | + +We are going to use the R function `download.file()` to download the +CSV file that contains the gene expression data, and we will use +`read.csv()` to load into memory the content of the CSV file as an +object of class `data.frame`. Inside the `download.file` command, the +first entry is a character string with the source URL. This source URL +downloads a CSV file from a GitHub repository. The text after the +comma (`"data/rnaseq.csv"`) is the destination of the file on your +local machine. 
You'll need to have a folder on your machine called +`"data"` where you'll download the file. So this command downloads the +remote file, names it `"rnaseq.csv"` and adds it to a preexisting +folder named `"data"`. + +```{r, eval=TRUE} +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +You are now ready to load the data: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +This statement doesn't produce any output because, as you might +recall, assignments don't display anything. If we want to check that +our data has been loaded, we can see the contents of the data frame by +typing its name: + +```{r, eval=FALSE} +rna +``` + +Wow\... that was a lot of output. At least it means the data loaded +properly. Let's check the top (the first 6 lines) of this data frame +using the function `head()`: + +```{r, purl=TRUE} +head(rna) +## Try also +## View(rna) +``` + +**Note** + +`read.csv()` assumes that fields are delimited by commas. However, in +several countries, the comma is used as a decimal separator and the +semicolon (;) is used as a field delimiter. If you want to read +this type of file into R, you can use the `read.csv2()` function. It +behaves exactly like `read.csv()` but uses different defaults for +the decimal and the field separators. If you are working with another +format, both can be specified by the user. Check out the help for +`read.csv()` by typing `?read.csv` to learn more. There is also the +`read.delim()` function for reading tab-separated data files. It is important to +note that all of these functions are actually wrapper functions for +the main `read.table()` function with different arguments. As such, +the data above could have also been loaded by using `read.table()` +with the `sep` argument set to `","`.
The code is as follows: + +```{r, eval=TRUE, purl=TRUE} +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) +``` + +The header argument has to be set to TRUE to be able to read the +headers as by default `read.table()` has the header argument set to +FALSE. + +## What are data frames? + +Data frames are the _de facto_ data structure for most tabular data, +and what we use for statistics and plotting. + +A data frame can be created by hand, but most commonly they are +generated by the functions `read.csv()` or `read.table()`; in other +words, when importing spreadsheets from your hard drive (or the web). + +A data frame is the representation of data in the format of a table +where the columns are vectors that all have the same length. Because +columns are vectors, each column must contain a single type of data +(e.g., characters, integers, factors). For example, here is a figure +depicting a data frame comprising a numeric, a character, and a +logical vector. + +![](./fig/data-frame.svg) + +We can see this when inspecting the <b>str</b>ucture of a data frame +with the function `str()`: + +```{r} +str(rna) +``` + +## Inspecting `data.frame` Objects + +We already saw how the functions `head()` and `str()` can be useful to +check the content and the structure of a data frame. Here is a +non-exhaustive list of functions to get a sense of the +content/structure of the data. Let's try them out! + +**Size**: + +- `dim(rna)` - returns a vector with the number of rows as the first + element, and the number of columns as the second element (the + **dim**ensions of the object). +- `nrow(rna)` - returns the number of rows. +- `ncol(rna)` - returns the number of columns. + +**Content**: + +- `head(rna)` - shows the first 6 rows. +- `tail(rna)` - shows the last 6 rows. + +**Names**: + +- `names(rna)` - returns the column names (synonym of `colnames()` for + `data.frame` objects). +- `rownames(rna)` - returns the row names. 
+ +**Summary**: + +- `str(rna)` - structure of the object and information about the + class, length and content of each column. +- `summary(rna)` - summary statistics for each column. + +Note: most of these functions are "generic", they can be used on other types of +objects besides `data.frame`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Based on the output of `str(rna)`, can you answer the following +questions? + +- What is the class of the object `rna`? +- How many rows and how many columns are in this object? + +::::::::::::::: solution + +## Solution + +- class: data frame +- how many rows: 66465, how many columns: 11 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Indexing and subsetting data frames + +Our `rna` data frame has rows and columns (it has 2 dimensions); if we +want to extract some specific data from it, we need to specify the +"coordinates" we want. Row numbers come first, followed by +column numbers. However, note that different ways of specifying these +coordinates lead to results with different classes. + +```{r, eval=FALSE, purl=TRUE} +# first element in the first column of the data frame (as a vector) +rna[1, 1] +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) +rna[, 1] +# first column of the data frame (as a data.frame) +rna[1] +# first three elements in the 7th column (as a vector) +rna[1:3, 7] +# the 3rd row of the data frame (as a data.frame) +rna[3, ] +# equivalent to head_rna <- head(rna) +head_rna <- rna[1:6, ] +head_rna +``` + +`:` is a special function that creates numeric vectors of integers in +increasing or decreasing order, test `1:10` and `10:1` for +instance. See section @ref(sec:genvec) for details. 
+ +You can also exclude certain indices of a data frame using the "`-`" sign: + +```{r, eval=FALSE, purl=TRUE} +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) +``` + +Data frames can be subsetted by calling indices (as shown previously), +but also by calling their column names directly: + +```{r, eval=FALSE, purl=TRUE} +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector +``` + +In RStudio, you can use the autocompletion feature to get the full and +correct names of the columns. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. Create a `data.frame` (`rna_200`) containing only the data in + row 200 of the `rna` dataset. + +2. Notice how `nrow()` gave you the number of rows in a `data.frame`? + +- Use that number to pull out just that last row in the initial + `rna` data frame. + +- Compare that with what you see as the last row using `tail()` to + make sure it's meeting expectations. + +- Pull out that last row using `nrow()` instead of the row number. + +- Create a new data frame (`rna_last`) from that last row. + +3. Use `nrow()` to extract the row that is in the middle of the + `rna` dataframe. Store the content of this row in an object + named `rna_middle`. + +4. Combine `nrow()` with the `-` notation above to reproduce the + behavior of `head(rna)`, keeping just the first through 6th + rows of the rna dataset. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +## 1. +rna_200 <- rna[200, ] +## 2. +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) +rna_last <- rna[n_rows, ] +## 3. +rna_middle <- rna[n_rows / 2, ] +## 4. +rna_head <- rna[-(7:n_rows), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Factors + +Factors represent **categorical data**. 
They are stored as integers +associated with labels and they can be ordered or unordered. While +factors look (and often behave) like character vectors, they are +actually treated as integer vectors by R. So you need to be very +careful when treating them as strings. + +Once created, factors can only contain a pre-defined set of values, +known as _levels_. By default, R always sorts levels in alphabetical +order. For instance, if you have a factor with 2 levels: + +```{r, purl=TRUE} +sex <- factor(c("male", "female", "female", "male", "female")) +``` + +R will assign `1` to the level `"female"` and `2` to the level +`"male"` (because `f` comes before `m`, even though the first element +in this vector is `"male"`). You can see this by using the function +`levels()` and you can find the number of levels using `nlevels()`: + +```{r, purl=TRUE} +levels(sex) +nlevels(sex) +``` + +Sometimes, the order of the factors does not matter, other times you +might want to specify the order because it is meaningful (e.g., "low", +"medium", "high"), it improves your visualization, or it is required +by a particular type of analysis. Here, one way to reorder our levels +in the `sex` vector would be: + +```{r, purl=TRUE} +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering +``` + +In R's memory, these factors are represented by integers (1, 2, 3), +but are more informative than integers because factors are self +describing: `"female"`, `"male"` is more descriptive than `1`, +`2`. Which one is "male"? You wouldn't be able to tell just from the +integer data. Factors, on the other hand, have this information built-in. +It is particularly helpful when there are many levels (like the +gene biotype in our example dataset). + +When your data is stored as a factor, you can use the `plot()` +function to get a quick glance at the number of observations +represented by each factor level. Let's look at the number of males +and females in our data. 
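The integer storage described above can be checked directly, and the counts that a bar plot displays can also be tabulated. This is a minimal sketch using a hypothetical `sex_demo` vector (default alphabetical levels), so that the lesson's `sex` object is left untouched.

```r
# `sex_demo` is a hypothetical factor used only for this illustration
sex_demo <- factor(c("male", "female", "female", "male", "female"))

# Underlying integer codes: position of each value in levels(sex_demo)
codes <- as.integer(sex_demo)
codes

# Mapping the codes back to their labels recovers the original values
levels(sex_demo)[codes]

# Number of observations per level
counts <- table(sex_demo)
counts
```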
+ +```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} +plot(sex) +``` + +### Converting to character + +If you need to convert a factor to a character vector, you use +`as.character(x)`. + +```{r, purl=TRUE} +as.character(sex) +``` + +<!-- ### Numeric factors --> + +<!-- Converting factors where the levels appear as numbers (such as --> + +<!-- concentration levels, or years) to a numeric vector is a little --> + +<!-- trickier. The `as.numeric()` function returns the index values of the --> + +<!-- factor, not its levels, so it will result in an entirely new (and --> + +<!-- unwanted in this case) set of numbers. One method to avoid this is to --> + +<!-- convert factors to characters, and then to numbers. Another method is --> + +<!-- to use the `levels()` function. Compare: --> + +<!-- ```{r} --> + +<!-- year_fct <- factor(c(1990, 1983, 1977, 1998, 1990)) --> + +<!-- as.numeric(year_fct) ## Wrong! And there is no warning... --> + +<!-- as.numeric(as.character(year_fct)) ## Works... --> + +<!-- as.numeric(levels(year_fct))[year_fct] ## The recommended way. --> + +<!-- ``` --> + +<!-- Notice that in the `levels()` approach, three important steps occur: --> + +<!-- * We obtain all the factor levels using `levels(year_fct)` --> + +<!-- * We convert these levels to numeric values using `as.numeric(levels(year_fct))` --> + +<!-- * We then access these numeric values using the underlying integers of the --> + +<!-- vector `year_fct` inside the square brackets --> + +### Renaming factors + +If we want to rename the categories of this factor, it is sufficient to change its +levels: + +```{r, purl=TRUE} +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) +``` + +:::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +- Rename "F" and "M" to "Female" and "Male" respectively.
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +levels(sex) +levels(sex) <- c("Male", "Female") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +We have seen how data frames are created when using `read.csv()`, but +they can also be created by hand with the `data.frame()` function. +There are a few mistakes in this hand-crafted `data.frame`. Can you +spot and fix them? Don't hesitate to experiment! + +```{r, eval=FALSE} +animal_data <- data.frame( + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), + weight = c(45, 8 1.1, 0.8)) +``` + +::::::::::::::: solution + +## Solution + +- missing quotations around the names of the animals +- missing one entry in the "feel" column (probably for one of the furry animals) +- missing one comma in the weight column + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Can you predict the class for each of the columns in the following +example? + +Check your guesses using `str(country_climate)`: + +- Are they what you expected? Why? Why not? + +- Try again by adding `stringsAsFactors = TRUE` after the last + variable when creating the data frame. What is happening now? + `stringsAsFactors` can also be set when reading text-based + spreadsheets into R using `read.csv()`. 
+ +```{r, eval=FALSE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +``` + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +country_climate <- data.frame( + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) +str(country_climate) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The automatic conversion of data type is sometimes a blessing, sometimes an +annoyance. Be aware that it exists, learn the rules, and double check that data +you import in R are of the correct type within your data frame. If not, use it +to your advantage to detect mistakes that might have been introduced during data +entry (a letter in a column that should only contain numbers for instance). + +Learn more in this RStudio +tutorial + +## Matrices + +Before proceeding, now that we have learnt about data frames, let's +recap package installation and learn about a new data type, namely the +`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and +columns. But the major difference is that all cells in a `matrix` must +be of the same type: `numeric`, `character`, `logical`, ... In that +respect, matrices are closer to a `vector` than a `data.frame`. + +The default constructor for a matrix is `matrix`. It takes a vector of +values to populate the matrix and the number of row and/or +columns[^ncol]. The values are sorted along the columns, as illustrated +below. 
+ +```{r mat1, purl=TRUE} +m <- matrix(1:9, ncol = 3, nrow = 3) +m +``` + +[^ncol]: Either the number of rows or the number of columns is enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using the function `installed.packages()`, create a `character` matrix +containing the information about all packages currently installed on +your computer. Explore it. + +::::::::::::::: solution + +## Solution: + +```{r pkg_sln, eval=FALSE, purl=TRUE} +## create the matrix +ip <- installed.packages() +head(ip) +## try also View(ip) +## number of packages +nrow(ip) +## names of all installed packages +rownames(ip) +## type of information we have about each package +colnames(ip) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +It is often useful to create large random data matrices as test +data. The exercise below asks you to create such a matrix with random +data drawn from a normal distribution of mean 0 and standard deviation +1, which can be done with the `rnorm()` function. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Construct a matrix of dimension 1000 by 3 of normally distributed data +(mean 0, standard deviation 1). + +::::::::::::::: solution + +## Solution + +```{r rnormmat_sln, purl=TRUE} +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formatting Dates + +One of the most common issues that new (and experienced!) R users have +is converting date and time information into a variable that is +appropriate and usable during analyses. + +### Note on dates in spreadsheet programs + +Dates in spreadsheets are generally stored in a single column. While +this seems the most natural way to record dates, it actually is not +best practice.
A spreadsheet application will display the dates in a +seemingly correct way (to a human observer) but how it actually +handles and stores the dates may be problematic. It is often much +safer to store dates with YEAR, MONTH and DAY in separate columns or +as YEAR and DAY-OF-YEAR in separate columns. + +Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... have different (and often incompatible) ways of encoding +dates (even for the same program between versions and operating +systems). Additionally, Excel can turn things that aren't dates into +dates +(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, +OCT4. So if you're avoiding the date format overall, it's easier to +identify these issues. + +The Dates as data +section of the Data Carpentry lesson provides additional insights +about pitfalls of dates with spreadsheets. + +We are going to use the `ymd()` function from the package +**`lubridate`** (which belongs to the **`tidyverse`**; learn more +[here](https://www.tidyverse.org/)). **`lubridate`** gets installed +as part of the **`tidyverse`** installation. When you load the +**`tidyverse`** (`library(tidyverse)`), the core packages (the +packages used in most data analyses) get loaded. In tidyverse versions +before 2.0.0, **`lubridate`** does not belong to the core tidyverse, so you +have to load it explicitly with `library(lubridate)`; doing so is harmless +in later versions, where it is loaded with the core packages. + +Start by loading the required package: + +```{r loadlibridate, message=FALSE, purl=TRUE} +library("lubridate") +``` + +`ymd()` takes a vector representing year, month, and day, and converts +it to a `Date` vector. `Date` is a class of data recognized by R as +being a date and can be manipulated as such. The argument that the +function requires is flexible, but, as a best practice, is a character +vector formatted as "YYYY-MM-DD".
+ +Let's create a date object and inspect the structure: + +```{r, purl=TRUE} +my_date <- ymd("2015-01-01") +str(my_date) +``` + +Now let's paste the year, month, and day separately - we get the same result: + +```{r, purl=TRUE} +# sep indicates the character to use to separate each component +my_date <- ymd(paste("2015", "1", "1", sep = "-")) +str(my_date) +``` + +Let's now familiarise ourselves with a typical date manipulation +pipeline. The small data frame below has dates stored in separate `year`, +`month` and `day` columns. + +```{r, purl=TRUE} +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x +``` + +Now we apply `ymd()` to the `x` dataset. We first create a +character vector from the `year`, `month`, and `day` columns of `x` +using `paste()`: + +```{r, purl=TRUE} +paste(x$year, x$month, x$day, sep = "-") +``` + +This character vector can be used as the argument for `ymd()`: + +```{r, purl=TRUE} +ymd(paste(x$year, x$month, x$day, sep = "-")) +``` + +The resulting `Date` vector can be added to `x` as a new column called `date`: + +```{r, purl=TRUE} +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'Date' as the class +``` + +Let's make sure everything worked correctly. One way to inspect the +new column is to use `summary()`: + +```{r, purl=TRUE} +summary(x$date) +``` + +Note that `ymd()` expects to have the year, month and day, in that +order. If you have for instance day, month and year, you would need +`dmy()`. + +```{r, purl=TRUE} +dmy(paste(x$day, x$month, x$year, sep = "-")) +``` + +`lubridate` has many functions to address the most common date format variations.
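As a minimal sketch of a few of these other parsers, the block below reads the same day written in two different orders, plus a date with a time of day. The input strings are made-up examples, not values from the lesson data.

```r
library("lubridate")

# Month-day-year input
d1 <- mdy("02/24/1996")
# Day-month-year input
d2 <- dmy("24-02-1996")
# Date and time together; the result is a date-time (POSIXct) in UTC
t1 <- ymd_hms("1996-02-24 13:45:00")

# Both parsers give the same Date
d1 == d2

# Components can be extracted with accessor functions
year(t1)
month(t1)
hour(t1)
```

The function name spells out the order of the components in the input, which makes the intended format explicit in the code.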
+ +## Summary of R objects + +So far, we have seen several types of R objects varying in the number +of dimensions and whether they could store a single or multiple data +types: + +- **`vector`**: one dimension (they have a length), single type of data. +- **`matrix`**: two dimensions, single type of data. +- **`data.frame`**: two dimensions, one type per column. + +## Lists + +A data type that we haven't seen yet, but that is useful to know and +follows from the summary that we have just seen, is the list: + +- **`list`**: one dimension, every item can be of a different data + type. + +Below, let's create a list containing a vector of numbers, a vector of characters, +a matrix, a dataframe and another list: + +```{r list0, purl=TRUE} +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) +str(l) +``` + +List subsetting is done using `[]` to subset a new sub-list or `[[]]` +to extract a single element of that list (using indices or names, if +the list is named). + +```{r, purl=TRUE} +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 +``` + +## Exporting and saving tabular data {#sec:exportandsave} + +We have seen how to read a text-based spreadsheet into R using the +`read.table` family of functions. To export a `data.frame` to a +text-based spreadsheet, we can use the `write.table` set of functions +(`write.csv`, `write.csv2`, ...). They all take the variable to be +exported and the file to be exported to. For example, to export the +`rna` data to the `my_rna.csv` file in the `data_output` +directory, we would execute: + +```{r, eval=FALSE, purl=TRUE} +write.csv(rna, file = "data_output/my_rna.csv") +``` + +This new csv file can now be shared with other collaborators who +aren't familiar with R.
Note that even though there are commas in some of +the fields in the `data.frame` (see for example the "product" column), R will +by default surround each field with quotes, and thus we will be able to +read it back into R correctly, despite also using commas as column +separators. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From d9dc1e418583a44e4070a9e08ba12badfe386c6c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:56 +0900 Subject: [PATCH 022/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 1044 +++++++++++++++++++++++++++++++ 1 file changed, 1044 insertions(+) create mode 100644 locale/fr/episodes/30-dplyr.Rmd diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd new file mode 100644 index 000000000..d41f82e5f --- /dev/null +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -0,0 +1,1044 @@ +--- +source: Rmd +title: Manipulating and analysing data with dplyr +teaching: 75 +exercises: 75 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe several of their functions that are extremely useful to + manipulate data. +- Describe the concept of a wide and a long table format, and see + how to reshape a data frame from one format to the other one. +- Demonstrate how to join tables. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Data analysis in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +## Data manipulation using **`dplyr`** and **`tidyr`** + +Bracket subsetting is handy, but it can be cumbersome and difficult to +read, especially for complicated operations. + +Some packages can greatly facilitate our task when we manipulate data. +Packages in R are basically sets of additional functions that let you +do more. The functions we've been using so far, like `str()` or +`data.frame()`, come built into R; loading packages can give you access to other +specific functions. Before you use a package for the first time you need to install +it on your machine, and then you should load it in every subsequent +R session when you need it. + +- The package **`dplyr`** provides powerful tools for data manipulation tasks. + It is built to work directly with data frames, with many manipulation tasks + optimised. + +- As we will see later on, sometimes we want a data frame to be reshaped to be able + to do some specific analyses or for visualisation. The package **`tidyr`** addresses + this common problem of reshaping data and provides tools for manipulating + data in a tidy way. + +To learn more about **`dplyr`** and **`tidyr`** after the workshop, +you may want to check out this handy data transformation with +**`dplyr`** cheatsheet +and this one about +**`tidyr`**. + +- The **`tidyverse`** package is an "umbrella-package" that installs + several useful packages for data analysis which work well together, + such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
+ These packages help us to work and interact with the data. + They allow us to do many things with the data, such as subsetting, transforming, + visualising, etc. + +If you did the set up, you should have already installed the tidyverse package. +Check to see if you have it by trying to load it from the library: + +```{r, message=FALSE, purl=TRUE} +## load the tidyverse packages, incl. dplyr +library("tidyverse") +``` + +If you got an error message `there is no package called ‘tidyverse’` then you have not +installed the package yet for this version of R. To install the **`tidyverse`** package type: + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("tidyverse") +``` + +If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! + +## Loading data with tidyverse + +Instead of `read.csv()`, we will read in our data using the `read_csv()` +function (notice the `_` instead of the `.`), from the tidyverse package +**`readr`**. + +```{r, message=FALSE, purl=TRUE} +rna <- read_csv("data/rnaseq.csv") + +## view the data +rna +``` + +Notice that the class of the data is now referred to as a "tibble". + +Tibbles tweak some of the behaviors of the data frame objects we introduced +previously. The data structure is very similar to a data frame. For our purposes +the only differences are that: + +1. It displays the data type of each column under its name. + Note that <`dbl`> is a data type defined to hold numeric values with + decimal points. + +2. It only prints the first few rows of data and only as many columns as fit on + one screen.
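The two differences listed above can be seen on a small example. This is a minimal sketch: `df` and `tb` are hypothetical objects created only for the comparison, and `as_tibble()` comes from the **`tibble`** package, which is installed with the tidyverse.

```r
library("tibble")

# A small base R data frame (hypothetical values)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# The same data as a tibble
tb <- as_tibble(df)

class(df)   # a plain data frame
class(tb)   # a tibble is still a data frame, with extra classes

# Printing a tibble shows its dimensions and the type of each column
tb
```

Because a tibble keeps the `data.frame` class, functions written for data frames generally accept it as well.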
+ +We are now going to learn some of the most common **`dplyr`** functions: + +- `select()`: subset columns +- `filter()`: subset rows on conditions +- `mutate()`: create new columns by using information from other columns +- `group_by()` and `summarise()`: create summary statistics on grouped data +- `arrange()`: sort results +- `count()`: count discrete values + +## Selecting columns and filtering rows + +To select columns of a data frame, use `select()`. The first argument +to this function is the data frame (`rna`), and the subsequent +arguments are the columns to keep. + +```{r, purl=TRUE} +select(rna, gene, sample, tissue, expression) +``` + +To select all columns _except_ certain ones, put a "-" in front of +the variable to exclude it. + +```{r, purl=TRUE} +select(rna, -tissue, -organism) +``` + +This will select all the variables in `rna` except `tissue` +and `organism`. + +To choose rows based on a specific criteria, use `filter()`: + +```{r, purl=TRUE} +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") +``` + +Now let's imagine we are interested in the human homologs of the mouse +genes analysed in this dataset. This information can be found in the +last column of the `rna` tibble, named +`hsapiens_homolog_associated_gene_name`. To visualise it easily, we +will create a new table containing just the 2 columns `gene` and +`hsapiens_homolog_associated_gene_name`. + +```{r} +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes +``` + +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, that determines whether +something is an `NA`. + +```{r, purl=TRUE} +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +``` + +If we want to keep only mouse genes that have a human homolog, we can +insert a "!" symbol that negates the result, so we're asking for +every row where hsapiens_homolog_associated_gene_name _is not_ an +`NA`. 
+ +```{r, purl=TRUE} +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +``` + +## Pipes + +What if you want to select and filter at the same time? There are three +ways to do this: use intermediate steps, nested functions, or pipes. + +With intermediate steps, you create a temporary data frame and use +that as input to the next function, like this: + +```{r, purl=TRUE} +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 +``` + +This is readable, but can clutter up your workspace with lots of +intermediate objects that you have to name individually. With multiple +steps, that can be hard to keep track of. + +You can also nest functions (i.e. one function inside of another), +like this: + +```{r, purl=TRUE} +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 +``` + +This is handy, but can be difficult to read if too many functions are nested, as +R evaluates the expression from the inside out (in this case, filtering, then selecting). + +The last option, _pipes_, are a recent addition to R. Pipes let you take +the output of one function and send it directly to the next, which is useful +when you need to do many things to the same dataset. + +Pipes in R look like `%>%` (made available via the **`magrittr`** +package) or `|>` (through base R, version 4.1.0 or later). If you use RStudio, you can type +the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a Mac. + +In the code below, we use the pipe to send the `rna` dataset first +through `filter()` to keep rows where `sex` is Male, then through +`select()` to keep only the `gene`, `sample`, `tissue`, and +`expression` columns.
+ +The pipe `%>%` takes the object on its left and passes it directly as +the first argument to the function on its right, we don't need to +explicitly include the data frame as an argument to the `filter()` and +`select()` functions any more. + +```{r, purl=TRUE} +rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) +``` + +Some may find it helpful to read the pipe like the word "then". For instance, +in the above example, we took the data frame `rna`, _then_ we `filter`ed +for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, +`tissue`, and `expression`. + +The **`dplyr`** functions by themselves are somewhat simple, but by +combining them into linear workflows with the pipe, we can accomplish +more complex manipulations of data frames. + +If we want to create a new object with this smaller version of the data, we +can assign it a new name: + +```{r, purl=TRUE} +rna3 <- rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) + +rna3 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using pipes, subset the `rna` data to keep observations in female mice at time 0, +where the gene has an expression higher than 50000, and retain only the columns +`gene`, `sample`, `time`, `expression` and `age`. + +::::::::::::::: solution + +## Solution + +```{r} +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Mutate + +Frequently you'll want to create new columns based on the values of existing +columns, for example to do unit conversions, or to find the ratio of values in two +columns. For this we'll use `mutate()`. 
+ +To create a new column of time in hours: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24) %>% + select(time, time_hours) +``` + +You can also create a second new column based on the first new column within the same call of `mutate()`: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Create a new data frame from the `rna` data that meets the following +criteria: contains only the `gene`, `chromosome_name`, +`phenotype_description`, `sample`, and `expression` columns. The expression +values should be log-transformed. This data frame must +only contain genes located on sex chromosomes, associated with a +phenotype_description, and with a log expression higher than 5. + +**Hint**: think about how the commands should be ordered to produce +this data frame! + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Split-apply-combine data analysis + +Many data analysis tasks can be approached using the +_split-apply-combine_ paradigm: split the data into groups, apply some +analysis to each group, and then combine the results. **`dplyr`** +makes this very easy through the use of the `group_by()` function. + +```{r} +rna %>% + group_by(gene) +``` + +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
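To make the point that `group_by()` only records the grouping, here is a minimal sketch on a hypothetical `toy` table (made-up genes and values, not the lesson data). The rows are unchanged; only the grouping metadata is added.

```r
library("dplyr")

# `toy` is a hypothetical table used only for this illustration
toy <- data.frame(gene = c("A", "A", "B", "B", "B"),
                  expression = c(10, 12, 5, 7, 9))

grouped <- toy %>%
  group_by(gene)

# Number of groups and the value defining each group
n_groups(grouped)
group_keys(grouped)

# No rows were added, removed or modified by the grouping
nrow(grouped) == nrow(toy)
```

The grouping can be removed again with `ungroup()` when later steps should operate on the whole table.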
+ +We could similarly decide to group the tibble by the samples: + +```{r} +rna %>% + group_by(sample) +``` + +Here our initial `tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$sample))` groups based on the `sample` variable. + +Once the data has been grouped, subsequent operations will be +applied on each group independently. + +### The `summarise()` function + +`group_by()` is often used together with `summarise()`, which +collapses each group into a single-row summary of that group. + +`group_by()` takes as arguments the column names that contain the +**categorical** variables for which you want to calculate the summary +statistics. So to compute the mean `expression` by gene: + +```{r} +rna %>% + group_by(gene) %>% + summarise(mean_expression = mean(expression)) +``` + +We might also want to calculate the mean expression levels of all genes in each sample: + +```{r} +rna %>% + group_by(sample) %>% + summarise(mean_expression = mean(expression)) +``` + +But we can also group by multiple columns: + +```{r} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) +``` + +Once the data is grouped, you can also summarise multiple variables at the same +time (and not necessarily on the same variable). For instance, we could add a +column indicating the median `expression` by gene and by condition: + +```{r, purl=TRUE} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Calculate the mean expression level of gene "Dok3" for each time point.
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rna %>% + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Counting + +When working with data, we often want to know the number of observations found +for each factor or combination of factors. For this task, **`dplyr`** provides +`count()`. For example, if we wanted to count the number of rows of data for +each infected and non-infected samples, we would do: + +```{r, purl=TRUE} +rna %>% + count(infection) +``` + +The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: + +```{r, purl=TRUE} +rna %>% + group_by(infection) %>% + summarise(n = n()) +``` + +The previous example shows the use of `count()` to count the number of rows/observations +for _one_ factor (i.e., `infection`). +If we wanted to count a _combination of factors_, such as `infection` and `time`, +we would specify the first and the second factor as the arguments of `count()`: + +```{r, purl=TRUE} +rna %>% + count(infection, time) +``` + +which is equivalent to this: + +```{r, purl=TRUE} +rna %>% + group_by(infection, time) %>% + summarise(n = n()) +``` + +It is sometimes useful to sort the result to facilitate the comparisons. +We can use `arrange()` to sort the table. +For instance, we might want to arrange the table above by time: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(time) +``` + +or by counts: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(n) +``` + +To sort in descending order, we need to add the `desc()` function: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(desc(n)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. 
How many genes were analysed in each sample? +2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +3. Pick one sample and evaluate the number of genes by biotype. +4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. + +::::::::::::::: solution + +## Solution + +```{r} +## 1. +rna %>% + count(sample) +## 2. +rna %>% + group_by(sample) %>% + summarise(seq_depth = sum(expression)) %>% + arrange(desc(seq_depth)) +## 3. +rna %>% + filter(sample == "GSM2545336") %>% + count(gene_biotype) %>% + arrange(desc(n)) +## 4. +rna %>% + filter(phenotype_description == "abnormal DNA methylation") %>% + group_by(gene, time) %>% + summarise(mean_expression = mean(log(expression))) %>% + arrange() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Reshaping data + +In the `rna` tibble, the rows contain expression values (the unit) that are +associated with a combination of 2 other variables: `gene` and `sample`. + +All the other columns correspond to variables describing either +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +The variables that don't change with genes or with samples will have the same value in all the rows. + +```{r} +rna %>% + arrange(gene) +``` + +This structure is called a `long-format`, as one column contains all the values, +and other column(s) list(s) the context of the value. + +In certain cases, the `long-format` is not really "human-readable", and another format, +a `wide-format`, is preferred as a more compact way of representing the data. +This is typically the case with gene expression values, which scientists are used to +viewing as matrices, where rows represent genes and columns represent samples.
+ +In this format, it would therefore become straightforward +to explore the relationship between the gene expression levels within, and +between, the samples. + +```{r, echo=FALSE} +rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) +``` + +To convert the gene expression values from `rna` into a wide-format, +we need to create a new table where the values of the `sample` column would +become the names of column variables. + +The key point here is that we are still following +a tidy data structure, but we have **reshaped** the data according to +the observations of interest: expression levels per gene instead +of recording them per gene and per sample. + +The opposite transformation would be to transform column names into +values of a new variable. + +We can do both of these transformations with two `tidyr` functions, +`pivot_longer()` and `pivot_wider()` (see +[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +details). + +### Pivoting the data into a wider format + +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. + +```{r, purl=TRUE} +rna_exp <- rna %>% + select(gene, sample, expression) +rna_exp +``` + +`pivot_wider()` takes three main arguments: + +1. the data to be transformed; +2. the `names_from`: the column whose values will become new column + names; +3. the `values_from`: the column whose values will fill the new + columns. + +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_wider.png") +``` + +```{r, purl=TRUE} +rna_wide <- rna_exp %>% + pivot_wider(names_from = sample, + values_from = expression) +rna_wide +``` + +Note that by default, the `pivot_wider()` function will add `NA` for missing values. + +Let's imagine that for some reason, we had some missing expression values for some +genes in certain samples.
In the following fictitious example, the gene Cyp2d22 has only +one expression value, in the GSM2545338 sample. + +```{r, purl=TRUE} +rna_with_missing_values <- rna %>% + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) +rna_with_missing_values +``` + +By default, the `pivot_wider()` function will add `NA` for missing +values. This can be parameterised with the `values_fill` argument of +the `pivot_wider()` function. + +```{r, purl=TRUE} +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) + +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression, + values_fill = 0) +``` + +### Pivoting data into a longer format + +In the opposite situation we are using the column names and turning them into +a pair of new variables. One variable represents the column names as +values, and the other variable contains the values previously +associated with the column names. + +`pivot_longer()` takes four main arguments: + +1. the data to be transformed; +2. the `names_to`: the new column name we wish to create and populate with the + current column names; +3. the `values_to`: the new column name we wish to create and populate with + current values; +4. the names of the columns to be used to populate the `names_to` and + `values_to` variables (or to drop). + +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_longer.png") +``` + +To recreate `rna_long` from `rna_wide` we would create a key +called `sample` and a value called `expression`, and use all columns +except `gene` for the key variable. Here we exclude the `gene` column +with a minus sign. + +Notice how the new variable names are to be quoted here.
+ +```{r} +rna_long <- rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +rna_long +``` + +We could also have used a specification for what columns to +include. This can be useful if you have a large number of identifying +columns, and it's easier to specify what to gather than what to leave +alone. Here the `starts_with()` function can help to retrieve sample +names without having to list them all! +Another possibility would be to use the `:` operator! + +```{r} +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + cols = starts_with("GSM")) +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + GSM2545336:GSM2545380) +``` + +Note that if we had missing values in the wide-format, the `NA` would be +included in the new long format. + +Remember our previous fictive tibble containing missing values: + +```{r} +rna_with_missing_values + +wide_with_NA <- rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) +wide_with_NA + +wide_with_NA %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +``` + +Pivoting to wider and longer formats can be a useful way to balance out a dataset +so every replicate has the same composition. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Starting from the rna table, use the `pivot_wider()` function to create +a wide-format table giving the gene expression levels in each mouse. +Then use the `pivot_longer()` function to restore a long-format table.
+ +::::::::::::::: solution + +## Solution + +```{r, answer=TRUE, purl=TRUE} +rna1 <- rna %>% +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) +rna1 + +rna1 %>% +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Subset genes located on X and Y chromosomes from the `rna` data frame and +spread the data frame with `sex` as columns, `chromosome_name` as +rows, and the mean expression of genes located in each chromosome as the values, +as in the following tibble: + +```{r, echo=FALSE, message=FALSE} +knitr::include_graphics("fig/Exercise_pivot_W.png") +``` + +You will need to summarise before reshaping! + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression level of X and Y linked genes from +male and female samples... + +```{r} + rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) +``` + +And pivot the table to wide format + +```{r, answer=TRUE, purl=TRUE} +rna_1 <- rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) + +rna_1 +``` + +Now take that data frame and transform it with `pivot_longer()` so +each row is a unique `chromosome_name` by `gender` combination. 
+ +```{r, answer=TRUE, purl=TRUE} +rna_1 %>% + pivot_longer(names_to = "gender", + values_to = "mean", + -chromosome_name) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the `rna` dataset to create an expression matrix where each row +represents the mean expression levels of genes and columns represent +the different timepoints. + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression by gene and by time + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) +``` + +before using the pivot_wider() function + +```{r} +rna_time <- rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) +rna_time +``` + +Notice that this generates a tibble with some column names starting by a number. +If we wanted to select the column corresponding to the timepoints, +we could not use the column names directly... What happens when we select the column 4? 
+ +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, 4) +``` + +To select the timepoint 4, we would have to quote the column name, with backticks "\`" + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, `4`) +``` + +Another possibility would be to rename the column, +choosing a name that doesn't start by a number : + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% + select(gene, time4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the previous data frame containing mean expression levels per timepoint and create +a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes +between timepoint 8 and timepoint 4. +Convert this table into a long-format table gathering the fold-changes calculated. + +::::::::::::::: solution + +## Solution + +Starting from the rna_time tibble: + +```{r} +rna_time +``` + +Calculate fold-changes: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) +``` + +And use the pivot_longer() function: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", + values_to = "Fold_changes", + time_8_vs_0:time_8_vs_4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Joining tables + +In many real life situations, data are spread across multiple tables. +Usually this occurs because different types of information are +collected from different sources. 
+ +It may be desirable for some analyses to combine data from two or more +tables into a single data frame based on a column that would be common +to all the tables. + +The `dplyr` package provides a set of join functions for combining two +data frames based on matches within specified columns. Here, we +provide a short introduction to joins. For further reading, please +refer to the chapter about table +joins. The +Data Transformation Cheat +Sheet +also provides a short overview of table joins. + +We are going to illustrate joins using a small table, `rna_mini`, that +we will create by subsetting the original `rna` table, keeping only 3 +columns and 10 rows. + +```{r} +rna_mini <- rna %>% + select(gene, sample, expression) %>% + head(10) +rna_mini +``` + +The second table, `annot1`, contains 2 columns, gene and +gene_description. You can either +[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +by clicking on the link and then moving it to the `data/` folder, or +you can use the R code below to download it directly to the folder. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") +annot1 +``` + +We now want to join these two tables into a single one containing all +variables using the `full_join()` function from the `dplyr` package. The +function will automatically find the common variable to match columns +from the first and second table. In this case, `gene` is the common +variable. Such variables are called keys. Keys are used to match +observations across different tables. + +```{r} +full_join(rna_mini, annot1) +``` + +In real life, gene annotations are sometimes labelled differently. + +The `annot2` table is exactly the same as `annot1` except that the +variable containing gene names is labelled differently.
Again, either +[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +yourself and move it to `data/` or use the R code below. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") +annot2 +``` + +In case none of the variable names match, we can manually set the +variables to use for the matching. These variables can be set using +the `by` argument, as shown below with the `rna_mini` and `annot2` tables. + +```{r} +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +``` + +As can be seen above, the variable name of the first table is retained +in the joined one. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Download the `annot3` table by clicking +[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +and put the table in your `data/` folder. Using the `full_join()` +function, join tables `rna_mini` and `annot3`. What has happened for +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_? + +::::::::::::::: solution + +## Solution + +```{r, message=FALSE} +annot3 <- read_csv("data/annot3.csv") +full_join(rna_mini, annot3) +``` + +Gene _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_, and _mt-Tl1_ are only present in the `annot3` table. Their respective values for the +variables of the table have been encoded as missing. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exporting data + +Now that you have learned how to use `dplyr` to extract information from +or summarise your raw data, you may want to export these new data sets to share +them with your collaborators or for archival. + +Similar to the `read_csv()` function used for reading CSV files into R, there is +a `write_csv()` function that generates CSV files from data frames.
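The pairing of `read_csv()` and `write_csv()` can be tried out on a small table before touching the lesson data. The following is only a minimal sketch: the `toy` data frame, its values and the temporary file are made up for illustration and are not part of the `rna` dataset.

```r
library("readr")

## A tiny, made-up table (hypothetical values, not taken from `rna`)
toy <- data.frame(gene = c("geneA", "geneB"),
                  expression = c(10, 20))

## Write it to a temporary CSV file, so nothing in `data/` is touched
tmp <- tempfile(fileext = ".csv")
write_csv(toy, file = tmp)

## Read the file back in; the result is a tibble with the same columns
toy_back <- read_csv(tmp)
toy_back
```

Using a temporary file keeps the sketch from writing anything into the project folders.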
+ +Before using `write_csv()`, we are going to create a new folder, `data_output`, +in our working directory that will store this generated dataset. We don't want +to write generated datasets in the same directory as our raw data. +It's good practice to keep them separate. The `data` folder should only contain +the raw, unaltered data, and should be left alone to make sure we don't delete +or modify it. In contrast, our script will generate the contents of the `data_output` +directory, so even if the files it contains are deleted, we can always +re-generate them. + +Let's use `write_csv()` to save the rna_wide table that we have created previously. + +```{r, purl=TRUE, eval=FALSE} +write_csv(rna_wide, file = "data_output/rna_wide.csv") +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 48a2a8eb149bc1b756677fdbd5c36c87a111fc20 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:10:58 +0900 Subject: [PATCH 023/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 1044 +++++++++++++++++++++++++++++++ 1 file changed, 1044 insertions(+) create mode 100644 locale/es/episodes/30-dplyr.Rmd diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd new file mode 100644 index 000000000..fd4b2b14f --- /dev/null +++ b/locale/es/episodes/30-dplyr.Rmd @@ -0,0 +1,1044 @@ +--- +source: Rmd +title: Manipulating and analysing data with dplyr +teaching: 75 +exercises: 75 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe several of their functions that are extremely useful to + manipulate data. +- Describe the concept of a wide and a long table format, and see + how to reshape a data frame from one format to the other one. 
+ +- Demonstrate how to join tables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Data analysis in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +## Data manipulation using **`dplyr`** and **`tidyr`** + +Bracket subsetting is handy, but it can be cumbersome and difficult to +read, especially for complicated operations. + +Some packages can greatly facilitate our task when we manipulate data. +Packages in R are basically sets of additional functions that let you +do more stuff. The functions we've been using so far, like `str()` or +`data.frame()`, come built into R; loading packages can give you access to other +specific functions. Before you use a package for the first time you need to install +it on your machine, and then you should import it in every subsequent +R session when you need it. + +- The package **`dplyr`** provides powerful tools for data manipulation tasks. + It is built to work directly with data frames, with many manipulation tasks + optimised. + +- As we will see later on, sometimes we want a data frame to be reshaped to be able + to do some specific analyses or for visualisation. The package **`tidyr`** addresses + this common problem of reshaping data and provides tools for manipulating + data in a tidy way. + +To learn more about **`dplyr`** and **`tidyr`** after the workshop, +you may want to check out this handy data transformation with + +and this one about +. + +- The **`tidyverse`** package is an "umbrella-package" that installs + several useful packages for data analysis which work well together, + such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
These packages help us to work and interact with the data. + They allow us to do many things with our data, such as subsetting, transforming, + visualising, etc. + +If you did the setup, you should have already installed the tidyverse package. +Check to see if you have it by trying to load it from the library: + +```{r, message=FALSE, purl=TRUE} +## load the tidyverse packages, incl. dplyr +library("tidyverse") +``` + +If you got an error message `there is no package called ‘tidyverse’` then you have not +installed the package yet for this version of R. To install the **`tidyverse`** package type: + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("tidyverse") +``` + +If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! + +## Loading data with tidyverse + +Instead of `read.csv()`, we will read in our data using the `read_csv()` +function (notice the `_` instead of the `.`), from the tidyverse package +**`readr`**. + +```{r, message=FALSE, purl=TRUE} +rna <- read_csv("data/rnaseq.csv") + +## view the data +rna +``` + +Notice that the class of the data is now referred to as a "tibble". + +Tibbles tweak some of the behaviors of the data frame objects we introduced +previously. The data structure is very similar to a data frame. For our purposes +the only differences are that: + +1. It displays the data type of each column under its name. + Note that <`dbl`> is a data type defined to hold numeric values with + decimal points. + +2. It only prints the first few rows of data and only as many columns as fit on + one screen.
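These two differences can be seen on a small table. The following is a minimal sketch, assuming only that the **`tibble`** package is installed (it comes with the tidyverse); the gene names and values are made up for illustration and are not taken from `rna`.

```r
library("tibble")

## A tiny, made-up table (hypothetical values, not taken from `rna`)
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expression = c(10.5, 20, 30))

## Convert the base R data frame into a tibble
tb <- as_tibble(df)

## A tibble is still a data frame, with additional classes on top
class(df)
class(tb)

## Printing a tibble shows its dimensions and the type of each column
tb
```

Because a tibble keeps the `data.frame` class, functions written for data frames generally accept it as well.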
+ +We are now going to learn some of the most common **`dplyr`** functions: + +- `select()`: subset columns +- `filter()`: subset rows on conditions +- `mutate()`: create new columns by using information from other columns +- `group_by()` and `summarise()`: create summary statistics on grouped data +- `arrange()`: sort results +- `count()`: count discrete values + +## Selecting columns and filtering rows + +To select columns of a data frame, use `select()`. The first argument +to this function is the data frame (`rna`), and the subsequent +arguments are the columns to keep. + +```{r, purl=TRUE} +select(rna, gene, sample, tissue, expression) +``` + +To select all columns _except_ certain ones, put a "-" in front of +the variable to exclude it. + +```{r, purl=TRUE} +select(rna, -tissue, -organism) +``` + +This will select all the variables in `rna` except `tissue` +and `organism`. + +To choose rows based on a specific criteria, use `filter()`: + +```{r, purl=TRUE} +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") +``` + +Now let's imagine we are interested in the human homologs of the mouse +genes analysed in this dataset. This information can be found in the +last column of the `rna` tibble, named +`hsapiens_homolog_associated_gene_name`. To visualise it easily, we +will create a new table containing just the 2 columns `gene` and +`hsapiens_homolog_associated_gene_name`. + +```{r} +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes +``` + +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, that determines whether +something is an `NA`. + +```{r, purl=TRUE} +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +``` + +If we want to keep only mouse genes that have a human homolog, we can +insert a "!" symbol that negates the result, so we're asking for +every row where hsapiens_homolog_associated_gene_name _is not_ an +`NA`. 
+ +```{r, purl=TRUE} +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +``` + +## Pipes + +What if you want to select and filter at the same time? There are three +ways to do this: use intermediate steps, nested functions, or pipes. + +With intermediate steps, you create a temporary data frame and use +that as input to the next function, like this: + +```{r, purl=TRUE} +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 +``` + +This is readable, but can clutter up your workspace with lots of +intermediate objects that you have to name individually. With multiple +steps, that can be hard to keep track of. + +You can also nest functions (i.e. one function inside of another), +like this: + +```{r, purl=TRUE} +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 +``` + +This is handy, but can be difficult to read if too many functions are nested, as +R evaluates the expression from the inside out (in this case, filtering, then selecting). + +The last option, _pipes_, is a recent addition to R. Pipes let you take +the output of one function and send it directly to the next, which is useful +when you need to do many things to the same dataset. + +Pipes in R look like `%>%` (made available via the **`magrittr`** +package) or `|>` (through base R). If you use RStudio, you can type +the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a Mac. + +In the code below, we use the pipe to send the `rna` dataset first +through `filter()` to keep rows where `sex` is Male, then through +`select()` to keep only the `gene`, `sample`, `tissue`, and +`expression` columns.
+ +The pipe `%>%` takes the object on its left and passes it directly as +the first argument to the function on its right, so we don't need to +explicitly include the data frame as an argument to the `filter()` and +`select()` functions any more. + +```{r, purl=TRUE} +rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) +``` + +Some may find it helpful to read the pipe like the word "then". For instance, +in the above example, we took the data frame `rna`, _then_ we `filter`ed +for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, +`tissue`, and `expression`. + +The **`dplyr`** functions by themselves are somewhat simple, but by +combining them into linear workflows with the pipe, we can accomplish +more complex manipulations of data frames. + +If we want to create a new object with this smaller version of the data, we +can assign it a new name: + +```{r, purl=TRUE} +rna3 <- rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) + +rna3 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using pipes, subset the `rna` data to keep observations in female mice at time 0, +where the gene has an expression higher than 50000, and retain only the columns +`gene`, `sample`, `time`, `expression` and `age`. + +::::::::::::::: solution + +## Solution + +```{r} +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Mutate + +Frequently you'll want to create new columns based on the values of existing +columns, for example to do unit conversions, or to find the ratio of values in two +columns. For this we'll use `mutate()`.
+ +To create a new column of time in hours: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24) %>% + select(time, time_hours) +``` + +You can also create a second new column based on the first new column within the same call of `mutate()`: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Create a new data frame from the `rna` data that meets the following +criteria: contains only the `gene`, `chromosome_name`, +`phenotype_description`, `sample`, and `expression` columns. The expression +values should be log-transformed. This data frame must +only contain genes located on sex chromosomes, associated with a +phenotype_description, and with a log expression higher than 5. + +**Hint**: think about how the commands should be ordered to produce +this data frame! + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Split-apply-combine data analysis + +Many data analysis tasks can be approached using the +_split-apply-combine_ paradigm: split the data into groups, apply some +analysis to each group, and then combine the results. **`dplyr`** +makes this very easy through the use of the `group_by()` function. + +```{r} +rna %>% + group_by(gene) +``` + +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
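That `group_by()` only records the grouping, without changing the rows, can be checked on a small table. This is a minimal sketch under stated assumptions: the `toy` tibble and its values are hypothetical and are not taken from `rna`.

```r
library("dplyr")

## A tiny, made-up table (hypothetical values, not taken from `rna`)
toy <- tibble(gene = c("geneA", "geneA", "geneB", "geneB"),
              sample = c("s1", "s2", "s1", "s2"),
              expression = c(10, 20, 30, 40))

grouped <- toy %>%
  group_by(gene)

## The rows are unchanged; only grouping information has been added
nrow(grouped)

## Number of groups, and the values that define them
n_groups(grouped)
group_keys(grouped)

## ungroup() removes the grouping again
ungroup(grouped)
```

Calling `ungroup()` once a grouped computation is finished avoids later operations being applied per group by accident.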
+ +We could similarly decide to group the tibble by the samples: + +```{r} +rna %>% + group_by(sample) +``` + +Here our initial `tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$sample))` groups based on the `sample` variable. + +Once the data has been grouped, subsequent operations will be +applied on each group independently. + +### The `summarise()` function + +`group_by()` is often used together with `summarise()`, which +collapses each group into a single-row summary of that group. + +`group_by()` takes as arguments the column names that contain the +**categorical** variables for which you want to calculate the summary +statistics. So to compute the mean `expression` by gene: + +```{r} +rna %>% + group_by(gene) %>% + summarise(mean_expression = mean(expression)) +``` + +We might also want to calculate the mean expression levels of all genes in each sample: + +```{r} +rna %>% + group_by(sample) %>% + summarise(mean_expression = mean(expression)) +``` + +But we can also group by multiple columns: + +```{r} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) +``` + +Once the data is grouped, you can also summarise multiple variables at the same +time (and not necessarily on the same variable). For instance, we could add a +column indicating the median `expression` by gene and by condition: + +```{r, purl=TRUE} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Calculate the mean expression level of gene "Dok3" by timepoints.
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rna %>% + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Counting + +When working with data, we often want to know the number of observations found +for each factor or combination of factors. For this task, **`dplyr`** provides +`count()`. For example, if we wanted to count the number of rows of data for +each infected and non-infected samples, we would do: + +```{r, purl=TRUE} +rna %>% + count(infection) +``` + +The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: + +```{r, purl=TRUE} +rna %>% + group_by(infection) %>% + summarise(n = n()) +``` + +The previous example shows the use of `count()` to count the number of rows/observations +for _one_ factor (i.e., `infection`). +If we wanted to count a _combination of factors_, such as `infection` and `time`, +we would specify the first and the second factor as the arguments of `count()`: + +```{r, purl=TRUE} +rna %>% + count(infection, time) +``` + +which is equivalent to this: + +```{r, purl=TRUE} +rna %>% + group_by(infection, time) %>% + summarise(n = n()) +``` + +It is sometimes useful to sort the result to facilitate the comparisons. +We can use `arrange()` to sort the table. +For instance, we might want to arrange the table above by time: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(time) +``` + +or by counts: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(n) +``` + +To sort in descending order, we need to add the `desc()` function: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(desc(n)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. 
How many genes were analysed in each sample? +2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +3. Pick one sample and evaluate the number of genes by biotype. +4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. + +::::::::::::::: solution + +## Solution + +```{r} +## 1. +rna %>% + count(sample) +## 2. +rna %>% + group_by(sample) %>% + summarise(seq_depth = sum(expression)) %>% + arrange(desc(seq_depth)) +## 3. +rna %>% + filter(sample == "GSM2545336") %>% + count(gene_biotype) %>% + arrange(desc(n)) +## 4. +rna %>% + filter(phenotype_description == "abnormal DNA methylation") %>% + group_by(gene, time) %>% + summarise(mean_expression = mean(log(expression))) %>% + arrange() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Reshaping data + +In the `rna` tibble, the rows contain expression values (the unit) that are +associated with a combination of 2 other variables: `gene` and `sample`. + +All the other columns correspond to variables describing either +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +The variables that don't change with genes or with samples will have the same value in all the rows. + +```{r} +rna %>% + arrange(gene) +``` + +This structure is called a `long-format`, as one column contains all the values, +and other column(s) list(s) the context of the value. + +In certain cases, the `long-format` is not really "human-readable", and another format, +a `wide-format`, is preferred, as a more compact way of representing the data. +This is typically the case with gene expression values, which scientists are used to +viewing as matrices, where rows represent genes and columns represent samples. 
+ +In this format, it would therefore become straightforward +to explore the relationship between the gene expression levels within, and +between, the samples. + +```{r, echo=FALSE} +rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) +``` + +To convert the gene expression values from `rna` into a wide-format, +we need to create a new table where the values of the `sample` column would +become the names of column variables. + +The key point here is that we are still following +a tidy data structure, but we have **reshaped** the data according to +the observations of interest: expression levels per gene instead +of recording them per gene and per sample. + +The opposite transformation would be to transform column names into +values of a new variable. + +We can do both of these transformations with two `tidyr` functions, +`pivot_longer()` and `pivot_wider()` (see +[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +details). + +### Pivoting the data into a wider format + +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. + +```{r, purl=TRUE} +rna_exp <- rna %>% + select(gene, sample, expression) +rna_exp +``` + +`pivot_wider()` takes three main arguments: + +1. the data to be transformed; +2. `names_from`: the column whose values will become new column + names; +3. `values_from`: the column whose values will fill the new + columns. + +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_wider.png") + +``` + +```{r, purl=TRUE} +rna_wide <- rna_exp %>% + pivot_wider(names_from = sample, + values_from = expression) +rna_wide +``` + +Note that by default, the `pivot_wider()` function will add `NA` for missing values. + +Let's imagine that for some reason, we had some missing expression values for some +genes in certain samples. 
In the following fictitious example, the gene Cyp2d22 has only +one expression value, in sample GSM2545338. + +```{r, purl=TRUE} +rna_with_missing_values <- rna %>% + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) +rna_with_missing_values +``` + +By default, the `pivot_wider()` function will add `NA` for missing +values. This can be parameterised with the `values_fill` argument of +the `pivot_wider()` function. + +```{r, purl=TRUE} +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) + +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression, + values_fill = 0) +``` + +### Pivoting data into a longer format + +In the opposite situation we are using the column names and turning them into +a pair of new variables. One variable represents the column names as +values, and the other variable contains the values previously +associated with the column names. + +`pivot_longer()` takes four main arguments: + +1. the data to be transformed; +2. `names_to`: the new column name we wish to create and populate with the + current column names; +3. `values_to`: the new column name we wish to create and populate with + current values; +4. the names of the columns to be used to populate the `names_to` and + `values_to` variables (or to drop). + +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_longer.png") + +``` + +To recreate `rna_long` from `rna_wide` we would create a key +called `sample` and value called `expression` and use all columns +except `gene` for the key variable. Here we drop the `gene` column +with a minus sign. + +Notice how the new variable names are to be quoted here. 
+ +```{r} +rna_long <- rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +rna_long +``` + +We could also have used a specification for what columns to +include. This can be useful if you have a large number of identifying +columns, and it's easier to specify what to gather than what to leave +alone. Here the `starts_with()` function can help to retrieve sample +names without having to list them all! +Another possibility would be to use the `:` operator! + +```{r} +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + cols = starts_with("GSM")) +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + GSM2545336:GSM2545380) +``` + +Note that if we had missing values in the wide-format, the `NA` would be +included in the new long format. + +Remember our previous fictitious tibble containing missing values: + +```{r} +rna_with_missing_values + +wide_with_NA <- rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) +wide_with_NA + +wide_with_NA %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +``` + +Pivoting to wider and longer formats can be a useful way to balance out a dataset +so every replicate has the same composition. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Starting from the `rna` table, use the `pivot_wider()` function to create +a wide-format table giving the gene expression levels in each mouse. +Then use the `pivot_longer()` function to restore a long-format table. 
+ +::::::::::::::: solution + +## Solution + +```{r, answer=TRUE, purl=TRUE} +rna1 <- rna %>% +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) +rna1 + +rna1 %>% +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Subset genes located on X and Y chromosomes from the `rna` data frame and +spread the data frame with `sex` as columns, `chromosome_name` as +rows, and the mean expression of genes located in each chromosome as the values, +as in the following tibble: + +```{r, echo=FALSE, message=FALSE} +knitr::include_graphics("fig/Exercise_pivot_W.png") +``` + +You will need to summarise before reshaping! + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression level of X and Y linked genes from +male and female samples... + +```{r} + rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) +``` + +And pivot the table to wide format + +```{r, answer=TRUE, purl=TRUE} +rna_1 <- rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) + +rna_1 +``` + +Now take that data frame and transform it with `pivot_longer()` so +each row is a unique `chromosome_name` by `gender` combination. 
+ +```{r, answer=TRUE, purl=TRUE} +rna_1 %>% + pivot_longer(names_to = "gender", + values_to = "mean", + -chromosome_name) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the `rna` dataset to create an expression matrix where each row +holds the mean expression levels of one gene and the columns represent +the different timepoints. + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression by gene and by time + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) +``` + +before using the `pivot_wider()` function + +```{r} +rna_time <- rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) +rna_time +``` + +Notice that this generates a tibble with some column names starting with a number. +If we wanted to select the column corresponding to the timepoints, +we could not use the column names directly... What happens when we select column 4? 
+ +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, 4) +``` + +To select the timepoint 4, we would have to quote the column name, with backticks "\`" + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, `4`) +``` + +Another possibility would be to rename the column, +choosing a name that doesn't start by a number : + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% + select(gene, time4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the previous data frame containing mean expression levels per timepoint and create +a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes +between timepoint 8 and timepoint 4. +Convert this table into a long-format table gathering the fold-changes calculated. + +::::::::::::::: solution + +## Solution + +Starting from the rna_time tibble: + +```{r} +rna_time +``` + +Calculate fold-changes: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) +``` + +And use the pivot_longer() function: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", + values_to = "Fold_changes", + time_8_vs_0:time_8_vs_4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Joining tables + +In many real life situations, data are spread across multiple tables. +Usually this occurs because different types of information are +collected from different sources. 
+ +It may be desirable for some analyses to combine data from two or more +tables into a single data frame based on a column that would be common +to all the tables. + +The `dplyr` package provides a set of join functions for combining two +data frames based on matches within specified columns. Here, we +provide a short introduction to joins. For further reading, please +refer to the chapter about table +joins. The +Data Transformation Cheat +Sheet +also provides a short overview of table joins. + +We are going to illustrate joins using a small table, `rna_mini`, that +we will create by subsetting the original `rna` table, keeping only 3 +columns and 10 rows. + +```{r} +rna_mini <- rna %>% + select(gene, sample, expression) %>% + head(10) +rna_mini +``` + +The second table, `annot1`, contains 2 columns, `gene` and +`gene_description`. You can either +[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +by clicking on the link and then moving it to the `data/` folder, or +you can use the R code below to download it directly to the folder. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") +annot1 +``` + +We now want to join these two tables into a single one containing all +variables using the `full_join()` function from the `dplyr` package. The +function will automatically find the common variable to match columns +from the first and second table. In this case, `gene` is the common +variable. Such variables are called keys. Keys are used to match +observations across different tables. + +```{r} +full_join(rna_mini, annot1) +``` + +In real life, gene annotations are sometimes labelled differently. + +The `annot2` table is exactly the same as `annot1` except that the +variable containing gene names is labelled differently. 
Again, either +[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +yourself and move it to `data/` or use the R code below. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") +annot2 +``` + +In case none of the variable names match, we can manually set the +variables to use for the matching. These variables can be set using +the `by` argument, as shown below with the `rna_mini` and `annot2` tables. + +```{r} +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +``` + +As can be seen above, the variable name of the first table is retained +in the joined one. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Download the `annot3` table by clicking +[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +and put the table in your `data/` folder. Using the `full_join()` +function, join tables `rna_mini` and `annot3`. What has happened to +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_? + +::::::::::::::: solution + +## Solution + +```{r, message=FALSE} +annot3 <- read_csv("data/annot3.csv") +full_join(rna_mini, annot3) +``` + +Gene _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_, and _mt-Tl1_ are only present in the `annot3` table. Their values for the +variables of the other table have been encoded as missing. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exporting data + +Now that you have learned how to use `dplyr` to extract information from +or summarise your raw data, you may want to export these new data sets to share +them with your collaborators or for archival. + +Similar to the `read_csv()` function used for reading CSV files into R, there is +a `write_csv()` function that generates CSV files from data frames. 
+ +Before using `write_csv()`, we are going to create a new folder, `data_output`, +in our working directory that will store this generated dataset. We don't want +to write generated datasets in the same directory as our raw data. +It's good practice to keep them separate. The `data` folder should only contain +the raw, unaltered data, and should be left alone to make sure we don't delete +or modify it. In contrast, our script will generate the contents of the `data_output` +directory, so even if the files it contains are deleted, we can always +re-generate them. + +Let's use `write_csv()` to save the `rna_wide` table that we have created previously. + +```{r, purl=TRUE, eval=FALSE} +write_csv(rna_wide, file = "data_output/rna_wide.csv") +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: From a6c7cff5f8fea577441beca9ee843c0120cb02fb Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:00 +0900 Subject: [PATCH 024/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 1044 +++++++++++++++++++++++++++++++ 1 file changed, 1044 insertions(+) create mode 100644 locale/ja/episodes/30-dplyr.Rmd diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd new file mode 100644 index 000000000..d26d30d12 --- /dev/null +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -0,0 +1,1044 @@ +--- +source: Rmd +title: Manipulating and analysing data with dplyr +teaching: 75 +exercises: 75 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe several of their functions that are extremely useful to + manipulate data. +- Describe the concept of a wide and a long table format, and see + how to reshape a data frame from one format to the other one. +- Demonstrate how to join tables. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Data analysis in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +## Data manipulation using **`dplyr`** and **`tidyr`** + +Bracket subsetting is handy, but it can be cumbersome and difficult to +read, especially for complicated operations. + +Some packages can greatly facilitate our task when we manipulate data. +Packages in R are basically sets of additional functions that let you +do more stuff. The functions we've been using so far, like `str()` or +`data.frame()`, come built into R; loading packages can give you access to other +specific functions. Before you use a package for the first time you need to install +it on your machine, and then you should import it in every subsequent +R session when you need it. + +- The package **`dplyr`** provides powerful tools for data manipulation tasks. + It is built to work directly with data frames, with many manipulation tasks + optimised. + +- As we will see later on, sometimes we want a data frame to be reshaped to be able + to do some specific analyses or for visualisation. The package **`tidyr`** addresses + this common problem of reshaping data and provides tools for manipulating + data in a tidy way. + +To learn more about **`dplyr`** and **`tidyr`** after the workshop, +you may want to check out this handy data transformation with + +and this one about +. + +- The **`tidyverse`** package is an "umbrella-package" that installs + several useful packages for data analysis which work well together, + such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. 
+ These packages help us to work and interact with the data. + They allow us to do many things with our data, such as subsetting, transforming, + visualising, etc. + +If you completed the setup, you should have already installed the tidyverse package. +Check to see if you have it by trying to load it from the library: + +```{r, message=FALSE, purl=TRUE} +## load the tidyverse packages, incl. dplyr +library("tidyverse") +``` + +If you got an error message `there is no package called ‘tidyverse’` then you have not +installed the package yet for this version of R. To install the **`tidyverse`** package type: + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("tidyverse") +``` + +If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! + +## Loading data with tidyverse + +Instead of `read.csv()`, we will read in our data using the `read_csv()` +function (notice the `_` instead of the `.`), from the tidyverse package +**`readr`**. + +```{r, message=FALSE, purl=TRUE} +rna <- read_csv("data/rnaseq.csv") + +## view the data +rna +``` + +Notice that the class of the data is now referred to as a "tibble". + +Tibbles tweak some of the behaviors of the data frame objects we introduced +previously. The data structure is very similar to a data frame. For our purposes +the only differences are that: + +1. It displays the data type of each column under its name. + Note that <`dbl`> is a data type defined to hold numeric values with + decimal points. + +2. It only prints the first few rows of data and only as many columns as fit on + one screen. 
+ +We are now going to learn some of the most common **`dplyr`** functions: + +- `select()`: subset columns +- `filter()`: subset rows on conditions +- `mutate()`: create new columns by using information from other columns +- `group_by()` and `summarise()`: create summary statistics on grouped data +- `arrange()`: sort results +- `count()`: count discrete values + +## Selecting columns and filtering rows + +To select columns of a data frame, use `select()`. The first argument +to this function is the data frame (`rna`), and the subsequent +arguments are the columns to keep. + +```{r, purl=TRUE} +select(rna, gene, sample, tissue, expression) +``` + +To select all columns _except_ certain ones, put a "-" in front of +the variable to exclude it. + +```{r, purl=TRUE} +select(rna, -tissue, -organism) +``` + +This will select all the variables in `rna` except `tissue` +and `organism`. + +To choose rows based on a specific criteria, use `filter()`: + +```{r, purl=TRUE} +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") +``` + +Now let's imagine we are interested in the human homologs of the mouse +genes analysed in this dataset. This information can be found in the +last column of the `rna` tibble, named +`hsapiens_homolog_associated_gene_name`. To visualise it easily, we +will create a new table containing just the 2 columns `gene` and +`hsapiens_homolog_associated_gene_name`. + +```{r} +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes +``` + +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, that determines whether +something is an `NA`. + +```{r, purl=TRUE} +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +``` + +If we want to keep only mouse genes that have a human homolog, we can +insert a "!" symbol that negates the result, so we're asking for +every row where hsapiens_homolog_associated_gene_name _is not_ an +`NA`. 
+ +```{r, purl=TRUE} +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +``` + +## Pipes + +What if you want to select and filter at the same time? There are three +ways to do this: use intermediate steps, nested functions, or pipes. + +With intermediate steps, you create a temporary data frame and use +that as input to the next function, like this: + +```{r, purl=TRUE} +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 +``` + +This is readable, but can clutter up your workspace with lots of +intermediate objects that you have to name individually. With multiple +steps, that can be hard to keep track of. + +You can also nest functions (i.e. one function inside of another), +like this: + +```{r, purl=TRUE} +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 +``` + +This is handy, but can be difficult to read if too many functions are nested, as +R evaluates the expression from the inside out (in this case, filtering, then selecting). + +The last option, _pipes_, is a recent addition to R. Pipes let you take +the output of one function and send it directly to the next, which is useful +when you need to do many things to the same dataset. + +Pipes in R look like `%>%` (made available via the **`magrittr`** +package) or `|>` (through base R). If you use RStudio, you can type +the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a Mac. + +In the code below, we use the pipe to send the `rna` dataset first +through `filter()` to keep rows where `sex` is Male, then through +`select()` to keep only the `gene`, `sample`, `tissue`, and +`expression` columns. 
+ +The pipe `%>%` takes the object on its left and passes it directly as +the first argument to the function on its right, we don't need to +explicitly include the data frame as an argument to the `filter()` and +`select()` functions any more. + +```{r, purl=TRUE} +rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) +``` + +Some may find it helpful to read the pipe like the word "then". For instance, +in the above example, we took the data frame `rna`, _then_ we `filter`ed +for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, +`tissue`, and `expression`. + +The **`dplyr`** functions by themselves are somewhat simple, but by +combining them into linear workflows with the pipe, we can accomplish +more complex manipulations of data frames. + +If we want to create a new object with this smaller version of the data, we +can assign it a new name: + +```{r, purl=TRUE} +rna3 <- rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) + +rna3 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using pipes, subset the `rna` data to keep observations in female mice at time 0, +where the gene has an expression higher than 50000, and retain only the columns +`gene`, `sample`, `time`, `expression` and `age`. + +::::::::::::::: solution + +## Solution + +```{r} +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Mutate + +Frequently you'll want to create new columns based on the values of existing +columns, for example to do unit conversions, or to find the ratio of values in two +columns. For this we'll use `mutate()`. 
+ +To create a new column of time in hours: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24) %>% + select(time, time_hours) +``` + +You can also create a second new column based on the first new column within the same call of `mutate()`: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Create a new data frame from the `rna` data that meets the following +criteria: contains only the `gene`, `chromosome_name`, +`phenotype_description`, `sample`, and `expression` columns. The expression +values should be log-transformed. This data frame must +only contain genes located on sex chromosomes, associated with a +phenotype_description, and with a log expression higher than 5. + +**Hint**: think about how the commands should be ordered to produce +this data frame! + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Split-apply-combine data analysis + +Many data analysis tasks can be approached using the +_split-apply-combine_ paradigm: split the data into groups, apply some +analysis to each group, and then combine the results. **`dplyr`** +makes this very easy through the use of the `group_by()` function. + +```{r} +rna %>% + group_by(gene) +``` + +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
+ +We could similarly decide to group the tibble by the samples: + +```{r} +rna %>% + group_by(sample) +``` + +Here our initial `tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$sample))` groups based on the `sample` variable. + +Once the data has been grouped, subsequent operations will be +applied on each group independently. + +### The `summarise()` function + +`group_by()` is often used together with `summarise()`, which +collapses each group into a single-row summary of that group. + +`group_by()` takes as arguments the column names that contain the +**categorical** variables for which you want to calculate the summary +statistics. So to compute the mean `expression` by gene: + +```{r} +rna %>% + group_by(gene) %>% + summarise(mean_expression = mean(expression)) +``` + +We might also want to calculate the mean expression levels of all genes in each sample: + +```{r} +rna %>% + group_by(sample) %>% + summarise(mean_expression = mean(expression)) +``` + +But we can also group by multiple columns: + +```{r} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) +``` + +Once the data is grouped, you can also summarise multiple variables at the same +time (and not necessarily on the same variable). For instance, we could add a +column indicating the median `expression` by gene and by condition: + +```{r, purl=TRUE} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Calculate the mean expression level of gene "Dok3" by timepoints. 
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rna %>% + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Counting + +When working with data, we often want to know the number of observations found +for each factor or combination of factors. For this task, **`dplyr`** provides +`count()`. For example, if we wanted to count the number of rows of data for +each infected and non-infected samples, we would do: + +```{r, purl=TRUE} +rna %>% + count(infection) +``` + +The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: + +```{r, purl=TRUE} +rna %>% + group_by(infection) %>% + summarise(n = n()) +``` + +The previous example shows the use of `count()` to count the number of rows/observations +for _one_ factor (i.e., `infection`). +If we wanted to count a _combination of factors_, such as `infection` and `time`, +we would specify the first and the second factor as the arguments of `count()`: + +```{r, purl=TRUE} +rna %>% + count(infection, time) +``` + +which is equivalent to this: + +```{r, purl=TRUE} +rna %>% + group_by(infection, time) %>% + summarise(n = n()) +``` + +It is sometimes useful to sort the result to facilitate the comparisons. +We can use `arrange()` to sort the table. +For instance, we might want to arrange the table above by time: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(time) +``` + +or by counts: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(n) +``` + +To sort in descending order, we need to add the `desc()` function: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(desc(n)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. 
How many genes were analysed in each sample? +2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +3. Pick one sample and evaluate the number of genes by biotype. +4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. + +::::::::::::::: solution + +## Solution + +```{r} +## 1. +rna %>% + count(sample) +## 2. +rna %>% + group_by(sample) %>% + summarise(seq_depth = sum(expression)) %>% + arrange(desc(seq_depth)) +## 3. +rna %>% + filter(sample == "GSM2545336") %>% + count(gene_biotype) %>% + arrange(desc(n)) +## 4. +rna %>% + filter(phenotype_description == "abnormal DNA methylation") %>% + group_by(gene, time) %>% + summarise(mean_expression = mean(log(expression))) %>% + arrange() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Reshaping data + +In the `rna` tibble, the rows contain expression values (the unit) that are +associated with a combination of 2 other variables: `gene` and `sample`. + +All the other columns correspond to variables describing either +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +The variables that don't change with genes or with samples will have the same value in all the rows. + +```{r} +rna %>% + arrange(gene) +``` + +This structure is called a `long-format`, as one column contains all the values, +and other column(s) list(s) the context of the value. + +In certain cases, the `long-format` is not really "human-readable", and another format, +a `wide-format` is preferred, as a more compact way of representing the data. +This is typically the case with gene expression values that scientists are used to +look as matrices, were rows represent genes and columns represent samples. 
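To make the contrast concrete, here is a minimal sketch on a small, made-up dataset. The gene and sample names (`geneA`, `s1`, ...) are hypothetical and not taken from `rna`. The same six measurements are written by hand, once in long format and once in wide format.

```r
library(tibble)

# Hypothetical toy data in long format:
# one row per (gene, sample) combination, all values in a single column
toy_long <- tribble(
  ~gene,   ~sample, ~expression,
  "geneA", "s1",    10,
  "geneA", "s2",    20,
  "geneA", "s3",    30,
  "geneB", "s1",     5,
  "geneB", "s2",    15,
  "geneB", "s3",    25
)

# The same values in wide format:
# one row per gene, one column per sample
toy_wide <- tribble(
  ~gene,   ~s1, ~s2, ~s3,
  "geneA",  10,  20,  30,
  "geneB",   5,  15,  25
)

dim(toy_long)  # 6 rows, 3 columns
dim(toy_wide)  # 2 rows, 4 columns
```

Both tables hold the same six expression values; only the layout differs. The wide layout is the matrix-like view described above.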
+ +In this format, it would therefore become straightforward +to explore the relationship between the gene expression levels within, and +between, the samples. + +```{r, echo=FALSE} +rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) +``` + +To convert the gene expression values from `rna` into a wide-format, +we need to create a new table where the values of the `sample` column would +become the names of column variables. + +The key point here is that we are still following +a tidy data structure, but we have **reshaped** the data according to +the observations of interest: expression levels per gene instead +of recording them per gene and per sample. + +The opposite transformation would be to transform column names into +values of a new variable. + +We can do both of these transformations with two `tidyr` functions, +`pivot_longer()` and `pivot_wider()` (see +[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +details). + +### Pivoting the data into a wider format + +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. + +```{r, purl=TRUE} +rna_exp <- rna %>% + select(gene, sample, expression) +rna_exp +``` + +`pivot_wider()` takes three main arguments: + +1. the data to be transformed; +2. the `names_from`: the column whose values will become new column + names; +3. the `values_from`: the column whose values will fill the new + columns. + +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_wider.png") + +``` + +```{r, purl=TRUE} +rna_wide <- rna_exp %>% + pivot_wider(names_from = sample, + values_from = expression) +rna_wide +``` + +Note that by default, the `pivot_wider()` function will add `NA` for missing values. + +Let's imagine that for some reason, we had some missing expression values for some +genes in certain samples. 
In the following fictitious example, the gene Cyp2d22 has only +one expression value, in sample GSM2545338. + +```{r, purl=TRUE} +rna_with_missing_values <- rna %>% + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) +rna_with_missing_values +``` + +By default, the `pivot_wider()` function will add `NA` for missing +values. This can be parameterised with the `values_fill` argument of +the `pivot_wider()` function. + +```{r, purl=TRUE} +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) + +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression, + values_fill = 0) +``` + +### Pivoting data into a longer format + +In the opposite situation we are using the column names and turning them into +a pair of new variables. One variable represents the column names as +values, and the other variable contains the values previously +associated with the column names. + +`pivot_longer()` takes four main arguments: + +1. the data to be transformed; +2. the `names_to`: the new column name we wish to create and populate with the + current column names; +3. the `values_to`: the new column name we wish to create and populate with + current values; +4. the names of the columns to be used to populate the `names_to` and + `values_to` variables (or to drop). + +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_longer.png") + +``` + +To recreate `rna_long` from `rna_wide` we would create a key +called `sample` and value called `expression` and use all columns +except `gene` for the key variable. Here we drop the `gene` column +with a minus sign. + +Notice how the new variable names are to be quoted here. 
+ +```{r} +rna_long <- rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +rna_long +``` + +We could also have used a specification for what columns to +include. This can be useful if you have a large number of identifying +columns, and it's easier to specify what to gather than what to leave +alone. Here the `starts_with()` function can help to retrieve sample +names without having to list them all! +Another possibility would be to use the `:` operator! + +```{r} +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + cols = starts_with("GSM")) +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + GSM2545336:GSM2545380) +``` + +Note that if we had missing values in the wide-format, the `NA` would be +included in the new long format. + +Remember our previous fictitious tibble containing missing values: + +```{r} +rna_with_missing_values + +wide_with_NA <- rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) +wide_with_NA + +wide_with_NA %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +``` + +Pivoting to wider and longer formats can be a useful way to balance out a dataset +so every replicate has the same composition. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Starting from the `rna` table, use the `pivot_wider()` function to create +a wide-format table giving the gene expression levels in each mouse. +Then use the `pivot_longer()` function to restore a long-format table. 
+ +::::::::::::::: solution + +## Solution + +```{r, answer=TRUE, purl=TRUE} +rna1 <- rna %>% +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) +rna1 + +rna1 %>% +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Subset genes located on X and Y chromosomes from the `rna` data frame and +spread the data frame with `sex` as columns, `chromosome_name` as +rows, and the mean expression of genes located in each chromosome as the values, +as in the following tibble: + +```{r, echo=FALSE, message=FALSE} +knitr::include_graphics("fig/Exercise_pivot_W.png") +``` + +You will need to summarise before reshaping! + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression level of X and Y linked genes from +male and female samples... + +```{r} + rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) +``` + +And pivot the table to wide format + +```{r, answer=TRUE, purl=TRUE} +rna_1 <- rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) + +rna_1 +``` + +Now take that data frame and transform it with `pivot_longer()` so +each row is a unique `chromosome_name` by `gender` combination. 
+ +```{r, answer=TRUE, purl=TRUE} +rna_1 %>% + pivot_longer(names_to = "gender", + values_to = "mean", + -chromosome_name) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the `rna` dataset to create an expression matrix where each row +represents the mean expression levels of genes and columns represent +the different timepoints. + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression by gene and by time + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) +``` + +before using the pivot_wider() function + +```{r} +rna_time <- rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) +rna_time +``` + +Notice that this generates a tibble with some column names starting by a number. +If we wanted to select the column corresponding to the timepoints, +we could not use the column names directly... What happens when we select the column 4? 
+ +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, 4) +``` + +To select the timepoint 4, we would have to quote the column name, with backticks "\`" + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, `4`) +``` + +Another possibility would be to rename the column, +choosing a name that doesn't start by a number : + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% + select(gene, time4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the previous data frame containing mean expression levels per timepoint and create +a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes +between timepoint 8 and timepoint 4. +Convert this table into a long-format table gathering the fold-changes calculated. + +::::::::::::::: solution + +## Solution + +Starting from the rna_time tibble: + +```{r} +rna_time +``` + +Calculate fold-changes: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) +``` + +And use the pivot_longer() function: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", + values_to = "Fold_changes", + time_8_vs_0:time_8_vs_4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Joining tables + +In many real life situations, data are spread across multiple tables. +Usually this occurs because different types of information are +collected from different sources. 
+ +It may be desirable for some analyses to combine data from two or more +tables into a single data frame based on a column that would be common +to all the tables. + +The `dplyr` package provides a set of join functions for combining two +data frames based on matches within specified columns. Here, we +provide a short introduction to joins. For further reading, please +refer to the chapter about table +joins. The +Data Transformation Cheat +Sheet +also provides a short overview on table joins. + +We are going to illustrate join using a small table, `rna_mini` that +we will create by subsetting the original `rna` table, keeping only 3 +columns and 10 lines. + +```{r} +rna_mini <- rna %>% + select(gene, sample, expression) %>% + head(10) +rna_mini +``` + +The second table, `annot1`, contains 2 columns, gene and +gene_description. You can either +[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +by clicking on the link and then moving it to the `data/` folder, or +you can use the R code below to download it directly to the folder. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") +annot1 +``` + +We now want to join these two tables into a single one containing all +variables using the `full_join()` function from the `dplyr` package. The +function will automatically find the common variable to match columns +from the first and second table. In this case, `gene` is the common +variable. Such variables are called keys. Keys are used to match +observations across different tables. + +```{r} +full_join(rna_mini, annot1) +``` + +In real life, gene annotations are sometimes labelled differently. + +The `annot2` table is exactly the same than `annot1` except that the +variable containing gene names is labelled differently. 
Again, either +[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +yourself and move it to `data/` or use the R code below. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") +annot2 +``` + +In case none of the variable names match, we can set manually the +variables to use for the matching. These variables can be set using +the `by` argument, as shown below with `rna_mini` and `annot2` tables. + +```{r} +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +``` + +As can be seen above, the variable name of the first table is retained +in the joined one. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Download the `annot3` table by clicking +[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +and put the table in your data/ repository. Using the `full_join()` +function, join tables `rna_mini` and `annot3`. What has happened for +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_ ? + +::::::::::::::: solution + +## Solution + +```{r, message=FALSE} +annot3 <- read_csv("data/annot3.csv") +full_join(rna_mini, annot3) +``` + +Genes _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_, and _mt-Tl1_ are only present in `annot3` table. Their respective values for the +variables of the table have been encoded as missing. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exporting data + +Now that you have learned how to use `dplyr` to extract information from +or summarise your raw data, you may want to export these new data sets to share +them with your collaborators or for archival. + +Similar to the `read_csv()` function used for reading CSV files into R, there is +a `write_csv()` function that generates CSV files from data frames. 
+ +Before using `write_csv()`, we are going to create a new folder, `data_output`, +in our working directory that will store this generated dataset. We don't want +to write generated datasets in the same directory as our raw data. +It's good practice to keep them separate. The `data` folder should only contain +the raw, unaltered data, and should be left alone to make sure we don't delete +or modify it. In contrast, our script will generate the contents of the `data_output` +directory, so even if the files it contains are deleted, we can always +re-generate them. + +Let's use `write_csv()` to save the rna_wide table that we have created previously. + +```{r, purl=TRUE, eval=FALSE} +write_csv(rna_wide, file = "data_output/rna_wide.csv") +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 971e5fff31b03e5b7c953821de14e1fc44f0dda5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:02 +0900 Subject: [PATCH 025/334] New translations 30-dplyr.md (Portuguese) --- locale/pt/episodes/30-dplyr.Rmd | 1044 +++++++++++++++++++++++++++++++ 1 file changed, 1044 insertions(+) create mode 100644 locale/pt/episodes/30-dplyr.Rmd diff --git a/locale/pt/episodes/30-dplyr.Rmd b/locale/pt/episodes/30-dplyr.Rmd new file mode 100644 index 000000000..d41f82e5f --- /dev/null +++ b/locale/pt/episodes/30-dplyr.Rmd @@ -0,0 +1,1044 @@ +--- +source: Rmd +title: Manipulating and analysing data with dplyr +teaching: 75 +exercises: 75 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe several of their functions that are extremely useful to + manipulate data. +- Describe the concept of a wide and a long table format, and see + how to reshape a data frame from one format to the other one. 
+- Demonstrate how to join tables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Data analysis in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +## Data manipulation using **`dplyr`** and **`tidyr`** + +Bracket subsetting is handy, but it can be cumbersome and difficult to +read, especially for complicated operations. + +Some packages can greatly facilitate our task when we manipulate data. +Packages in R are basically sets of additional functions that let you +do more stuff. The functions we've been using so far, like `str()` or +`data.frame()`, come built into R; Loading packages can give you access to other +specific functions. Before you use a package for the first time you need to install +it on your machine, and then you should import it in every subsequent +R session when you need it. + +- The package **`dplyr`** provides powerful tools for data manipulation tasks. + It is built to work directly with data frames, with many manipulation tasks + optimised. + +- As we will see latter on, sometimes we want a data frame to be reshaped to be able + to do some specific analyses or for visualisation. The package **`tidyr`** addresses + this common problem of reshaping data and provides tools for manipulating + data in a tidy way. + +To learn more about **`dplyr`** and **`tidyr`** after the workshop, +you may want to check out this handy data transformation with + +and this one about +. + +- The **`tidyverse`** package is an "umbrella-package" that installs + several useful packages for data analysis which work well together, + such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. 
+ These packages help us to work and interact with the data. + They allow us to do many things with your data, such as subsetting, transforming, + visualising, etc. + +If you did the set up, you should have already installed the tidyverse package. +Check to see if you have it by trying to load in from the library: + +```{r, message=FALSE, purl=TRUE} +## load the tidyverse packages, incl. dplyr +library("tidyverse") +``` + +If you got an error message `there is no package called ‘tidyverse’` then you have not +installed the package yet for this version of R. To install the **`tidyverse`** package type: + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("tidyverse") +``` + +If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! + +## Loading data with tidyverse + +Instead of `read.csv()`, we will read in our data using the `read_csv()` +function (notice the `_` instead of the `.`), from the tidyverse package +**`readr`**. + +```{r, message=FALSE, purl=TRUE} +rna <- read_csv("data/rnaseq.csv") + +## view the data +rna +``` + +Notice that the class of the data is now referred to as a "tibble". + +Tibbles tweak some of the behaviors of the data frame objects we introduced in the +previously. The data structure is very similar to a data frame. For our purposes +the only differences are that: + +1. It displays the data type of each column under its name. + Note that <`dbl`> is a data type defined to hold numeric values with + decimal points. + +2. It only prints the first few rows of data and only as many columns as fit on + one screen. 
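These two differences can be checked on a small example. This is only a sketch: `df` and `tb` are hypothetical objects created for illustration and are not part of the lesson data.

```r
library(tibble)

# A plain data frame with 100 rows (made-up values)
df <- data.frame(id = 1:100,
                 value = seq(0.5, 50, by = 0.5))

# The same data as a tibble
tb <- as_tibble(df)

class(df)  # a plain data frame
class(tb)  # a tibble is still a data frame, with extra classes

# Printing a tibble shows the column types under the column names
# and only the first few rows, instead of all 100 rows
tb
```

Because a tibble is still a data frame, functions that expect a data frame will generally accept it as well.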
+ +We are now going to learn some of the most common **`dplyr`** functions: + +- `select()`: subset columns +- `filter()`: subset rows on conditions +- `mutate()`: create new columns by using information from other columns +- `group_by()` and `summarise()`: create summary statistics on grouped data +- `arrange()`: sort results +- `count()`: count discrete values + +## Selecting columns and filtering rows + +To select columns of a data frame, use `select()`. The first argument +to this function is the data frame (`rna`), and the subsequent +arguments are the columns to keep. + +```{r, purl=TRUE} +select(rna, gene, sample, tissue, expression) +``` + +To select all columns _except_ certain ones, put a "-" in front of +the variable to exclude it. + +```{r, purl=TRUE} +select(rna, -tissue, -organism) +``` + +This will select all the variables in `rna` except `tissue` +and `organism`. + +To choose rows based on a specific criteria, use `filter()`: + +```{r, purl=TRUE} +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") +``` + +Now let's imagine we are interested in the human homologs of the mouse +genes analysed in this dataset. This information can be found in the +last column of the `rna` tibble, named +`hsapiens_homolog_associated_gene_name`. To visualise it easily, we +will create a new table containing just the 2 columns `gene` and +`hsapiens_homolog_associated_gene_name`. + +```{r} +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes +``` + +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, that determines whether +something is an `NA`. + +```{r, purl=TRUE} +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +``` + +If we want to keep only mouse genes that have a human homolog, we can +insert a "!" symbol that negates the result, so we're asking for +every row where hsapiens_homolog_associated_gene_name _is not_ an +`NA`. 
+ +```{r, purl=TRUE} +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +``` + +## Pipes + +What if you want to select and filter at the same time? There are three +ways to do this: use intermediate steps, nested functions, or pipes. + +With intermediate steps, you create a temporary data frame and use +that as input to the next function, like this: + +```{r, purl=TRUE} +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 +``` + +This is readable, but can clutter up your workspace with lots of +intermediate objects that you have to name individually. With multiple +steps, that can be hard to keep track of. + +You can also nest functions (i.e. one function inside of another), +like this: + +```{r, purl=TRUE} +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 +``` + +This is handy, but can be difficult to read if too many functions are nested, as +R evaluates the expression from the inside out (in this case, filtering, then selecting). + +The last option, _pipes_, are a recent addition to R. Pipes let you take +the output of one function and send it directly to the next, which is useful +when you need to do many things to the same dataset. + +Pipes in R look like `%>%` (made available via the **`magrittr`** +package) or `|>` (through base R). If you use RStudio, you can type +the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a Mac. + +In the above code, we use the pipe to send the `rna` dataset first +through `filter()` to keep rows where `sex` is Male, then through +`select()` to keep only the `gene`, `sample`, `tissue`, and +`expression`columns. 
+ +The pipe `%>%` takes the object on its left and passes it directly as +the first argument to the function on its right, we don't need to +explicitly include the data frame as an argument to the `filter()` and +`select()` functions any more. + +```{r, purl=TRUE} +rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) +``` + +Some may find it helpful to read the pipe like the word "then". For instance, +in the above example, we took the data frame `rna`, _then_ we `filter`ed +for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, +`tissue`, and `expression`. + +The **`dplyr`** functions by themselves are somewhat simple, but by +combining them into linear workflows with the pipe, we can accomplish +more complex manipulations of data frames. + +If we want to create a new object with this smaller version of the data, we +can assign it a new name: + +```{r, purl=TRUE} +rna3 <- rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) + +rna3 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using pipes, subset the `rna` data to keep observations in female mice at time 0, +where the gene has an expression higher than 50000, and retain only the columns +`gene`, `sample`, `time`, `expression` and `age`. + +::::::::::::::: solution + +## Solution + +```{r} +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Mutate + +Frequently you'll want to create new columns based on the values of existing +columns, for example to do unit conversions, or to find the ratio of values in two +columns. For this we'll use `mutate()`. 
+ +To create a new column of time in hours: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24) %>% + select(time, time_hours) +``` + +You can also create a second new column based on the first new column within the same call of `mutate()`: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Create a new data frame from the `rna` data that meets the following +criteria: contains only the `gene`, `chromosome_name`, +`phenotype_description`, `sample`, and `expression` columns. The expression +values should be log-transformed. This data frame must +only contain genes located on sex chromosomes, associated with a +phenotype_description, and with a log expression higher than 5. + +**Hint**: think about how the commands should be ordered to produce +this data frame! + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Split-apply-combine data analysis + +Many data analysis tasks can be approached using the +_split-apply-combine_ paradigm: split the data into groups, apply some +analysis to each group, and then combine the results. **`dplyr`** +makes this very easy through the use of the `group_by()` function. + +```{r} +rna %>% + group_by(gene) +``` + +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
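The fact that grouping only attaches metadata, without changing the rows or columns, can be verified on a small example. The sketch below uses a made-up tibble (`toy`, with hypothetical gene names) together with the `dplyr` helpers `n_groups()` and `group_vars()`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 5 observations for 2 genes
toy <- tibble(
  gene = c("geneA", "geneA", "geneB", "geneB", "geneB"),
  expression = c(10, 20, 5, 15, 25)
)

toy_grouped <- toy %>%
  group_by(gene)

# The grouped tibble has the same rows and columns as the original...
identical(dim(toy), dim(toy_grouped))

# ...but it now records which variable defines the groups,
# and how many groups there are
group_vars(toy_grouped)
n_groups(toy_grouped)  # 2
```

The grouping can be removed again with `ungroup()`, which returns an ordinary tibble.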
+ +We could similarly decide to group the tibble by the samples: + +```{r} +rna %>% + group_by(sample) +``` + +Here our initial `tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$sample))` groups based on the `sample` variable. + +Once the data has been grouped, subsequent operations will be +applied on each group independently. + +### The `summarise()` function + +`group_by()` is often used together with `summarise()`, which +collapses each group into a single-row summary of that group. + +`group_by()` takes as arguments the column names that contain the +**categorical** variables for which you want to calculate the summary +statistics. So to compute the mean `expression` by gene: + +```{r} +rna %>% + group_by(gene) %>% + summarise(mean_expression = mean(expression)) +``` + +We could also want to calculate the mean expression levels of all genes in each sample: + +```{r} +rna %>% + group_by(sample) %>% + summarise(mean_expression = mean(expression)) +``` + +But we can can also group by multiple columns: + +```{r} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) +``` + +Once the data is grouped, you can also summarise multiple variables at the same +time (and not necessarily on the same variable). For instance, we could add a +column indicating the median `expression` by gene and by condition: + +```{r, purl=TRUE} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Calculate the mean expression level of gene "Dok3" by timepoints. 
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rna %>% + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Counting + +When working with data, we often want to know the number of observations found +for each factor or combination of factors. For this task, **`dplyr`** provides +`count()`. For example, if we wanted to count the number of rows of data for +each infected and non-infected samples, we would do: + +```{r, purl=TRUE} +rna %>% + count(infection) +``` + +The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: + +```{r, purl=TRUE} +rna %>% + group_by(infection) %>% + summarise(n = n()) +``` + +The previous example shows the use of `count()` to count the number of rows/observations +for _one_ factor (i.e., `infection`). +If we wanted to count a _combination of factors_, such as `infection` and `time`, +we would specify the first and the second factor as the arguments of `count()`: + +```{r, purl=TRUE} +rna %>% + count(infection, time) +``` + +which is equivalent to this: + +```{r, purl=TRUE} +rna %>% + group_by(infection, time) %>% + summarise(n = n()) +``` + +It is sometimes useful to sort the result to facilitate the comparisons. +We can use `arrange()` to sort the table. +For instance, we might want to arrange the table above by time: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(time) +``` + +or by counts: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(n) +``` + +To sort in descending order, we need to add the `desc()` function: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(desc(n)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. 
How many genes were analysed in each sample? +2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +3. Pick one sample and evaluate the number of genes by biotype. +4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. + +::::::::::::::: solution + +## Solution + +```{r} +## 1. +rna %>% + count(sample) +## 2. +rna %>% + group_by(sample) %>% + summarise(seq_depth = sum(expression)) %>% + arrange(desc(seq_depth)) +## 3. +rna %>% + filter(sample == "GSM2545336") %>% + count(gene_biotype) %>% + arrange(desc(n)) +## 4. +rna %>% + filter(phenotype_description == "abnormal DNA methylation") %>% + group_by(gene, time) %>% + summarise(mean_expression = mean(log(expression))) %>% + arrange() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Reshaping data + +In the `rna` tibble, the rows contain expression values (the unit) that are +associated with a combination of 2 other variables: `gene` and `sample`. + +All the other columns correspond to variables describing either +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +The variables that don't change with genes or with samples will have the same value in all the rows. + +```{r} +rna %>% + arrange(gene) +``` + +This structure is called a `long-format`, as one column contains all the values, +and the other column(s) list the context of each value. + +In certain cases, the `long-format` is not really "human-readable", and another format, +a `wide-format`, is preferred as a more compact way of representing the data. +This is typically the case with gene expression values, which scientists are used to +looking at as matrices, where rows represent genes and columns represent samples. 
+ +In this format, it would therefore become straightforward +to explore the relationship between the gene expression levels within, and +between, the samples. + +```{r, echo=FALSE} +rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) +``` + +To convert the gene expression values from `rna` into a wide-format, +we need to create a new table where the values of the `sample` column would +become the names of column variables. + +The key point here is that we are still following +a tidy data structure, but we have **reshaped** the data according to +the observations of interest: expression levels per gene instead +of recording them per gene and per sample. + +The opposite transformation would be to transform column names into +values of a new variable. + +We can do both of these transformations with two `tidyr` functions, +`pivot_longer()` and `pivot_wider()` (see +[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +details). + +### Pivoting the data into a wider format + +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. + +```{r, purl=TRUE} +rna_exp <- rna %>% + select(gene, sample, expression) +rna_exp +``` + +`pivot_wider()` takes three main arguments: + +1. the data to be transformed; +2. the `names_from`: the column whose values will become new column + names; +3. the `values_from`: the column whose values will fill the new + columns. + +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_wider.png") +``` + +```{r, purl=TRUE} +rna_wide <- rna_exp %>% + pivot_wider(names_from = sample, + values_from = expression) +rna_wide +``` + +Note that by default, the `pivot_wider()` function will add `NA` for missing values. + +Let's imagine that for some reason, we had some missing expression values for some +genes in certain samples. 
In the following fictitious example, the gene Cyp2d22 has only +one expression value, in sample GSM2545338. + +```{r, purl=TRUE} +rna_with_missing_values <- rna %>% + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) +rna_with_missing_values +``` + +By default, the `pivot_wider()` function will add `NA` for missing +values. This can be parameterised with the `values_fill` argument of +the `pivot_wider()` function. + +```{r, purl=TRUE} +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) + +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression, + values_fill = 0) +``` + +### Pivoting data into a longer format + +In the opposite situation we are using the column names and turning them into +a pair of new variables. One variable represents the column names as +values, and the other variable contains the values previously +associated with the column names. + +`pivot_longer()` takes four main arguments: + +1. the data to be transformed; +2. the `names_to`: the new column name we wish to create and populate with the + current column names; +3. the `values_to`: the new column name we wish to create and populate with + current values; +4. the names of the columns to be used to populate the `names_to` and + `values_to` variables (or to drop). + +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_longer.png") +``` + +To recreate `rna_long` from `rna_wide` we would create a key +called `sample` and a value called `expression`, and use all columns +except `gene` for the key variable. Here we drop the `gene` column +with a minus sign. + +Notice that the new variable names must be quoted here. 
+ +```{r} +rna_long <- rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +rna_long +```` + +We could also have used a specification for what columns to +include. This can be useful if you have a large number of identifying +columns, and it's easier to specify what to gather than what to leave +alone. Here the `starts_with()` function can help to retrieve sample +names without having to list them all! +Another possibility would be to use the `:` operator! + +```{r} +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + cols = starts_with("GSM")) +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + GSM2545336:GSM2545380) +``` + +Note that if we had missing values in the wide-format, the `NA` would be +included in the new long format. + +Remember our previous fictive tibble containing missing values: + +```{r} +rna_with_missing_values + +wide_with_NA <- rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) +wide_with_NA + +wide_with_NA %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +``` + +Pivoting to wider and longer formats can be a useful way to balance out a dataset +so every replicate has the same composition. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Starting from the rna table, use the `pivot_wider()` function to create +a wide-format table giving the gene expression levels in each mouse. +Then use the `pivot_longer()` function to restore a long-format table. 
+ +::::::::::::::: solution + +## Solution + +```{r, answer=TRUE, purl=TRUE} +rna1 <- rna %>% +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) +rna1 + +rna1 %>% +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Subset genes located on X and Y chromosomes from the `rna` data frame and +spread the data frame with `sex` as columns, `chromosome_name` as +rows, and the mean expression of genes located in each chromosome as the values, +as in the following tibble: + +```{r, echo=FALSE, message=FALSE} +knitr::include_graphics("fig/Exercise_pivot_W.png") +``` + +You will need to summarise before reshaping! + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression level of X and Y linked genes from +male and female samples... + +```{r} + rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) +``` + +And pivot the table to wide format + +```{r, answer=TRUE, purl=TRUE} +rna_1 <- rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) + +rna_1 +``` + +Now take that data frame and transform it with `pivot_longer()` so +each row is a unique `chromosome_name` by `gender` combination. 
+ +```{r, answer=TRUE, purl=TRUE} +rna_1 %>% + pivot_longer(names_to = "gender", + values_to = "mean", + -chromosome_name) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the `rna` dataset to create an expression matrix where each row +represents the mean expression levels of genes and columns represent +the different timepoints. + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression by gene and by time + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) +``` + +before using the pivot_wider() function + +```{r} +rna_time <- rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) +rna_time +``` + +Notice that this generates a tibble with some column names starting by a number. +If we wanted to select the column corresponding to the timepoints, +we could not use the column names directly... What happens when we select the column 4? 
+ +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, 4) +``` + +To select the timepoint 4, we would have to quote the column name, with backticks "\`" + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, `4`) +``` + +Another possibility would be to rename the column, +choosing a name that doesn't start by a number : + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% + select(gene, time4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the previous data frame containing mean expression levels per timepoint and create +a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes +between timepoint 8 and timepoint 4. +Convert this table into a long-format table gathering the fold-changes calculated. + +::::::::::::::: solution + +## Solution + +Starting from the rna_time tibble: + +```{r} +rna_time +``` + +Calculate fold-changes: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) +``` + +And use the pivot_longer() function: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", + values_to = "Fold_changes", + time_8_vs_0:time_8_vs_4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Joining tables + +In many real life situations, data are spread across multiple tables. +Usually this occurs because different types of information are +collected from different sources. 
+ +It may be desirable for some analyses to combine data from two or more +tables into a single data frame based on a column that would be common +to all the tables. + +The `dplyr` package provides a set of join functions for combining two +data frames based on matches within specified columns. Here, we +provide a short introduction to joins. For further reading, please +refer to the chapter about table +joins. The +Data Transformation Cheat +Sheet +also provides a short overview of table joins. + +We are going to illustrate joins using a small table, `rna_mini`, that +we will create by subsetting the original `rna` table, keeping only 3 +columns and 10 rows. + +```{r} +rna_mini <- rna %>% + select(gene, sample, expression) %>% + head(10) +rna_mini +``` + +The second table, `annot1`, contains 2 columns, gene and +gene_description. You can either +[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +by clicking on the link and then moving it to the `data/` folder, or +you can use the R code below to download it directly to the folder. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") +annot1 +``` + +We now want to join these two tables into a single one containing all +variables using the `full_join()` function from the `dplyr` package. The +function will automatically find the common variable to match columns +from the first and second table. In this case, `gene` is the common +variable. Such variables are called keys. Keys are used to match +observations across different tables. + +```{r} +full_join(rna_mini, annot1) +``` + +In real life, gene annotations are sometimes labelled differently. + +The `annot2` table is exactly the same as `annot1` except that the +variable containing gene names is labelled differently. 
Again, either +[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +yourself and move it to `data/` or use the R code below. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") +annot2 +``` + +In case none of the variable names match, we can manually set the +variables to use for the matching. These variables can be set using +the `by` argument, as shown below with the `rna_mini` and `annot2` tables. + +```{r} +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +``` + +As can be seen above, the variable name of the first table is retained +in the joined one. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Download the `annot3` table by clicking +[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +and put the table in your `data/` folder. Using the `full_join()` +function, join tables `rna_mini` and `annot3`. What has happened to +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_? + +::::::::::::::: solution + +## Solution + +```{r, message=FALSE} +annot3 <- read_csv("data/annot3.csv") +full_join(rna_mini, annot3) +``` + +Gene _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_, and _mt-Tl1_ are only present in the `annot3` table. Their values for the +variables of the other table have been encoded as missing (`NA`). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exporting data + +Now that you have learned how to use `dplyr` to extract information from +or summarise your raw data, you may want to export these new data sets to share +them with your collaborators or for archival. + +Similar to the `read_csv()` function used for reading CSV files into R, there is +a `write_csv()` function that generates CSV files from data frames. 
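The write/read round trip can be sketched on a toy table as follows. This is a minimal illustration, assuming the **`readr`** and **`tibble`** packages are installed; the `toy` tibble, its values, and the temporary output folder are hypothetical and not part of the lesson's data.

```r
library(readr)
library(tibble)

# Hypothetical toy table standing in for a processed result
toy <- tibble(gene = c("geneA", "geneB"), expression = c(10, 25))

# Write into a temporary folder so the working directory is left untouched
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy.csv")
write_csv(toy, file = out_file)

# Read the file back and compare the columns with the original table
toy_back <- read_csv(out_file, show_col_types = FALSE)
identical(toy$gene, toy_back$gene)
all.equal(toy$expression, toy_back$expression)
```

Writing to `tempdir()` is only a convenience for this sketch; in a real project the generated files would go to a dedicated output folder that is kept separate from the raw data.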
+ +Before using `write_csv()`, we are going to create a new folder, `data_output`, +in our working directory that will store this generated dataset. We don't want +to write generated datasets in the same directory as our raw data. +It's good practice to keep them separate. The `data` folder should only contain +the raw, unaltered data, and should be left alone to make sure we don't delete +or modify it. In contrast, our script will generate the contents of the `data_output` +directory, so even if the files it contains are deleted, we can always +re-generate them. + +Let's use `write_csv()` to save the rna_wide table that we have created previously. + +```{r, purl=TRUE, eval=FALSE} +write_csv(rna_wide, file = "data_output/rna_wide.csv") +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 949857586da8c18da4684217ce514baf7c2af8c4 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:04 +0900 Subject: [PATCH 026/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 1044 +++++++++++++++++++++++++++++++ 1 file changed, 1044 insertions(+) create mode 100644 locale/zh/episodes/30-dplyr.Rmd diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd new file mode 100644 index 000000000..d41f82e5f --- /dev/null +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -0,0 +1,1044 @@ +--- +source: Rmd +title: Manipulating and analysing data with dplyr +teaching: 75 +exercises: 75 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe several of their functions that are extremely useful to + manipulate data. +- Describe the concept of a wide and a long table format, and see + how to reshape a data frame from one format to the other one. 
+- Demonstrate how to join tables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Data analysis in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +## Data manipulation using **`dplyr`** and **`tidyr`** + +Bracket subsetting is handy, but it can be cumbersome and difficult to +read, especially for complicated operations. + +Some packages can greatly facilitate our task when we manipulate data. +Packages in R are basically sets of additional functions that let you +do more stuff. The functions we've been using so far, like `str()` or +`data.frame()`, come built into R; loading packages can give you access to other +specific functions. Before you use a package for the first time you need to install +it on your machine, and then you should import it in every subsequent +R session when you need it. + +- The package **`dplyr`** provides powerful tools for data manipulation tasks. + It is built to work directly with data frames, with many manipulation tasks + optimised. + +- As we will see later on, sometimes we want a data frame to be reshaped to be able + to do some specific analyses or for visualisation. The package **`tidyr`** addresses + this common problem of reshaping data and provides tools for manipulating + data in a tidy way. + +To learn more about **`dplyr`** and **`tidyr`** after the workshop, +you may want to check out this handy data transformation with +**`dplyr`** cheatsheet +and this one about +**`tidyr`**. + +- The **`tidyverse`** package is an "umbrella-package" that installs + several useful packages for data analysis which work well together, + such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. 
+ These packages help us to work and interact with the data. + They allow us to do many things with our data, such as subsetting, transforming, + visualising, etc. + +If you completed the setup, you should have already installed the tidyverse package. +Check to see if you have it by trying to load it from the library: + +```{r, message=FALSE, purl=TRUE} +## load the tidyverse packages, incl. dplyr +library("tidyverse") +``` + +If you got an error message `there is no package called ‘tidyverse’` then you have not +installed the package yet for this version of R. To install the **`tidyverse`** package type: + +```{r, eval=FALSE, purl=TRUE} +BiocManager::install("tidyverse") +``` + +If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! + +## Loading data with tidyverse + +Instead of `read.csv()`, we will read in our data using the `read_csv()` +function (notice the `_` instead of the `.`), from the tidyverse package +**`readr`**. + +```{r, message=FALSE, purl=TRUE} +rna <- read_csv("data/rnaseq.csv") + +## view the data +rna +``` + +Notice that the class of the data is now referred to as a "tibble". + +Tibbles tweak some of the behaviors of the data frame objects we introduced +previously. The data structure is very similar to a data frame. For our purposes +the only differences are that: + +1. It displays the data type of each column under its name. + Note that <`dbl`> is a data type defined to hold numeric values with + decimal points. + +2. It only prints the first few rows of data and only as many columns as fit on + one screen. 
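The two differences listed above can be seen on a small example. This is a minimal sketch, assuming the **`tibble`** package is installed; the `df` and `tb` objects and their values are hypothetical and not taken from the lesson's dataset.

```r
library(tibble)

# Hypothetical toy data, not taken from the lesson's dataset
df <- data.frame(gene = c("geneA", "geneB"), expression = c(1.5, 3))
tb <- as_tibble(df)

class(df)  # a plain data frame
class(tb)  # a tibble is still a data frame, with extra classes on top

# Printing a tibble shows its dimensions and the type of each column
tb
```

Because a tibble is still a data frame, functions that expect a data frame generally accept it as well.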
+ +We are now going to learn some of the most common **`dplyr`** functions: + +- `select()`: subset columns +- `filter()`: subset rows on conditions +- `mutate()`: create new columns by using information from other columns +- `group_by()` and `summarise()`: create summary statistics on grouped data +- `arrange()`: sort results +- `count()`: count discrete values + +## Selecting columns and filtering rows + +To select columns of a data frame, use `select()`. The first argument +to this function is the data frame (`rna`), and the subsequent +arguments are the columns to keep. + +```{r, purl=TRUE} +select(rna, gene, sample, tissue, expression) +``` + +To select all columns _except_ certain ones, put a "-" in front of +the variable to exclude it. + +```{r, purl=TRUE} +select(rna, -tissue, -organism) +``` + +This will select all the variables in `rna` except `tissue` +and `organism`. + +To choose rows based on a specific criteria, use `filter()`: + +```{r, purl=TRUE} +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") +``` + +Now let's imagine we are interested in the human homologs of the mouse +genes analysed in this dataset. This information can be found in the +last column of the `rna` tibble, named +`hsapiens_homolog_associated_gene_name`. To visualise it easily, we +will create a new table containing just the 2 columns `gene` and +`hsapiens_homolog_associated_gene_name`. + +```{r} +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes +``` + +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, that determines whether +something is an `NA`. + +```{r, purl=TRUE} +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +``` + +If we want to keep only mouse genes that have a human homolog, we can +insert a "!" symbol that negates the result, so we're asking for +every row where hsapiens_homolog_associated_gene_name _is not_ an +`NA`. 
+ +```{r, purl=TRUE} +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +``` + +## Pipes + +What if you want to select and filter at the same time? There are three +ways to do this: use intermediate steps, nested functions, or pipes. + +With intermediate steps, you create a temporary data frame and use +that as input to the next function, like this: + +```{r, purl=TRUE} +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 +``` + +This is readable, but can clutter up your workspace with lots of +intermediate objects that you have to name individually. With multiple +steps, that can be hard to keep track of. + +You can also nest functions (i.e. one function inside of another), +like this: + +```{r, purl=TRUE} +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 +``` + +This is handy, but can be difficult to read if too many functions are nested, as +R evaluates the expression from the inside out (in this case, filtering, then selecting). + +The last option, _pipes_, are a recent addition to R. Pipes let you take +the output of one function and send it directly to the next, which is useful +when you need to do many things to the same dataset. + +Pipes in R look like `%>%` (made available via the **`magrittr`** +package) or `|>` (through base R). If you use RStudio, you can type +the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you +have a Mac. + +In the code below, we use the pipe to send the `rna` dataset first +through `filter()` to keep rows where `sex` is Male, then through +`select()` to keep only the `gene`, `sample`, `tissue`, and +`expression` columns. 
+ +The pipe `%>%` takes the object on its left and passes it directly as +the first argument to the function on its right, we don't need to +explicitly include the data frame as an argument to the `filter()` and +`select()` functions any more. + +```{r, purl=TRUE} +rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) +``` + +Some may find it helpful to read the pipe like the word "then". For instance, +in the above example, we took the data frame `rna`, _then_ we `filter`ed +for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, +`tissue`, and `expression`. + +The **`dplyr`** functions by themselves are somewhat simple, but by +combining them into linear workflows with the pipe, we can accomplish +more complex manipulations of data frames. + +If we want to create a new object with this smaller version of the data, we +can assign it a new name: + +```{r, purl=TRUE} +rna3 <- rna %>% + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) + +rna3 +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Using pipes, subset the `rna` data to keep observations in female mice at time 0, +where the gene has an expression higher than 50000, and retain only the columns +`gene`, `sample`, `time`, `expression` and `age`. + +::::::::::::::: solution + +## Solution + +```{r} +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Mutate + +Frequently you'll want to create new columns based on the values of existing +columns, for example to do unit conversions, or to find the ratio of values in two +columns. For this we'll use `mutate()`. 
+ +To create a new column of time in hours: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24) %>% + select(time, time_hours) +``` + +You can also create a second new column based on the first new column within the same call of `mutate()`: + +```{r, purl=TRUE} +rna %>% + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Create a new data frame from the `rna` data that meets the following +criteria: contains only the `gene`, `chromosome_name`, +`phenotype_description`, `sample`, and `expression` columns. The expression +values should be log-transformed. This data frame must +only contain genes located on sex chromosomes, associated with a +phenotype_description, and with a log expression higher than 5. + +**Hint**: think about how the commands should be ordered to produce +this data frame! + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Split-apply-combine data analysis + +Many data analysis tasks can be approached using the +_split-apply-combine_ paradigm: split the data into groups, apply some +analysis to each group, and then combine the results. **`dplyr`** +makes this very easy through the use of the `group_by()` function. + +```{r} +rna %>% + group_by(gene) +``` + +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
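That grouping only adds metadata, and leaves the rows untouched, can be checked on a toy table. This is a minimal sketch, assuming **`dplyr`** and **`tibble`** are installed; the `toy` tibble and its values are hypothetical and stand in for the much larger `rna` table.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 2 genes measured in 3 samples
toy <- tibble(
  gene = rep(c("geneA", "geneB"), each = 3),
  sample = rep(c("s1", "s2", "s3"), times = 2),
  expression = c(10, 20, 30, 5, 15, 25)
)

toy_by_gene <- toy %>%
  group_by(gene)

# The rows are unchanged; only grouping metadata has been added
nrow(toy_by_gene)        # same number of rows as `toy`
n_groups(toy_by_gene)    # one group per distinct gene
group_vars(toy_by_gene)  # the name(s) of the grouping variable(s)
```

The grouping can be dropped again with `ungroup()` once the per-group operations are done.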
+ +We could similarly decide to group the tibble by the samples: + +```{r} +rna %>% + group_by(sample) +``` + +Here our initial `tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$sample))` groups based on the `sample` variable. + +Once the data has been grouped, subsequent operations will be +applied on each group independently. + +### The `summarise()` function + +`group_by()` is often used together with `summarise()`, which +collapses each group into a single-row summary of that group. + +`group_by()` takes as arguments the column names that contain the +**categorical** variables for which you want to calculate the summary +statistics. So to compute the mean `expression` by gene: + +```{r} +rna %>% + group_by(gene) %>% + summarise(mean_expression = mean(expression)) +``` + +We might also want to calculate the mean expression levels of all genes in each sample: + +```{r} +rna %>% + group_by(sample) %>% + summarise(mean_expression = mean(expression)) +``` + +We can also group by multiple columns: + +```{r} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) +``` + +Once the data is grouped, you can also compute several summaries at the same +time (and not necessarily on the same variable). For instance, we could add a +column indicating the median `expression` by gene and by condition: + +```{r, purl=TRUE} +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Calculate the mean expression level of gene "Dok3" by timepoint. 
+ +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +rna %>% + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Counting + +When working with data, we often want to know the number of observations found +for each factor or combination of factors. For this task, **`dplyr`** provides +`count()`. For example, if we wanted to count the number of rows of data for +each infected and non-infected samples, we would do: + +```{r, purl=TRUE} +rna %>% + count(infection) +``` + +The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: + +```{r, purl=TRUE} +rna %>% + group_by(infection) %>% + summarise(n = n()) +``` + +The previous example shows the use of `count()` to count the number of rows/observations +for _one_ factor (i.e., `infection`). +If we wanted to count a _combination of factors_, such as `infection` and `time`, +we would specify the first and the second factor as the arguments of `count()`: + +```{r, purl=TRUE} +rna %>% + count(infection, time) +``` + +which is equivalent to this: + +```{r, purl=TRUE} +rna %>% + group_by(infection, time) %>% + summarise(n = n()) +``` + +It is sometimes useful to sort the result to facilitate the comparisons. +We can use `arrange()` to sort the table. +For instance, we might want to arrange the table above by time: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(time) +``` + +or by counts: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(n) +``` + +To sort in descending order, we need to add the `desc()` function: + +```{r, purl=TRUE} +rna %>% + count(infection, time) %>% + arrange(desc(n)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +1. 
How many genes were analysed in each sample? +2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +3. Pick one sample and evaluate the number of genes by biotype. +4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. + +::::::::::::::: solution + +## Solution + +```{r} +## 1. +rna %>% + count(sample) +## 2. +rna %>% + group_by(sample) %>% + summarise(seq_depth = sum(expression)) %>% + arrange(desc(seq_depth)) +## 3. +rna %>% + filter(sample == "GSM2545336") %>% + count(gene_biotype) %>% + arrange(desc(n)) +## 4. +rna %>% + filter(phenotype_description == "abnormal DNA methylation") %>% + group_by(gene, time) %>% + summarise(mean_expression = mean(log(expression))) %>% + arrange() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Reshaping data + +In the `rna` tibble, the rows contain expression values (the unit) that are +associated with a combination of 2 other variables: `gene` and `sample`. + +All the other columns correspond to variables describing either +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +The variables that don't change with genes or with samples will have the same value in all the rows. + +```{r} +rna %>% + arrange(gene) +``` + +This structure is called a `long-format`, as one column contains all the values, +and other column(s) list(s) the context of the value. + +In certain cases, the `long-format` is not really "human-readable", and another format, +a `wide-format`, is preferred as a more compact way of representing the data. +This is typically the case with gene expression values, which scientists are used to +looking at as matrices, where rows represent genes and columns represent samples.
+ +In this format, it would therefore become straightforward +to explore the relationship between the gene expression levels within, and +between, the samples. + +```{r, echo=FALSE} +rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) +``` + +To convert the gene expression values from `rna` into a wide-format, +we need to create a new table where the values of the `sample` column would +become the names of column variables. + +The key point here is that we are still following +a tidy data structure, but we have **reshaped** the data according to +the observations of interest: expression levels per gene instead +of recording them per gene and per sample. + +The opposite transformation would be to transform column names into +values of a new variable. + +We can do both of these transformations with two `tidyr` functions, +`pivot_longer()` and `pivot_wider()` (see +[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +details). + +### Pivoting the data into a wider format + +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. + +```{r, purl=TRUE} +rna_exp <- rna %>% + select(gene, sample, expression) +rna_exp +``` + +`pivot_wider()` takes three main arguments: + +1. the data to be transformed; +2. `names_from`: the column whose values will become new column + names; +3. `values_from`: the column whose values will fill the new + columns. + +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_wider.png") +``` + +```{r, purl=TRUE} +rna_wide <- rna_exp %>% + pivot_wider(names_from = sample, + values_from = expression) +rna_wide +``` + +Note that by default, the `pivot_wider()` function will add `NA` for missing values. + +Let's imagine that for some reason, we had some missing expression values for some +genes in certain samples.
In the following fictitious example, the gene Cyp2d22 has only +one expression value, in the GSM2545338 sample. + +```{r, purl=TRUE} +rna_with_missing_values <- rna %>% + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) +rna_with_missing_values +``` + +By default, the `pivot_wider()` function will add `NA` for missing +values. This can be parameterised with the `values_fill` argument of +the `pivot_wider()` function. + +```{r, purl=TRUE} +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) + +rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression, + values_fill = 0) +``` + +### Pivoting data into a longer format + +In the opposite situation we are using the column names and turning them into +a pair of new variables. One variable represents the column names as +values, and the other variable contains the values previously +associated with the column names. + +`pivot_longer()` takes four main arguments: + +1. the data to be transformed; +2. `names_to`: the new column name we wish to create and populate with the + current column names; +3. `values_to`: the new column name we wish to create and populate with + current values; +4. the names of the columns to be used to populate the `names_to` and + `values_to` variables (or to drop). + +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +knitr::include_graphics("fig/pivot_longer.png") +``` + +To recreate `rna_long` from `rna_wide`, we would create a key +called `sample` and a value called `expression`, and use all columns +except `gene` for the key variable. Here we exclude the `gene` column +with a minus sign. + +Notice how the new variable names are quoted here.
+ +```{r} +rna_long <- rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +rna_long +```` + +We could also have used a specification for what columns to +include. This can be useful if you have a large number of identifying +columns, and it's easier to specify what to gather than what to leave +alone. Here the `starts_with()` function can help to retrieve sample +names without having to list them all! +Another possibility would be to use the `:` operator! + +```{r} +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + cols = starts_with("GSM")) +rna_wide %>% + pivot_longer(names_to = "sample", + values_to = "expression", + GSM2545336:GSM2545380) +``` + +Note that if we had missing values in the wide-format, the `NA` would be +included in the new long format. + +Remember our previous fictive tibble containing missing values: + +```{r} +rna_with_missing_values + +wide_with_NA <- rna_with_missing_values %>% + pivot_wider(names_from = sample, + values_from = expression) +wide_with_NA + +wide_with_NA %>% + pivot_longer(names_to = "sample", + values_to = "expression", + -gene) +``` + +Pivoting to wider and longer formats can be a useful way to balance out a dataset +so every replicate has the same composition. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Starting from the rna table, use the `pivot_wider()` function to create +a wide-format table giving the gene expression levels in each mouse. +Then use the `pivot_longer()` function to restore a long-format table. 
+ +::::::::::::::: solution + +## Solution + +```{r, answer=TRUE, purl=TRUE} +rna1 <- rna %>% +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) +rna1 + +rna1 %>% +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Subset genes located on X and Y chromosomes from the `rna` data frame and +spread the data frame with `sex` as columns, `chromosome_name` as +rows, and the mean expression of genes located in each chromosome as the values, +as in the following tibble: + +```{r, echo=FALSE, message=FALSE} +knitr::include_graphics("fig/Exercise_pivot_W.png") +``` + +You will need to summarise before reshaping! + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression level of X and Y linked genes from +male and female samples... + +```{r} + rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) +``` + +And pivot the table to wide format + +```{r, answer=TRUE, purl=TRUE} +rna_1 <- rna %>% + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) + +rna_1 +``` + +Now take that data frame and transform it with `pivot_longer()` so +each row is a unique `chromosome_name` by `gender` combination. 
+ +```{r, answer=TRUE, purl=TRUE} +rna_1 %>% + pivot_longer(names_to = "gender", + values_to = "mean", + -chromosome_name) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the `rna` dataset to create an expression matrix where each row +represents the mean expression levels of genes and columns represent +the different timepoints. + +::::::::::::::: solution + +## Solution + +Let's first calculate the mean expression by gene and by time + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) +``` + +before using the pivot_wider() function + +```{r} +rna_time <- rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) +rna_time +``` + +Notice that this generates a tibble with some column names starting by a number. +If we wanted to select the column corresponding to the timepoints, +we could not use the column names directly... What happens when we select the column 4? 
+ +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, 4) +``` + +To select the timepoint 4, we would have to quote the column name, with backticks "\`" + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + select(gene, `4`) +``` + +Another possibility would be to rename the column, +choosing a name that doesn't start by a number : + +```{r} +rna %>% + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% + select(gene, time4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Question + +Use the previous data frame containing mean expression levels per timepoint and create +a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes +between timepoint 8 and timepoint 4. +Convert this table into a long-format table gathering the fold-changes calculated. + +::::::::::::::: solution + +## Solution + +Starting from the rna_time tibble: + +```{r} +rna_time +``` + +Calculate fold-changes: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) +``` + +And use the pivot_longer() function: + +```{r} +rna_time %>% + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", + values_to = "Fold_changes", + time_8_vs_0:time_8_vs_4) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Joining tables + +In many real life situations, data are spread across multiple tables. +Usually this occurs because different types of information are +collected from different sources. 
+ +It may be desirable for some analyses to combine data from two or more +tables into a single data frame based on a column that would be common +to all the tables. + +The `dplyr` package provides a set of join functions for combining two +data frames based on matches within specified columns. Here, we +provide a short introduction to joins. For further reading, please +refer to the chapter about table joins. The Data Transformation Cheat Sheet +also provides a short overview of table joins. + +We are going to illustrate joins using a small table, `rna_mini`, that +we will create by subsetting the original `rna` table, keeping only 3 +columns and 10 rows. + +```{r} +rna_mini <- rna %>% + select(gene, sample, expression) %>% + head(10) +rna_mini +``` + +The second table, `annot1`, contains 2 columns, `gene` and +`gene_description`. You can either +[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +by clicking on the link and then moving it to the `data/` folder, or +you can use the R code below to download it directly to the folder. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") +annot1 +``` + +We now want to join these two tables into a single one containing all +variables using the `full_join()` function from the `dplyr` package. The +function will automatically find the common variable to match columns +from the first and second table. In this case, `gene` is the common +variable. Such variables are called keys. Keys are used to match +observations across different tables. + +```{r} +full_join(rna_mini, annot1) +``` + +In real life, gene annotations are sometimes labelled differently. + +The `annot2` table is exactly the same as `annot1` except that the +variable containing gene names is labelled differently.
Again, either +[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +yourself and move it to `data/` or use the R code below. + +```{r, message=FALSE} +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") +annot2 +``` + +If none of the variable names match, we can manually set the +variables to use for the matching. These variables can be set using +the `by` argument, as shown below with the `rna_mini` and `annot2` tables. + +```{r} +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +``` + +As can be seen above, the variable name of the first table is retained +in the joined one. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge: + +Download the `annot3` table by clicking +[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +and put the table in your `data/` folder. Using the `full_join()` +function, join tables `rna_mini` and `annot3`. What has happened for +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_? + +::::::::::::::: solution + +## Solution + +```{r, message=FALSE} +annot3 <- read_csv("data/annot3.csv") +full_join(rna_mini, annot3) +``` + +Gene _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_, and _mt-Tl1_ are only present in the `annot3` table. Their values for the +variables coming from the other table have been encoded as missing (`NA`). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exporting data + +Now that you have learned how to use `dplyr` to extract information from +or summarise your raw data, you may want to export these new data sets to share +them with your collaborators or for archival. + +Similar to the `read_csv()` function used for reading CSV files into R, there is +a `write_csv()` function that generates CSV files from data frames.
+ +Before using `write_csv()`, we are going to create a new folder, `data_output`, +in our working directory that will store this generated dataset. We don't want +to write generated datasets in the same directory as our raw data. +It's good practice to keep them separate. The `data` folder should only contain +the raw, unaltered data, and should be left alone to make sure we don't delete +or modify it. In contrast, our script will generate the contents of the `data_output` +directory, so even if the files it contains are deleted, we can always +re-generate them. + +Let's use `write_csv()` to save the rna_wide table that we have created previously. + +```{r, purl=TRUE, eval=FALSE} +write_csv(rna_wide, file = "data_output/rna_wide.csv") +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Tabular data in R using the tidyverse meta-package + +:::::::::::::::::::::::::::::::::::::::::::::::::: From bb0e053e42cd343e4ab8f5ad4ee0ebfd3a8dbff1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:06 +0900 Subject: [PATCH 027/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 1103 +++++++++++++++++++++++ 1 file changed, 1103 insertions(+) create mode 100644 locale/fr/episodes/40-visualization.Rmd diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd new file mode 100644 index 000000000..b1ab2920c --- /dev/null +++ b/locale/fr/episodes/40-visualization.Rmd @@ -0,0 +1,1103 @@ +--- +source: Rmd +title: Data visualization +teaching: 60 +exercises: 60 +--- + +```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Produce scatter plots, boxplots, line plots, etc. using ggplot. 
+- Set universal plot settings. +- Describe what faceting is and apply faceting in ggplot. +- Modify the aesthetics of an existing ggplot plot (including axis labels and color). +- Build complex and customized plots from data in a data frame. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r vis_setup, echo=FALSE} +rna <- read.csv("data/rnaseq.csv") +``` + +## Data Visualization + +We start by loading the required packages. **`ggplot2`** is included in +the **`tidyverse`** package. + +```{r load-package, message=FALSE, purl=TRUE} +library("tidyverse") +``` + +If it is no longer in the workspace, load the data we saved in the previous +lesson. + +```{r load-data, eval=FALSE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +The Data Visualization Cheat Sheet +covers the basics and more advanced features of `ggplot2`. Besides +serving as a reminder, it helps to get an overview of the +many data representations available in the package. The following video +tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen +are also very instructive. + +## Plotting with `ggplot2` + +`ggplot2` is a plotting package that makes it simple to create complex +plots from data in a data frame. It provides a more programmatic +interface for specifying what variables to plot, how they are displayed, +and general visual properties. The theoretical foundation that supports +`ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this +approach, we only need minimal changes if the underlying data change or +if we decide to change from a bar plot to a scatterplot. This helps in +creating publication quality plots with minimal amounts of adjustments +and tweaking.
+ +There is a book about `ggplot2` (@ggplot2book) that provides a good +overview, but it is outdated. The 3rd edition is in preparation and will +be [freely available online](https://ggplot2-book.org/). The `ggplot2` +webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. + +`ggplot2` functions work best with data in the 'long' format, i.e., a column for +every dimension, and a row for every observation. Well-structured data +will save you lots of time when making figures with `ggplot2`. + +ggplot graphics are built step by step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and +customization of plots. + +> The idea behind the Grammar of Graphics is that you can build every +> graph from the same 3 components: (1) a data set, (2) a coordinate system, +> and (3) geoms, i.e. visual marks that represent data points [^three_comp_ggplot2] + +[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +To build a ggplot, we will use the following basic template that can be +used for different types of plots: + +``` +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +``` + +- use the `ggplot()` function and bind the plot to a specific **data + frame** using the `data` argument + +```{r, eval=FALSE} +ggplot(data = rna) +``` + +- define a **mapping** (using the aesthetic (`aes`) function), by + selecting the variables to be plotted and specifying how to present + them in the graph, e.g. as x/y positions or characteristics such as + size, shape, color, etc. + +```{r, eval=FALSE} +ggplot(data = rna, mapping = aes(x = expression)) +``` + +- add '**geoms**' - geometries, or graphical representations of the + data in the plot (points, lines, bars).
`ggplot2` offers many + different geoms; we will use some common ones today, including: + + ``` + * `geom_point()` for scatter plots, dot plots, etc. + * `geom_histogram()` for histograms + * `geom_boxplot()` for, well, boxplots! + * `geom_line()` for trend lines, time series, etc. + ``` + +To add a geom(etry) to the plot use the `+` operator. Let's use +`geom_histogram()` first: + +```{r first-ggplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() +``` + +The `+` in the `ggplot2` package is particularly useful because it +allows you to modify existing `ggplot` objects. This means you can +easily set up plot templates and conveniently explore different types of +plots, so the above plot can also be generated with code like this: + +```{r, eval=FALSE, purl=TRUE} +# Assign plot to a variable +rna_plot <- ggplot(data = rna, + mapping = aes(x = expression)) + +# Draw the plot +rna_plot + geom_histogram() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +You have probably noticed an automatic message that appears when +drawing the histogram: + +```{r, echo=FALSE, fig.show="hide"} +ggplot(rna, aes(x = expression)) + + geom_histogram() +``` + +Change the arguments `bins` or `binwidth` of `geom_histogram()` to +change the number or width of the bins. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +# change bins +ggplot(rna, aes(x = expression)) + + geom_histogram(bins = 15) + +# change binwidth +ggplot(rna, aes(x = expression)) + + geom_histogram(binwidth = 2000) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can observe here that the data are skewed to the right. We can apply +log2 transformation to have a more symmetric distribution. Note that we +add here a small constant value (`+1`) to avoid having `-Inf` values +returned for expression values equal to 0. 
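+
+To see why the constant matters, here is a minimal sketch on made-up values (the `toy_counts` vector is hypothetical and not part of the `rna` data): `log2()` of zero is `-Inf`, whereas adding 1 first maps zero counts to 0.
+
+```{r}
+# Hypothetical expression counts, including a zero (illustration only)
+toy_counts <- c(0, 1, 7, 1023)
+log2(toy_counts)      # the first value is -Inf
+log2(toy_counts + 1)  # the first value is now 0
+```
+
+The same `+ 1` shift is applied to the `expression` column of `rna` in the next chunk.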
+ +```{r log-transfo, cache=FALSE, purl=TRUE} +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) +``` + +If we now draw the histogram of the log2-transformed expressions, the +distribution is indeed closer to a normal distribution. + +```{r second-ggplot, cache=FALSE, purl=TRUE} +ggplot(rna, aes(x = expression_log)) + geom_histogram() +``` + +From now on we will work on the log-transformed expression values. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Another way to visualize this transformation is to consider the scale +of the observations. For example, it may be worth changing the scale +of the axis to better distribute the observations in the space of the +plot. Changing the scale of the axes is done similarly to +adding/modifying other components (i.e., by incrementally adding +commands). Try making this modification: + +- Represent the un-transformed expression on the log10 scale; see + `scale_x_log10()`. Compare it with the previous graph. Why do you + now have warning messages appearing? + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE, echo=TRUE} +ggplot(data = rna,mapping = aes(x = expression))+ + geom_histogram() + + scale_x_log10() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +**Notes** + +- Anything you put in the `ggplot()` function can be seen by any geom + layers that you add (i.e., these are global plot settings). This + includes the x- and y-axis mapping you set up in `aes()`. +- You can also specify mappings for a given geom independently of the + mappings defined globally in the `ggplot()` function. +- The `+` sign used to add new layers must be placed at the end of the + line containing the _previous_ layer. If, instead, the `+` sign is + added at the beginning of the line containing the new layer, + `ggplot2` will not add the new layer and will return an error + message. 
+ +```{r, eval=FALSE} +# This is the correct syntax for adding layers +rna_plot + + geom_histogram() + +# This will not add the new layer and will return an error message +rna_plot + + geom_histogram() +``` + +## Building your plots iteratively + +We will now draw a scatter plot with two continuous variables and the +`geom_point()` function. This graph will represent the log2 fold changes +of expression comparing time 8 versus time 0, and time 4 versus time 0. +To this end, we first need to compute the means of the log-transformed +expression values by gene and time, then the log fold changes by +subtracting the mean log expressions between time 8 and time 0 and +between time 4 and time 0. Note that we also include here the gene +biotype that we will use later on to represent the genes. We will save +the fold changes in a new data frame called `rna_fc.` + +```{r rna_fc, cache=FALSE, purl=TRUE} +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) + +``` + +We can then build a ggplot with the newly created dataset `rna_fc`. +Building plots with `ggplot2` is typically an iterative process. We +start by defining the dataset we'll use, lay out the axes, and choose a +geom: + +```{r create-ggplot-object, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point() +``` + +Then, we start modifying this plot to extract more information from it. 
+For instance, we can add transparency (`alpha`) to avoid overplotting: + +```{r adding-transparency, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3) +``` + +We can also add colors for all the points: + +```{r adding-colors, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") +``` + +Or to color each gene in the plot differently, you could use a vector as +an input to the argument **color**. `ggplot2` will provide a different +color corresponding to different values in the vector. Here is an +example where we color with `gene_biotype`: + +```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) + +``` + +We can also specify the colors directly inside the mapping provided in +the `ggplot()` function. This will be seen by any geom layers and the +mapping will be determined by the x- and y-axis set up in `aes()`. + +```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) +``` + +Finally, we could also add a diagonal line with the `geom_abline()` +function: + +```{r adding-diag, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +Notice that we can change the geom layer from `geom_point` to +`geom_jitter` and colors will still be determined by `gene_biotype`. 
+ +```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_jitter(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +```{r, echo=FALSE, message=FALSE} +library("hexbin") +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Scatter plots can be useful exploratory tools for small datasets. For +data sets with large numbers of observations, such as the `rna_fc` +data set, overplotting of points can be a limitation of scatter plots. +One strategy for handling such settings is to use hexagonal binning of +observations. The plot space is tessellated into hexagons. Each +hexagon is assigned a color based on the number of observations that +fall within its boundaries. + +- To use hexagonal binning in `ggplot2`, first install the R package + `hexbin` from CRAN and load it. + +- Then use the `geom_hex()` function to produce the hexbin figure. + +- What are the relative strengths and weaknesses of a hexagonal bin + plot compared to a scatter plot? Examine the above scatter plot + and compare it with the hexagonal bin plot that you created. + +::::::::::::::: solution + +## Solution + +```{r, eval=FALSE, purl=TRUE} +install.packages("hexbin") +``` + +```{r, purl=TRUE} +library("hexbin") + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_hex() + + geom_abline(intercept = 0) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a scatter plot of `expression_log` +over `sample` from the `rna` dataset with the time showing in +different colors. Is this a good way to show this type of data? 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Boxplot + +We can use boxplots to visualize the distribution of gene expressions +within each sample: + +```{r boxplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot() +``` + +By adding points to boxplot, we can have a better idea of the number of +measurements and of their distribution: + +```{r boxplot-with-points, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Note how the boxplot layer is in front of the jitter layer? What do +you need to change in the code to put the boxplot below the points? + +::::::::::::::: solution + +## Solution + +We should switch the order of these two geoms: + +```{r boxplot-with-points2, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot(alpha = 0) + + geom_jitter(alpha = 0.2, color = "tomato") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You may notice that the values on the x-axis are still not properly +readable. Let's change the orientation of the labels and adjust them +vertically and horizontally so they don't overlap. 
You can use a +90-degree angle, or experiment to find the appropriate angle for +diagonally oriented labels: + +```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Add color to the data points on your boxplot according to the duration +of the infection (`time`). + +_Hint:_ Check the class for `time`. Consider changing the class of +`time` from integer to factor directly in the ggplot mapping. Why does +this change how R makes the graph? + +::::::::::::::: solution + +## Solution + +```{r boxplot-color-time, cache=FALSE, purl=TRUE} +# time as integer +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = time)) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + +# time as factor +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Boxplots are useful summaries, but hide the _shape_ of the +distribution. For example, if the distribution is bimodal, we would +not see it in a boxplot. An alternative to the boxplot is the violin +plot, where the shape (of the density of points) is drawn. + +- Replace the box plot with a violin plot; see `geom_violin()`. Fill + in the violins according to the time with the argument `fill`. 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +- Modify the violin plot to fill in the violins by `sex`. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Line plots + +Let's calculate the mean expression per duration of the infection for +the 10 genes having the highest log fold changes comparing time 8 versus +time 0. First, we need to select the genes and create a subset of `rna` +called `sub_rna` containing the 10 selected genes, then we need to group +the data and calculate the mean gene expression within each group: + +```{r, purl=TRUE} +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) + +genes_selected <- rna_fc$gene[1:10] + +sub_rna <- rna %>% + filter(gene %in% genes_selected) + +mean_exp_by_time <- sub_rna %>% + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time +``` + +We can build the line plot with duration of the infection on the x-axis +and the mean expression on the y-axis: + +```{r first-time-series, purl=TRUE} +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + + geom_line() +``` + +Unfortunately, this does not work because we plotted data for all the +genes together. 
We need to tell ggplot to draw a line for each gene by +modifying the aesthetic function to include `group = gene`: + +```{r time-series-by-gene, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + + geom_line() +``` + +We will be able to distinguish genes in the plot if we add colors (using +`color` also automatically groups the data): + +```{r time-series-with-colors, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() +``` + +## Faceting + +`ggplot2` has a special technique called _faceting_ that allows the user +to split one plot into multiple (sub) plots based on a factor included +in the dataset. These different subplots inherit the same properties +(axes limits, ticks, ...) to facilitate their direct comparison. We will +use it to make a line plot across time for each gene: + +```{r first-facet, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) +``` + +Here both x- and y-axis have the same scale for all the subplots. You +can change this default behavior by modifying `scales` in order to allow +a free scale for the y-axis: + +```{r first-facet-scales, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Now we would like to split the line in each plot by the sex of the mice. 
+To do that we need to calculate the mean expression in the data frame +grouped by `gene`, `time`, and `sex`: + +```{r data-facet-by-gene-and-sex, purl=TRUE} +mean_exp_by_time_sex <- sub_rna %>% + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time_sex +``` + +We can now make the faceted plot by splitting further by sex using +`color` (within a single plot): + +```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Usually plots with white background look more readable when printed. We +can set the background to white using the function `theme_bw()`. +Additionally, we can remove the grid: + +```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a plot that depicts how the +average expression of each chromosome changes through the duration of +infection. + +::::::::::::::: solution + +## Solution + +```{r mean-exp-chromosome-time-series, purl=TRUE} +mean_exp_by_chromosome <- rna %>% + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + + geom_line() + + facet_wrap(~ chromosome_name, scales = "free_y") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The `facet_wrap` geometry extracts plots into an arbitrary number of +dimensions to allow them to cleanly fit on one page. 
On the other hand, +the `facet_grid` geometry allows you to explicitly specify how you want +your plots to be arranged via formula notation (`rows ~ columns`; a `.` +can be used as a placeholder that indicates only one row or column). + +Let's modify the previous plot to compare how the mean gene expression +of males and females has changed through time: + +```{r mean-exp-time-facet-sex-rows, purl=TRUE} +# One column, facet by rows +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(sex ~ .) +``` + +```{r mean-exp-time-facet-sex-columns, purl=TRUE} +# One row, facet by column +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(. ~ sex) +``` + +## `ggplot2` themes + +In addition to `theme_bw()`, which changes the plot background to white, +`ggplot2` comes with several other themes which can be useful to quickly +change the look of your visualization. The complete list of themes is +available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` and `theme_light()` are popular, and `theme_void()` +can be useful as a starting point to create a new hand-crafted theme. + +The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +package provides a wide variety of options (including an Excel 2003 +theme). The ggplot2 extensions website provides a list of +packages that extend the capabilities of `ggplot2`, including additional +themes. + +## Customisation + +Let's come back to the faceted plot of mean expression by time and gene, +colored by sex. + +Take a look at the ggplot2 cheat sheet, +and think of ways you could improve the plot. 
+ +Now, we can change names of axes to something more informative than +'time' and 'mean_exp', and add a title to the figure: + +```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") +``` + +The axes have more informative names, but their readability can be +improved by increasing the font size: + +```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16)) +``` + +Note that it is also possible to change the fonts of your plots. If you +are on Windows, you may have to install the . + +We can further customize the color of x- and y-axis text, the color of +the grid, etc. We can also for example move the legend to the top by +setting `legend.position` to `"top"`. 
+ +```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +``` + +If you like the changes you created better than the default theme, you +can save them as an object to be able to easily apply them to other +plots you may create. Here is an example with the histogram we have +previously created. + +```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", + size = 12), + axis.text.y = element_text(colour = "royalblue4", + size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1")) + +ggplot(rna, aes(x = expression_log)) + + geom_histogram(bins = 20) + + blue_theme +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +With all of this information in hand, please take another five minutes +to either improve one of the plots generated in this exercise or +create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet +for inspiration. Here are some ideas: + +- See if you can change the thickness of the lines. +- Can you find a way to change the name of the legend? What about + its labels? 
(hint: look for a ggplot function starting with + `scale_`) +- Try using a different color palette or manually specifying the + colors for the lines (see + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + +::::::::::::::: solution + +## Solution + +For example, based on this plot: + +```{r, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +We can customize it the following ways: + +```{r, purl=TRUE} +# change the thickness of the lines +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line(size=1.5) + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + +# change the name of the legend and the labels +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_discrete(name = "Gender", labels = c("F", "M")) + +# using a different color palette +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + +# manually specifying the colors +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Composing plots + +Faceting is a 
great tool for splitting one plot into multiple subplots, +but sometimes you may want to produce a single figure that contains +multiple independent plots, i.e. plots that are based on different +variables or even different data frames. + +Let's start by creating the two plots that we want to arrange next to +each other: + +The first graph counts the number of unique genes per chromosome. We +first need to reorder the levels of `chromosome_name` and filter the +unique genes per chromosome. We also change the scale of the y-axis to a +log10 scale for better readability. + +```{r sub1, purl=TRUE} +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + + scale_y_log10() + +count_gene_chromosome +``` + +Below, we also remove the legend altogether by setting the +`legend.position` to `"none"`. + +```{r sub2, purl=TRUE} +exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), + color=sex)) + + geom_boxplot(alpha = 0) + + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") + +exp_boxplot_sex +``` + +The [**patchwork**](https://github.com/thomasp85/patchwork) package +provides an elegant approach to combining figures using the `+` to +arrange figures (typically side by side). More specifically the `|` +explicitly arranges them side by side and `/` stacks them on top of each +other. 
+ +```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("patchwork") +``` + +```{r patchworkplot1, purl=TRUE} +library("patchwork") +count_gene_chromosome + exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex +``` + +```{r patchwork2, purl=TRUE} +count_gene_chromosome / exp_boxplot_sex +``` + +We can further control the layout of the final composition with +`plot_layout` to create more complex layouts: + +```{r patchwork3, purl=TRUE} +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +``` + +```{r patchwork4, purl=TRUE} +count_gene_chromosome + + (count_gene_chromosome + exp_boxplot_sex) + + exp_boxplot_sex + + plot_layout(ncol = 1) +``` + +The last plot can also be created using the `|` and `/` composers: + +```{r patchwork5, purl=TRUE} +count_gene_chromosome / + (count_gene_chromosome | exp_boxplot_sex) / + exp_boxplot_sex +``` + +Learn more about `patchwork` on its +[webpage](https://patchwork.data-imaginist.com/) or in this +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). + +Another option is the **`gridExtra`** package, which lets you combine +separate ggplots into a single figure using `grid.arrange()`: + +```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("gridExtra") +``` + +```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +``` + +In addition to the `ncol` and `nrow` arguments, used to make simple +arrangements, there are tools for constructing more complex +layouts. + +## Exporting plots + +After creating your plot, you can save it to a file in your favorite +format. The Export tab in the **Plot** pane in RStudio will save your +plots at low resolution, which will not be accepted by many journals and +will not scale well for posters. 
+ +Instead, use the `ggsave()` function, which allows you to easily change the +dimensions and resolution of your plot by adjusting the appropriate +arguments (`width`, `height` and `dpi`). + +Make sure you have the `fig_output/` folder in your working directory. + +```{r ggsave-example, eval=FALSE, purl=TRUE} +my_plot <- ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + + theme_bw() + + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, + height = 10) + +# This also works for grid.arrange() plots +combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) +``` + +Note: The parameters `width` and `height` also determine the font size +in the saved plot. + +```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} +### Final plotting challenge: +## With all of this information in hand, please take another five +## minutes to either improve one of the plots generated in this +## exercise or create a beautiful graph of your own. Use the RStudio +## ggplot2 cheat sheet for inspiration: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +``` + +## Other packages for visualisation + +`ggplot2` is a very powerful package that fits very nicely in our _tidy +data_ and _tidy tools_ pipeline. There are other visualization packages +in R that shouldn't be ignored. 
+ +### Base graphics + +The default graphics system that comes with R, often called _base R +graphics_, is simple and fast. It is based on the _painter's or canvas +model_, where different outputs are directly overlaid on top of each +other (see figure @ref(fig:paintermodel)). This is a fundamental +difference from `ggplot2` (and from `lattice`, described below), which +return dedicated objects that are rendered on screen or in a file and +that can even be updated. + +```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} +par(mfrow = c(1, 3)) +plot(1:20, main = "First layer, produced with plot(1:20)") + +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +abline(h = 10, col = "red") + +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +abline(h = 10, col = "red") +rect(5, 5, 15, 15, lwd = 3) +``` + +Another main difference is that base graphics' plotting functions try to +do _the right thing_ based on their input type, i.e. they will adapt +their behaviour based on the class of their input. This is again very +different from `ggplot2`, which only accepts data frames +as input and requires plots to be constructed bit by bit. + +```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) of vectors (left) or matrices (right)."} +par(mfrow = c(2, 2)) +boxplot(rnorm(100), + main = "Boxplot of rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot of matrix(rnorm(100), ncol = 10)") +hist(rnorm(100)) +hist(matrix(rnorm(100), ncol = 10)) +``` + +The out-of-the-box approach in base graphics can be very efficient for +simple, standard figures that can be produced very quickly with a +single line of code and a single function such as `plot`, or `hist`, or +`boxplot`, ... 
The defaults are however not always the most appealing +and tuning of figures, especially when they become more complex (for +example to produce facets), can become lengthy and cumbersome. + +### The lattice package + +The **`lattice`** package is similar to `ggplot2` in that it uses +data frames as input, returns graphical objects and supports faceting. +`lattice` however isn't based on the grammar of graphics and has a more +convoluted interface. + +A good reference for the `lattice` package is @latticebook. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 7a6b1dcc7c99fc20bb044f88f7d19a7db038e6af Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:07 +0900 Subject: [PATCH 028/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 1103 +++++++++++++++++++++++ 1 file changed, 1103 insertions(+) create mode 100644 locale/es/episodes/40-visualization.Rmd diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd new file mode 100644 index 000000000..1c3b31c29 --- /dev/null +++ b/locale/es/episodes/40-visualization.Rmd @@ -0,0 +1,1103 @@ +--- +source: Rmd +title: Data visualization +teaching: 60 +exercises: 60 +--- + +```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Produce scatter plots, boxplots, line plots, etc. using ggplot. +- Set universal plot settings. +- Describe what faceting is and apply faceting in ggplot. +- Modify the aesthetics of an existing ggplot plot (including axis labels and color). +- Build complex and customized plots from data in a data frame. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r vis_setup, echo=FALSE} +rna <- read.csv("data/rnaseq.csv") +``` + +## Data Visualization + +We start by loading the required packages. **`ggplot2`** is included in +the **`tidyverse`** package. + +```{r load-package, message=FALSE, purl=TRUE} +library("tidyverse") +``` + +If it is no longer in the workspace, load the data we saved in the previous +lesson. + +```{r load-data, eval=FALSE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +The Data Visualization Cheat +Sheet +covers the basics and more advanced features of `ggplot2`. In addition +to serving as a reminder, it gives an overview of the +many data representations available in the package. The following video +tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen +are also very instructive. + +## Plotting with `ggplot2` + +`ggplot2` is a plotting package that makes it simple to create complex +plots from data in a data frame. It provides a more programmatic +interface for specifying what variables to plot, how they are displayed, +and general visual properties. The theoretical foundation that supports +`ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this +approach, we only need minimal changes if the underlying data change or +if we decide to change from a bar plot to a scatterplot. This helps in +creating publication-quality plots with minimal amounts of adjustments +and tweaking. + +There is a book about `ggplot2` (@ggplot2book) that provides a good +overview, but it is outdated. The 3rd edition is in preparation and will +be [freely available online](https://ggplot2-book.org/). 
The `ggplot2` +webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. + +`ggplot2` functions work best with data in the 'long' format, i.e., a column for +every dimension, and a row for every observation. Well-structured data +will save you lots of time when making figures with `ggplot2`. + +ggplot graphics are built step by step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and +customization of plots. + +> The idea behind the Grammar of Graphics is that you can build every +> graph from the same 3 components: (1) a data set, (2) a coordinate system, +> and (3) geoms, i.e. visual marks that represent data points [^three_comp_ggplot2] + +[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +To build a ggplot, we will use the following basic template that can be +used for different types of plots: + +``` +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +``` + +- use the `ggplot()` function and bind the plot to a specific **data + frame** using the `data` argument + +```{r, eval=FALSE} +ggplot(data = rna) +``` + +- define a **mapping** (using the aesthetic (`aes`) function), by + selecting the variables to be plotted and specifying how to present + them in the graph, e.g. as x/y positions or characteristics such as + size, shape, color, etc. + +```{r, eval=FALSE} +ggplot(data = rna, mapping = aes(x = expression)) +``` + +- add '**geoms**' - geometries, or graphical representations of the + data in the plot (points, lines, bars). `ggplot2` offers many + different geoms; we will use some common ones today, including: + + ``` + * `geom_point()` for scatter plots, dot plots, etc. + * `geom_histogram()` for histograms + * `geom_boxplot()` for, well, boxplots! + * `geom_line()` for trend lines, time series, etc. 
+ ``` + +To add a geom(etry) to the plot use the `+` operator. Let's use +`geom_histogram()` first: + +```{r first-ggplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() +``` + +The `+` in the `ggplot2` package is particularly useful because it +allows you to modify existing `ggplot` objects. This means you can +easily set up plot templates and conveniently explore different types of +plots, so the above plot can also be generated with code like this: + +```{r, eval=FALSE, purl=TRUE} +# Assign plot to a variable +rna_plot <- ggplot(data = rna, + mapping = aes(x = expression)) + +# Draw the plot +rna_plot + geom_histogram() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +You have probably noticed an automatic message that appears when +drawing the histogram: + +```{r, echo=FALSE, fig.show="hide"} +ggplot(rna, aes(x = expression)) + + geom_histogram() +``` + +Change the arguments `bins` or `binwidth` of `geom_histogram()` to +change the number or width of the bins. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +# change bins +ggplot(rna, aes(x = expression)) + + geom_histogram(bins = 15) + +# change binwidth +ggplot(rna, aes(x = expression)) + + geom_histogram(binwidth = 2000) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can observe here that the data are skewed to the right. We can apply +log2 transformation to have a more symmetric distribution. Note that we +add here a small constant value (`+1`) to avoid having `-Inf` values +returned for expression values equal to 0. + +```{r log-transfo, cache=FALSE, purl=TRUE} +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) +``` + +If we now draw the histogram of the log2-transformed expressions, the +distribution is indeed closer to a normal distribution. 
+ +```{r second-ggplot, cache=FALSE, purl=TRUE} +ggplot(rna, aes(x = expression_log)) + geom_histogram() +``` + +From now on we will work on the log-transformed expression values. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Another way to visualize this transformation is to consider the scale +of the observations. For example, it may be worth changing the scale +of the axis to better distribute the observations in the space of the +plot. Changing the scale of the axes is done similarly to +adding/modifying other components (i.e., by incrementally adding +commands). Try making this modification: + +- Represent the un-transformed expression on the log10 scale; see + `scale_x_log10()`. Compare it with the previous graph. Why do you + now have warning messages appearing? + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE, echo=TRUE} +ggplot(data = rna,mapping = aes(x = expression))+ + geom_histogram() + + scale_x_log10() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +**Notes** + +- Anything you put in the `ggplot()` function can be seen by any geom + layers that you add (i.e., these are global plot settings). This + includes the x- and y-axis mapping you set up in `aes()`. +- You can also specify mappings for a given geom independently of the + mappings defined globally in the `ggplot()` function. +- The `+` sign used to add new layers must be placed at the end of the + line containing the _previous_ layer. If, instead, the `+` sign is + added at the beginning of the line containing the new layer, + `ggplot2` will not add the new layer and will return an error + message. 
+ +```{r, eval=FALSE} +# This is the correct syntax for adding layers +rna_plot + + geom_histogram() + +# This will not add the new layer and will return an error message +rna_plot + + geom_histogram() +``` + +## Building your plots iteratively + +We will now draw a scatter plot with two continuous variables and the +`geom_point()` function. This graph will represent the log2 fold changes +of expression comparing time 8 versus time 0, and time 4 versus time 0. +To this end, we first need to compute the means of the log-transformed +expression values by gene and time, then the log fold changes by +subtracting the mean log expression at time 0 from the mean log +expression at time 8 and at time 4. Note that we also include here the gene +biotype that we will use later on to represent the genes. We will save +the fold changes in a new data frame called `rna_fc`. + +```{r rna_fc, cache=FALSE, purl=TRUE} +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) + +``` + +We can then build a ggplot with the newly created dataset `rna_fc`. +Building plots with `ggplot2` is typically an iterative process. We +start by defining the dataset we'll use, lay out the axes, and choose a +geom: + +```{r create-ggplot-object, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point() +``` + +Then, we start modifying this plot to extract more information from it. 
+For instance, we can add transparency (`alpha`) to avoid overplotting: + +```{r adding-transparency, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3) +``` + +We can also add colors for all the points: + +```{r adding-colors, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") +``` + +Or to color each gene in the plot differently, you could use a vector as +an input to the argument **color**. `ggplot2` will provide a different +color corresponding to different values in the vector. Here is an +example where we color with `gene_biotype`: + +```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) + +``` + +We can also specify the colors directly inside the mapping provided in +the `ggplot()` function. This will be seen by any geom layers and the +mapping will be determined by the x- and y-axis set up in `aes()`. + +```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) +``` + +Finally, we could also add a diagonal line with the `geom_abline()` +function: + +```{r adding-diag, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +Notice that we can change the geom layer from `geom_point` to +`geom_jitter` and colors will still be determined by `gene_biotype`. 
+ +```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_jitter(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +```{r, echo=FALSE, message=FALSE} +library("hexbin") +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Scatter plots can be useful exploratory tools for small datasets. For +data sets with large numbers of observations, such as the `rna_fc` +data set, overplotting of points can be a limitation of scatter plots. +One strategy for handling such settings is to use hexagonal binning of +observations. The plot space is tessellated into hexagons. Each +hexagon is assigned a color based on the number of observations that +fall within its boundaries. + +- To use hexagonal binning in `ggplot2`, first install the R package + `hexbin` from CRAN and load it. + +- Then use the `geom_hex()` function to produce the hexbin figure. + +- What are the relative strengths and weaknesses of a hexagonal bin + plot compared to a scatter plot? Examine the above scatter plot + and compare it with the hexagonal bin plot that you created. + +::::::::::::::: solution + +## Solution + +```{r, eval=FALSE, purl=TRUE} +install.packages("hexbin") +``` + +```{r, purl=TRUE} +library("hexbin") + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_hex() + + geom_abline(intercept = 0) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a scatter plot of `expression_log` +over `sample` from the `rna` dataset with the time showing in +different colors. Is this a good way to show this type of data? 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Boxplot + +We can use boxplots to visualize the distribution of gene expressions +within each sample: + +```{r boxplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot() +``` + +By adding points to the boxplot, we can get a better idea of the number of +measurements and of their distribution: + +```{r boxplot-with-points, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Note how the boxplot layer is in front of the jitter layer? What do +you need to change in the code to put the boxplot below the points? + +::::::::::::::: solution + +## Solution + +We should switch the order of these two geoms: + +```{r boxplot-with-points2, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot(alpha = 0) + + geom_jitter(alpha = 0.2, color = "tomato") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You may notice that the values on the x-axis are still not properly +readable. Let's change the orientation of the labels and adjust them +vertically and horizontally so they don't overlap.
You can use a +90-degree angle, or experiment to find the appropriate angle for +diagonally oriented labels: + +```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Add color to the data points on your boxplot according to the duration +of the infection (`time`). + +_Hint:_ Check the class for `time`. Consider changing the class of +`time` from integer to factor directly in the ggplot mapping. Why does +this change how R makes the graph? + +::::::::::::::: solution + +## Solution + +```{r boxplot-color-time, cache=FALSE, purl=TRUE} +# time as integer +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = time)) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + +# time as factor +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Boxplots are useful summaries, but hide the _shape_ of the +distribution. For example, if the distribution is bimodal, we would +not see it in a boxplot. An alternative to the boxplot is the violin +plot, where the shape (of the density of points) is drawn. + +- Replace the box plot with a violin plot; see `geom_violin()`. Fill + in the violins according to the time with the argument `fill`. 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +- Modify the violin plot to fill in the violins by `sex`. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Line plots + +Let's calculate the mean expression per duration of the infection for +the 10 genes having the highest log fold changes comparing time 8 versus +time 0. First, we need to select the genes and create a subset of `rna` +called `sub_rna` containing the 10 selected genes, then we need to group +the data and calculate the mean gene expression within each group: + +```{r, purl=TRUE} +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) + +genes_selected <- rna_fc$gene[1:10] + +sub_rna <- rna %>% + filter(gene %in% genes_selected) + +mean_exp_by_time <- sub_rna %>% + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time +``` + +We can build the line plot with duration of the infection on the x-axis +and the mean expression on the y-axis: + +```{r first-time-series, purl=TRUE} +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + + geom_line() +``` + +Unfortunately, this does not work because we plotted data for all the +genes together. 
We need to tell ggplot to draw a line for each gene by +modifying the aesthetic function to include `group = gene`: + +```{r time-series-by-gene, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + + geom_line() +``` + +We will be able to distinguish genes in the plot if we add colors (using +`color` also automatically groups the data): + +```{r time-series-with-colors, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() +``` + +## Faceting + +`ggplot2` has a special technique called _faceting_ that allows the user +to split one plot into multiple (sub) plots based on a factor included +in the dataset. These different subplots inherit the same properties +(axes limits, ticks, ...) to facilitate their direct comparison. We will +use it to make a line plot across time for each gene: + +```{r first-facet, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) +``` + +Here both x- and y-axis have the same scale for all the subplots. You +can change this default behavior by modifying `scales` in order to allow +a free scale for the y-axis: + +```{r first-facet-scales, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Now we would like to split the line in each plot by the sex of the mice. 
+To do that we need to calculate the mean expression in the data frame +grouped by `gene`, `time`, and `sex`: + +```{r data-facet-by-gene-and-sex, purl=TRUE} +mean_exp_by_time_sex <- sub_rna %>% + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time_sex +``` + +We can now make the faceted plot by splitting further by sex using +`color` (within a single plot): + +```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Usually plots with white background look more readable when printed. We +can set the background to white using the function `theme_bw()`. +Additionally, we can remove the grid: + +```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a plot that depicts how the +average expression of each chromosome changes through the duration of +infection. + +::::::::::::::: solution + +## Solution + +```{r mean-exp-chromosome-time-series, purl=TRUE} +mean_exp_by_chromosome <- rna %>% + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + + geom_line() + + facet_wrap(~ chromosome_name, scales = "free_y") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The `facet_wrap` geometry extracts plots into an arbitrary number of +dimensions to allow them to cleanly fit on one page. 
On the other hand, +the `facet_grid` geometry allows you to explicitly specify how you want +your plots to be arranged via formula notation (`rows ~ columns`; a `.` +can be used as a placeholder that indicates only one row or column). + +Let's modify the previous plot to compare how the mean gene expression +of males and females has changed through time: + +```{r mean-exp-time-facet-sex-rows, purl=TRUE} +# One column, facet by rows +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(sex ~ .) +``` + +```{r mean-exp-time-facet-sex-columns, purl=TRUE} +# One row, facet by column +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(. ~ sex) +``` + +## `ggplot2` themes + +In addition to `theme_bw()`, which changes the plot background to white, +`ggplot2` comes with several other themes which can be useful to quickly +change the look of your visualization. The complete list of themes is +available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` and `theme_light()` are popular, and `theme_void()` +can be useful as a starting point to create a new hand-crafted theme. + +The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +package provides a wide variety of options (including an Excel 2003 +theme). The ggplot2 extensions website provides a list of +packages that extend the capabilities of `ggplot2`, including additional +themes. + +## Customisation + +Let's come back to the faceted plot of mean expression by time and gene, +colored by sex. + +Take a look at the ggplot2 cheat sheet, +and think of ways you could improve the plot.
+ +Now, we can change the names of the axes to something more informative than +'time' and 'mean_exp', and add a title to the figure: + +```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") +``` + +The axes have more informative names, but their readability can be +improved by increasing the font size: + +```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16)) +``` + +Note that it is also possible to change the fonts of your plots. If you +are on Windows, you may have to install the **extrafont** package. + +We can further customize the color of x- and y-axis text, the color of +the grid, etc. We can also for example move the legend to the top by +setting `legend.position` to `"top"`.
+ +```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +``` + +If you like the changes you created better than the default theme, you +can save them as an object to be able to easily apply them to other +plots you may create. Here is an example with the histogram we have +previously created. + +```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", + size = 12), + axis.text.y = element_text(colour = "royalblue4", + size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1")) + +ggplot(rna, aes(x = expression_log)) + + geom_histogram(bins = 20) + + blue_theme +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +With all of this information in hand, please take another five minutes +to either improve one of the plots generated in this exercise or +create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet +for inspiration. Here are some ideas: + +- See if you can change the thickness of the lines. +- Can you find a way to change the name of the legend? What about + its labels?
(hint: look for a ggplot function starting with + `scale_`) +- Try using a different color palette or manually specifying the + colors for the lines (see + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + +::::::::::::::: solution + +## Solution + +For example, based on this plot: + +```{r, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +We can customize it the following ways: + +```{r, purl=TRUE} +# change the thickness of the lines +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line(size=1.5) + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + +# change the name of the legend and the labels +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_discrete(name = "Gender", labels = c("F", "M")) + +# using a different color palette +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + +# manually specifying the colors +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Composing plots + +Faceting is a 
great tool for splitting one plot into multiple subplots, +but sometimes you may want to produce a single figure that contains +multiple independent plots, i.e. plots that are based on different +variables or even different data frames. + +Let's start by creating the two plots that we want to arrange next to +each other: + +The first graph counts the number of unique genes per chromosome. We +first need to reorder the levels of `chromosome_name` and filter the +unique genes per chromosome. We also change the scale of the y-axis to a +log10 scale for better readability. + +```{r sub1, purl=TRUE} +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + + scale_y_log10() + +count_gene_chromosome +``` + +Below, we also remove the legend altogether by setting the +`legend.position` to `"none"`. + +```{r sub2, purl=TRUE} +exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), + color=sex)) + + geom_boxplot(alpha = 0) + + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") + +exp_boxplot_sex +``` + +The [**patchwork**](https://github.com/thomasp85/patchwork) package +provides an elegant approach to combining figures using the `+` to +arrange figures (typically side by side). More specifically the `|` +explicitly arranges them side by side and `/` stacks them on top of each +other. 
+ +```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("patchwork") +``` + +```{r patchworkplot1, purl=TRUE} +library("patchwork") +count_gene_chromosome + exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex +``` + +```{r patchwork2, purl=TRUE} +count_gene_chromosome / exp_boxplot_sex +``` + +We can further control the layout of the final composition with +`plot_layout` to create more complex layouts: + +```{r patchwork3, purl=TRUE} +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +``` + +```{r patchwork4, purl=TRUE} +count_gene_chromosome + + (count_gene_chromosome + exp_boxplot_sex) + + exp_boxplot_sex + + plot_layout(ncol = 1) +``` + +The last plot can also be created using the `|` and `/` composers: + +```{r patchwork5, purl=TRUE} +count_gene_chromosome / + (count_gene_chromosome | exp_boxplot_sex) / + exp_boxplot_sex +``` + +Learn more about `patchwork` on its +[webpage](https://patchwork.data-imaginist.com/) or in this +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). + +Another option is the **`gridExtra`** package, which allows us to combine +separate ggplots into a single figure using `grid.arrange()`: + +```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("gridExtra") +``` + +```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +``` + +In addition to the `ncol` and `nrow` arguments, used to make simple +arrangements, there are tools for constructing more complex +layouts. + +## Exporting plots + +After creating your plot, you can save it to a file in your favorite +format. The Export tab in the **Plot** pane in RStudio will save your +plots at low resolution, which will not be accepted by many journals and +will not scale well for posters.
+ +Instead, use the `ggsave()` function, which allows you to easily change the +dimensions and resolution of your plot by adjusting the appropriate +arguments (`width`, `height` and `dpi`). + +Make sure you have the `fig_output/` folder in your working directory. + +```{r ggsave-example, eval=FALSE, purl=TRUE} +my_plot <- ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + + theme_bw() + + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, + height = 10) + +# This also works for grid.arrange() plots +combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) +``` + +Note: The parameters `width` and `height` also determine the font size +in the saved plot. + +```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} +### Final plotting challenge: +## With all of this information in hand, please take another five +## minutes to either improve one of the plots generated in this +## exercise or create a beautiful graph of your own. Use the RStudio +## ggplot2 cheat sheet for inspiration: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +``` + +## Other packages for visualisation + +`ggplot2` is a very powerful package that fits very nicely in our _tidy +data_ and _tidy tools_ pipeline. There are other visualization packages +in R that shouldn't be ignored.
+ +### Base graphics + +The default graphics system that comes with R, often called _base R +graphics_, is simple and fast. It is based on the _painter's or canvas +model_, where different outputs are directly overlaid on top of each +other (see figure @ref(fig:paintermodel)). This is a fundamental +difference with `ggplot2` (and with `lattice`, described below), which +return dedicated objects that are rendered on screen or in a file, and +that can even be updated. + +```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} +par(mfrow = c(1, 3)) +plot(1:20, main = "First layer, produced with plot(1:20)") + +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +abline(h = 10, col = "red") + +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +abline(h = 10, col = "red") +rect(5, 5, 15, 15, lwd = 3) +``` + +Another main difference is that base graphics' plotting functions try to +do _the right_ thing based on their input type, i.e. they will adapt +their behaviour based on the class of their input. This is again very +different from what we have in `ggplot2`, which only accepts dataframes +as input, and which requires plots to be constructed bit by bit. + +```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) of vectors (left) or matrices (right)."} +par(mfrow = c(2, 2)) +boxplot(rnorm(100), + main = "Boxplot of rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot of matrix(rnorm(100), ncol = 10)") +hist(rnorm(100)) +hist(matrix(rnorm(100), ncol = 10)) +``` + +The out-of-the-box approach in base graphics can be very efficient for +simple, standard figures, that can be produced very quickly with a +single line of code and a single function such as `plot`, or `hist`, or +`boxplot`, ...
The defaults are however not always the most appealing +and tuning of figures, especially when they become more complex (for +example to produce facets), can become lengthy and cumbersome. + +### The lattice package + +The **`lattice`** package is similar to `ggplot2` in that it uses +dataframes as input, returns graphical objects and supports faceting. +`lattice` however isn't based on the grammar of graphics and has a more +convoluted interface. + +A good reference for the `lattice` package is @latticebook. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From de0d7701f751723524ee6219f2b30ffcd51cc446 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:09 +0900 Subject: [PATCH 029/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 1103 +++++++++++++++++++++++ 1 file changed, 1103 insertions(+) create mode 100644 locale/ja/episodes/40-visualization.Rmd diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd new file mode 100644 index 000000000..425759cbc --- /dev/null +++ b/locale/ja/episodes/40-visualization.Rmd @@ -0,0 +1,1103 @@ +--- +source: Rmd +title: Data visualization +teaching: 60 +exercises: 60 +--- + +```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Produce scatter plots, boxplots, line plots, etc. using ggplot. +- Set universal plot settings. +- Describe what faceting is and apply faceting in ggplot. +- Modify the aesthetics of an existing ggplot plot (including axis labels and color). +- Build complex and customized plots from data in a data frame.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r vis_setup, echo=FALSE} +rna <- read.csv("data/rnaseq.csv") +``` + +## Data Visualization + +We start by loading the required packages. **`ggplot2`** is included in +the **`tidyverse`** package. + +```{r load-package, message=FALSE, purl=TRUE} +library("tidyverse") +``` + +If it is no longer in the workspace, load the data we saved in the previous +lesson. + +```{r load-data, eval=FALSE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +The Data Visualization Cheat +Sheet +covers the basics and more advanced features of `ggplot2`. Besides +serving as a reminder, it helps in getting an overview of the +many data representations available in the package. The following video +tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen +are also very instructive. + +## Plotting with `ggplot2` + +`ggplot2` is a plotting package that makes it simple to create complex +plots from data in a data frame. It provides a more programmatic +interface for specifying what variables to plot, how they are displayed, +and general visual properties. The theoretical foundation that supports +`ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this +approach, we only need minimal changes if the underlying data change or +if we decide to change from a bar plot to a scatterplot. This helps in +creating publication quality plots with minimal amounts of adjustments +and tweaking. + +There is a book about `ggplot2` (@ggplot2book) that provides a good +overview, but it is outdated. The 3rd edition is in preparation and will +be [freely available online](https://ggplot2-book.org/).
The `ggplot2` +webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. + +`ggplot2` functions work best with data in the 'long' format, i.e., a column for +every dimension, and a row for every observation. Well-structured data +will save you lots of time when making figures with `ggplot2`. + +ggplot graphics are built step by step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and +customization of plots. + +> The idea behind the Grammar of Graphics is that you can build every +> graph from the same 3 components: (1) a data set, (2) a coordinate system, +> and (3) geoms — i.e. visual marks that represent data points [^three_comp_ggplot2] + +[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +To build a ggplot, we will use the following basic template that can be +used for different types of plots: + +``` +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +``` + +- use the `ggplot()` function and bind the plot to a specific **data + frame** using the `data` argument + +```{r, eval=FALSE} +ggplot(data = rna) +``` + +- define a **mapping** (using the aesthetic (`aes`) function), by + selecting the variables to be plotted and specifying how to present + them in the graph, e.g. as x/y positions or characteristics such as + size, shape, color, etc. + +```{r, eval=FALSE} +ggplot(data = rna, mapping = aes(x = expression)) +``` + +- add '**geoms**' - geometries, or graphical representations of the + data in the plot (points, lines, bars). `ggplot2` offers many + different geoms; we will use some common ones today, including: + + ``` + * `geom_point()` for scatter plots, dot plots, etc. + * `geom_histogram()` for histograms + * `geom_boxplot()` for, well, boxplots! + * `geom_line()` for trend lines, time series, etc.
+ ``` + +To add a geom(etry) to the plot use the `+` operator. Let's use +`geom_histogram()` first: + +```{r first-ggplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() +``` + +The `+` in the `ggplot2` package is particularly useful because it +allows you to modify existing `ggplot` objects. This means you can +easily set up plot templates and conveniently explore different types of +plots, so the above plot can also be generated with code like this: + +```{r, eval=FALSE, purl=TRUE} +# Assign plot to a variable +rna_plot <- ggplot(data = rna, + mapping = aes(x = expression)) + +# Draw the plot +rna_plot + geom_histogram() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +You have probably noticed an automatic message that appears when +drawing the histogram: + +```{r, echo=FALSE, fig.show="hide"} +ggplot(rna, aes(x = expression)) + + geom_histogram() +``` + +Change the arguments `bins` or `binwidth` of `geom_histogram()` to +change the number or width of the bins. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +# change bins +ggplot(rna, aes(x = expression)) + + geom_histogram(bins = 15) + +# change binwidth +ggplot(rna, aes(x = expression)) + + geom_histogram(binwidth = 2000) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can observe here that the data are skewed to the right. We can apply +log2 transformation to have a more symmetric distribution. Note that we +add here a small constant value (`+1`) to avoid having `-Inf` values +returned for expression values equal to 0. + +```{r log-transfo, cache=FALSE, purl=TRUE} +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) +``` + +If we now draw the histogram of the log2-transformed expressions, the +distribution is indeed closer to a normal distribution. 
+ +```{r second-ggplot, cache=FALSE, purl=TRUE} +ggplot(rna, aes(x = expression_log)) + geom_histogram() +``` + +From now on we will work on the log-transformed expression values. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Another way to visualize this transformation is to consider the scale +of the observations. For example, it may be worth changing the scale +of the axis to better distribute the observations in the space of the +plot. Changing the scale of the axes is done similarly to +adding/modifying other components (i.e., by incrementally adding +commands). Try making this modification: + +- Represent the un-transformed expression on the log10 scale; see + `scale_x_log10()`. Compare it with the previous graph. Why do you + now have warning messages appearing? + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE, echo=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() + + scale_x_log10() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +**Notes** + +- Anything you put in the `ggplot()` function can be seen by any geom + layers that you add (i.e., these are global plot settings). This + includes the x- and y-axis mapping you set up in `aes()`. +- You can also specify mappings for a given geom independently of the + mappings defined globally in the `ggplot()` function. +- The `+` sign used to add new layers must be placed at the end of the + line containing the _previous_ layer. If, instead, the `+` sign is + added at the beginning of the line containing the new layer, + `ggplot2` will not add the new layer and will return an error + message.
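The difference between a global mapping and a mapping set for a single geom (the first two notes) can be sketched as follows. This is only a minimal illustration under stated assumptions: the small `toy` data frame stands in for `rna`, and the names `toy`, `p_global` and `p_local` are hypothetical and not part of the lesson.

```r
library("ggplot2")

# Hypothetical toy data standing in for `rna` (values are made up)
toy <- data.frame(
  expression_log = c(1.2, 3.4, 5.6, 7.8, 2.1, 6.3),
  sex = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Global mapping: `x` is set in ggplot() and is seen by every layer
p_global <- ggplot(data = toy, mapping = aes(x = expression_log)) +
  geom_histogram(bins = 3)

# Layer-specific mapping: `fill` is set inside geom_histogram() only,
# so other layers added to this plot would not inherit it
p_local <- ggplot(data = toy, mapping = aes(x = expression_log)) +
  geom_histogram(mapping = aes(fill = sex), bins = 3)
```

Printing `p_global` or `p_local` draws the corresponding plot; only the second one colors the bars by `sex`.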
+ +```{r, eval=FALSE} +# This is the correct syntax for adding layers +rna_plot + + geom_histogram() + +# This will not add the new layer and will return an error message +rna_plot + + geom_histogram() +``` + +## Building your plots iteratively + +We will now draw a scatter plot with two continuous variables and the +`geom_point()` function. This graph will represent the log2 fold changes +of expression comparing time 8 versus time 0, and time 4 versus time 0. +To this end, we first need to compute the means of the log-transformed +expression values by gene and time, then the log fold changes by +subtracting the mean log expression at time 0 from the mean log +expression at time 8 and at time 4. Note that we also include here the gene +biotype that we will use later on to represent the genes. We will save +the fold changes in a new data frame called `rna_fc`. + +```{r rna_fc, cache=FALSE, purl=TRUE} +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) + +``` + +We can then build a ggplot with the newly created dataset `rna_fc`. +Building plots with `ggplot2` is typically an iterative process. We +start by defining the dataset we'll use, lay out the axes, and choose a +geom: + +```{r create-ggplot-object, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point() +``` + +Then, we start modifying this plot to extract more information from it.
+For instance, we can add transparency (`alpha`) to avoid overplotting: + +```{r adding-transparency, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3) +``` + +We can also add colors for all the points: + +```{r adding-colors, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") +``` + +Or to color each gene in the plot differently, you could use a vector as +an input to the argument **color**. `ggplot2` will provide a different +color corresponding to different values in the vector. Here is an +example where we color with `gene_biotype`: + +```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) + +``` + +We can also specify the colors directly inside the mapping provided in +the `ggplot()` function. This will be seen by any geom layers and the +mapping will be determined by the x- and y-axis set up in `aes()`. + +```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) +``` + +Finally, we could also add a diagonal line with the `geom_abline()` +function: + +```{r adding-diag, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +Notice that we can change the geom layer from `geom_point` to +`geom_jitter` and colors will still be determined by `gene_biotype`. 
+ +```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_jitter(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +```{r, echo=FALSE, message=FALSE} +library("hexbin") +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Scatter plots can be useful exploratory tools for small datasets. For +data sets with large numbers of observations, such as the `rna_fc` +data set, overplotting of points can be a limitation of scatter plots. +One strategy for handling such settings is to use hexagonal binning of +observations. The plot space is tessellated into hexagons. Each +hexagon is assigned a color based on the number of observations that +fall within its boundaries. + +- To use hexagonal binning in `ggplot2`, first install the R package + `hexbin` from CRAN and load it. + +- Then use the `geom_hex()` function to produce the hexbin figure. + +- What are the relative strengths and weaknesses of a hexagonal bin + plot compared to a scatter plot? Examine the above scatter plot + and compare it with the hexagonal bin plot that you created. + +::::::::::::::: solution + +## Solution + +```{r, eval=FALSE, purl=TRUE} +install.packages("hexbin") +``` + +```{r, purl=TRUE} +library("hexbin") + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_hex() + + geom_abline(intercept = 0) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a scatter plot of `expression_log` +over `sample` from the `rna` dataset with the time showing in +different colors. Is this a good way to show this type of data? 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Boxplot + +We can use boxplots to visualize the distribution of gene expression values +within each sample: + +```{r boxplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot() +``` + +By adding points to the boxplot, we can have a better idea of the number of +measurements and of their distribution: + +```{r boxplot-with-points, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Notice that the boxplot layer is in front of the jitter layer. What do +you need to change in the code to put the boxplot below the points? + +::::::::::::::: solution + +## Solution + +We should switch the order of these two geoms: + +```{r boxplot-with-points2, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot(alpha = 0) + + geom_jitter(alpha = 0.2, color = "tomato") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You may notice that the values on the x-axis are still not properly +readable. Let's change the orientation of the labels and adjust them +vertically and horizontally so they don't overlap.
You can use a +90-degree angle, or experiment to find the appropriate angle for +diagonally oriented labels: + +```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Add color to the data points on your boxplot according to the duration +of the infection (`time`). + +_Hint:_ Check the class for `time`. Consider changing the class of +`time` from integer to factor directly in the ggplot mapping. Why does +this change how R makes the graph? + +::::::::::::::: solution + +## Solution + +```{r boxplot-color-time, cache=FALSE, purl=TRUE} +# time as integer +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = time)) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + +# time as factor +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Boxplots are useful summaries, but hide the _shape_ of the +distribution. For example, if the distribution is bimodal, we would +not see it in a boxplot. An alternative to the boxplot is the violin +plot, where the shape (of the density of points) is drawn. + +- Replace the box plot with a violin plot; see `geom_violin()`. Fill + in the violins according to the time with the argument `fill`. 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +- Modify the violin plot to fill in the violins by `sex`. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Line plots + +Let's calculate the mean expression per duration of the infection for +the 10 genes having the highest log fold changes comparing time 8 versus +time 0. First, we need to select the genes and create a subset of `rna` +called `sub_rna` containing the 10 selected genes, then we need to group +the data and calculate the mean gene expression within each group: + +```{r, purl=TRUE} +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) + +genes_selected <- rna_fc$gene[1:10] + +sub_rna <- rna %>% + filter(gene %in% genes_selected) + +mean_exp_by_time <- sub_rna %>% + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time +``` + +We can build the line plot with duration of the infection on the x-axis +and the mean expression on the y-axis: + +```{r first-time-series, purl=TRUE} +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + + geom_line() +``` + +Unfortunately, this does not work because we plotted data for all the +genes together. 
We need to tell ggplot to draw a line for each gene by +modifying the aesthetic function to include `group = gene`: + +```{r time-series-by-gene, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + + geom_line() +``` + +We will be able to distinguish genes in the plot if we add colors (using +`color` also automatically groups the data): + +```{r time-series-with-colors, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() +``` + +## Faceting + +`ggplot2` has a special technique called _faceting_ that allows the user +to split one plot into multiple (sub) plots based on a factor included +in the dataset. These different subplots inherit the same properties +(axes limits, ticks, ...) to facilitate their direct comparison. We will +use it to make a line plot across time for each gene: + +```{r first-facet, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) +``` + +Here both x- and y-axis have the same scale for all the subplots. You +can change this default behavior by modifying `scales` in order to allow +a free scale for the y-axis: + +```{r first-facet-scales, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Now we would like to split the line in each plot by the sex of the mice. 
+To do that we need to calculate the mean expression in the data frame +grouped by `gene`, `time`, and `sex`: + +```{r data-facet-by-gene-and-sex, purl=TRUE} +mean_exp_by_time_sex <- sub_rna %>% + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time_sex +``` + +We can now make the faceted plot by splitting further by sex using +`color` (within a single plot): + +```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Usually plots with white background look more readable when printed. We +can set the background to white using the function `theme_bw()`. +Additionally, we can remove the grid: + +```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a plot that depicts how the +average expression of each chromosome changes through the duration of +infection. + +::::::::::::::: solution + +## Solution + +```{r mean-exp-chromosome-time-series, purl=TRUE} +mean_exp_by_chromosome <- rna %>% + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + + geom_line() + + facet_wrap(~ chromosome_name, scales = "free_y") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The `facet_wrap` geometry extracts plots into an arbitrary number of +dimensions to allow them to cleanly fit on one page. 
On the other hand, +the `facet_grid` geometry allows you to explicitly specify how you want +your plots to be arranged via formula notation (`rows ~ columns`; a `.` +can be used as a placeholder that indicates only one row or column). + +Let's modify the previous plot to compare how the mean gene expression +of males and females has changed through time: + +```{r mean-exp-time-facet-sex-rows, purl=TRUE} +# One column, facet by rows +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(sex ~ .) +``` + +```{r mean-exp-time-facet-sex-columns, purl=TRUE} +# One row, facet by column +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(. ~ sex) +``` + +## `ggplot2` themes + +In addition to `theme_bw()`, which changes the plot background to white, +`ggplot2` comes with several other themes which can be useful to quickly +change the look of your visualization. The complete list of themes is +available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` and `theme_light()` are popular, and `theme_void()` +can be useful as a starting point to create a new hand-crafted theme. + +The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +package provides a wide variety of options (including an Excel 2003 +theme). The `ggplot2` extensions website provides a list of +packages that extend the capabilities of `ggplot2`, including additional +themes. + +## Customisation + +Let's come back to the faceted plot of mean expression by time and gene, +colored by sex. + +Take a look at the `ggplot2` cheat sheet, +and think of ways you could improve the plot.
+ +Now, we can change names of axes to something more informative than +'time' and 'mean_exp', and add a title to the figure: + +```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") +``` + +The axes have more informative names, but their readability can be +improved by increasing the font size: + +```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16)) +``` + +Note that it is also possible to change the fonts of your plots. If you +are on Windows, you may have to install the . + +We can further customize the color of x- and y-axis text, the color of +the grid, etc. We can also for example move the legend to the top by +setting `legend.position` to `"top"`. 
+ +```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + panel.grid = element_line(colour = "lightsteelblue1"), + legend.position = "top") +``` + +If you like the changes you created better than the default theme, you +can save them as an object to be able to easily apply them to other +plots you may create. Here is an example with the histogram we have +previously created. + +```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", + size = 12), + axis.text.y = element_text(colour = "royalblue4", + size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour = "lightsteelblue1")) + +ggplot(rna, aes(x = expression_log)) + + geom_histogram(bins = 20) + + blue_theme +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +With all of this information in hand, please take another five minutes +to either improve one of the plots generated in this exercise or +create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet +for inspiration. Here are some ideas: + +- See if you can change the thickness of the lines. +- Can you find a way to change the name of the legend? What about + its labels?
(hint: look for a ggplot function starting with + `scale_`) +- Try using a different color palette or manually specifying the + colors for the lines (see + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + +::::::::::::::: solution + +## Solution + +For example, based on this plot: + +```{r, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +We can customize it the following ways: + +```{r, purl=TRUE} +# change the thickness of the lines +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line(size=1.5) + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + +# change the name of the legend and the labels +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_discrete(name = "Gender", labels = c("F", "M")) + +# using a different color palette +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + +# manually specifying the colors +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Composing plots + +Faceting is a 
great tool for splitting one plot into multiple subplots, +but sometimes you may want to produce a single figure that contains +multiple independent plots, i.e. plots that are based on different +variables or even different data frames. + +Let's start by creating the two plots that we want to arrange next to +each other: + +The first graph counts the number of unique genes per chromosome. We +first need to reorder the levels of `chromosome_name` and filter the +unique genes per chromosome. We also change the scale of the y-axis to a +log10 scale for better readability. + +```{r sub1, purl=TRUE} +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + + scale_y_log10() + +count_gene_chromosome +``` + +Below, we also remove the legend altogether by setting the +`legend.position` to `"none"`. + +```{r sub2, purl=TRUE} +exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), + color=sex)) + + geom_boxplot(alpha = 0) + + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") + +exp_boxplot_sex +``` + +The [**patchwork**](https://github.com/thomasp85/patchwork) package +provides an elegant approach to combining figures using the `+` to +arrange figures (typically side by side). More specifically the `|` +explicitly arranges them side by side and `/` stacks them on top of each +other. 
+ +```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("patchwork") +``` + +```{r patchworkplot1, purl=TRUE} +library("patchwork") +count_gene_chromosome + exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex +``` + +```{r patchwork2, purl=TRUE} +count_gene_chromosome / exp_boxplot_sex +``` + +We can further control the layout of the final composition with +`plot_layout()` to create more complex layouts: + +```{r patchwork3, purl=TRUE} +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +``` + +```{r patchwork4, purl=TRUE} +count_gene_chromosome + + (count_gene_chromosome + exp_boxplot_sex) + + exp_boxplot_sex + + plot_layout(ncol = 1) +``` + +The last plot can also be created using the `|` and `/` composers: + +```{r patchwork5, purl=TRUE} +count_gene_chromosome / + (count_gene_chromosome | exp_boxplot_sex) / + exp_boxplot_sex +``` + +Learn more about `patchwork` on its +[webpage](https://patchwork.data-imaginist.com/) or in this +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). + +Another option is the **`gridExtra`** package, which allows you to combine +separate ggplots into a single figure using `grid.arrange()`: + +```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("gridExtra") +``` + +```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +``` + +In addition to the `ncol` and `nrow` arguments, used to make simple +arrangements, there are tools for constructing more complex +layouts. + +## Exporting plots + +After creating your plot, you can save it to a file in your favorite +format. The Export tab in the **Plot** pane in RStudio will save your +plots at low resolution, which will not be accepted by many journals and +will not scale well for posters.
+ +Instead, use the `ggsave()` function, which allows you to easily change the +dimensions and resolution of your plot by adjusting the appropriate +arguments (`width`, `height` and `dpi`). + +Make sure you have the `fig_output/` folder in your working directory. + +```{r ggsave-example, eval=FALSE, purl=TRUE} +my_plot <- ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color = guide_legend(title = "Gender")) + + theme_bw() + + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour = "lightsteelblue1"), + legend.position = "top") +ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, + height = 10) + +# This also works for grid.arrange() plots +combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) +``` + +Note: Because text is drawn at a fixed point size, the parameters `width` +and `height` also determine how large the text appears relative to the +saved plot. + +```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} +### Final plotting challenge: +## With all of this information in hand, please take another five +## minutes to either improve one of the plots generated in this +## exercise or create a beautiful graph of your own. Use the RStudio +## ggplot2 cheat sheet for inspiration: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +``` + +## Other packages for visualisation + +`ggplot2` is a very powerful package that fits very nicely in our _tidy +data_ and _tidy tools_ pipeline. There are other visualization packages +in R that shouldn't be ignored.
+ +### Base graphics + +The default graphics system that comes with R, often called _base R +graphics_, is simple and fast. It is based on the _painter's or canvas +model_, where different outputs are directly overlaid on top of each +other (see figure @ref(fig:paintermodel)). This is a fundamental +difference from `ggplot2` (and from `lattice`, described below), which +return dedicated objects that are rendered on screen or in a file, and +that can even be updated. + +```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} +par(mfrow = c(1, 3)) +plot(1:20, main = "First layer, produced with plot(1:20)") + +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +abline(h = 10, col = "red") + +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +abline(h = 10, col = "red") +rect(5, 5, 15, 15, lwd = 3) +``` + +Another main difference is that base graphics' plotting functions try to +do _the right thing_ based on their input type, i.e. they will adapt +their behaviour based on the class of their input. This is again very +different from what we have in `ggplot2`, which only accepts data frames +as input, and which requires plots to be constructed bit by bit. + +```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) of vectors (left) or matrices (right)."} +par(mfrow = c(2, 2)) +boxplot(rnorm(100), + main = "Boxplot of rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot of matrix(rnorm(100), ncol = 10)") +hist(rnorm(100)) +hist(matrix(rnorm(100), ncol = 10)) +``` + +The out-of-the-box approach in base graphics can be very efficient for +simple, standard figures, that can be produced very quickly with a +single line of code and a single function such as `plot`, or `hist`, or +`boxplot`, ...
The defaults are however not always the most appealing +and tuning of figures, especially when they become more complex (for +example to produce facets), can become lengthy and cumbersome. + +### The lattice package + +The **`lattice`** package is similar to `ggplot2` in that it uses +data frames as input, returns graphical objects and supports faceting. +`lattice` however isn't based on the grammar of graphics and has a more +convoluted interface. + +A good reference for the `lattice` package is @latticebook. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From bbb2b638944927ec90144ae64b70bc771d86d7f3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:11 +0900 Subject: [PATCH 030/334] New translations 40-visualization.md (Portuguese) --- locale/pt/episodes/40-visualization.Rmd | 1103 +++++++++++++++++++++++ 1 file changed, 1103 insertions(+) create mode 100644 locale/pt/episodes/40-visualization.Rmd diff --git a/locale/pt/episodes/40-visualization.Rmd b/locale/pt/episodes/40-visualization.Rmd new file mode 100644 index 000000000..b1ab2920c --- /dev/null +++ b/locale/pt/episodes/40-visualization.Rmd @@ -0,0 +1,1103 @@ +--- +source: Rmd +title: Data visualization +teaching: 60 +exercises: 60 +--- + +```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Produce scatter plots, boxplots, line plots, etc. using ggplot. +- Set universal plot settings. +- Describe what faceting is and apply faceting in ggplot. +- Modify the aesthetics of an existing ggplot plot (including axis labels and color). +- Build complex and customized plots from data in a data frame.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r vis_setup, echo=FALSE} +rna <- read.csv("data/rnaseq.csv") +``` + +## Data Visualization + +We start by loading the required packages. **`ggplot2`** is included in +the **`tidyverse`** package. + +```{r load-package, message=FALSE, purl=TRUE} +library("tidyverse") +``` + +If not still in the workspace, load the data we saved in the previous +lesson. + +```{r load-data, eval=FALSE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +The Data Visualization Cheat +Sheet +will cover the basics and more advanced features of `ggplot2` and will +help, in addition to serve as a reminder, getting an overview of the +many data representations available in the package. The following video +tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen +are also very instructive. + +## Plotting with `ggplot2` + +`ggplot2` is a plotting package that makes it simple to create complex +plots from data in a data frame. It provides a more programmatic +interface for specifying what variables to plot, how they are displayed, +and general visual properties. The theoretical foundation that supports +the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this +approach, we only need minimal changes if the underlying data change or +if we decide to change from a bar plot to a scatterplot. This helps in +creating publication quality plots with minimal amounts of adjustments +and tweaking. + +There is a book about `ggplot2` (@ggplot2book) that provides a good +overview, but it is outdated. The 3rd edition is in preparation and will +be [freely available online](https://ggplot2-book.org/). 
The `ggplot2` +webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. + +`ggplot2` works best with data in the 'long' format, i.e., a column for +every dimension, and a row for every observation. Well-structured data +will save you lots of time when making figures with `ggplot2`. + +ggplot graphics are built step by step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and +customization of plots. + +> The idea behind the Grammar of Graphics is that you can build every +> graph from the same 3 components: (1) a data set, (2) a coordinate system, +> and (3) geoms, i.e. visual marks that represent data points [^three_comp_ggplot2] + +[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +To build a ggplot, we will use the following basic template that can be +used for different types of plots: + +``` +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +``` + +- use the `ggplot()` function and bind the plot to a specific **data + frame** using the `data` argument + +```{r, eval=FALSE} +ggplot(data = rna) +``` + +- define a **mapping** (using the aesthetic (`aes`) function), by + selecting the variables to be plotted and specifying how to present + them in the graph, e.g. as x/y positions or characteristics such as + size, shape, color, etc. + +```{r, eval=FALSE} +ggplot(data = rna, mapping = aes(x = expression)) +``` + +- add '**geoms**' - geometries, or graphical representations of the + data in the plot (points, lines, bars). `ggplot2` offers many + different geoms; we will use some common ones today, including: + + ``` + * `geom_point()` for scatter plots, dot plots, etc. + * `geom_histogram()` for histograms + * `geom_boxplot()` for, well, boxplots! + * `geom_line()` for trend lines, time series, etc.
+ ``` + +To add a geom(etry) to the plot use the `+` operator. Let's use +`geom_histogram()` first: + +```{r first-ggplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() +``` + +The `+` in the `ggplot2` package is particularly useful because it +allows you to modify existing `ggplot` objects. This means you can +easily set up plot templates and conveniently explore different types of +plots, so the above plot can also be generated with code like this: + +```{r, eval=FALSE, purl=TRUE} +# Assign plot to a variable +rna_plot <- ggplot(data = rna, + mapping = aes(x = expression)) + +# Draw the plot +rna_plot + geom_histogram() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +You have probably noticed an automatic message that appears when +drawing the histogram: + +```{r, echo=FALSE, fig.show="hide"} +ggplot(rna, aes(x = expression)) + + geom_histogram() +``` + +Change the arguments `bins` or `binwidth` of `geom_histogram()` to +change the number or width of the bins. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +# change bins +ggplot(rna, aes(x = expression)) + + geom_histogram(bins = 15) + +# change binwidth +ggplot(rna, aes(x = expression)) + + geom_histogram(binwidth = 2000) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can observe here that the data are skewed to the right. We can apply +log2 transformation to have a more symmetric distribution. Note that we +add here a small constant value (`+1`) to avoid having `-Inf` values +returned for expression values equal to 0. + +```{r log-transfo, cache=FALSE, purl=TRUE} +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) +``` + +If we now draw the histogram of the log2-transformed expressions, the +distribution is indeed closer to a normal distribution. 
+ +```{r second-ggplot, cache=FALSE, purl=TRUE} +ggplot(rna, aes(x = expression_log)) + geom_histogram() +``` + +From now on we will work on the log-transformed expression values. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Another way to visualize this transformation is to consider the scale +of the observations. For example, it may be worth changing the scale +of the axis to better distribute the observations in the space of the +plot. Changing the scale of the axes is done similarly to +adding/modifying other components (i.e., by incrementally adding +commands). Try making this modification: + +- Represent the un-transformed expression on the log10 scale; see + `scale_x_log10()`. Compare it with the previous graph. Why do you + now have warning messages appearing? + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE, echo=TRUE} +ggplot(data = rna,mapping = aes(x = expression))+ + geom_histogram() + + scale_x_log10() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +**Notes** + +- Anything you put in the `ggplot()` function can be seen by any geom + layers that you add (i.e., these are global plot settings). This + includes the x- and y-axis mapping you set up in `aes()`. +- You can also specify mappings for a given geom independently of the + mappings defined globally in the `ggplot()` function. +- The `+` sign used to add new layers must be placed at the end of the + line containing the _previous_ layer. If, instead, the `+` sign is + added at the beginning of the line containing the new layer, + `ggplot2` will not add the new layer and will return an error + message. 
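The rule about where the `+` sign goes follows from how R reads code: a line that already forms a complete expression is treated as finished, so a `+` at the start of the next line begins a separate statement. The following is a minimal sketch in base R, not part of the lesson itself. The names `a` and `b` are hypothetical placeholders standing in for a plot object and a layer, and `count_expressions()` is a helper introduced here only for illustration. The code is parsed but never evaluated.

```r
# Count how many separate expressions R finds in a piece of code.
# `a` and `b` are placeholders; the code is only parsed, not run.
count_expressions <- function(code) length(parse(text = code))

trailing_plus <- "a +\n  b"   # `+` ends the first line
leading_plus  <- "a\n  + b"   # `+` starts the second line

n_trailing <- count_expressions(trailing_plus)  # parsed as a single `a + b`
n_leading  <- count_expressions(leading_plus)   # parsed as `a`, then `+b`

print(c(trailing = n_trailing, leading = n_leading))
```

With a `ggplot` object in place of `a`, the second form evaluates the plot on its own, without the new layer, and then fails on the stray `+ b`.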
+ +```{r, eval=FALSE} +# This is the correct syntax for adding layers +rna_plot + + geom_histogram() + +# This will not add the new layer and will return an error message +rna_plot + + geom_histogram() +``` + +## Building your plots iteratively + +We will now draw a scatter plot with two continuous variables and the +`geom_point()` function. This graph will represent the log2 fold changes +of expression comparing time 8 versus time 0, and time 4 versus time 0. +To this end, we first need to compute the means of the log-transformed +expression values by gene and time, then the log fold changes by +subtracting the mean log expressions between time 8 and time 0 and +between time 4 and time 0. Note that we also include here the gene +biotype that we will use later on to represent the genes. We will save +the fold changes in a new data frame called `rna_fc.` + +```{r rna_fc, cache=FALSE, purl=TRUE} +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) + +``` + +We can then build a ggplot with the newly created dataset `rna_fc`. +Building plots with `ggplot2` is typically an iterative process. We +start by defining the dataset we'll use, lay out the axes, and choose a +geom: + +```{r create-ggplot-object, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point() +``` + +Then, we start modifying this plot to extract more information from it. 
+For instance, we can add transparency (`alpha`) to avoid overplotting: + +```{r adding-transparency, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3) +``` + +We can also add colors for all the points: + +```{r adding-colors, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") +``` + +Or to color each gene in the plot differently, you could use a vector as +an input to the argument **color**. `ggplot2` will provide a different +color corresponding to different values in the vector. Here is an +example where we color with `gene_biotype`: + +```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) + +``` + +We can also specify the colors directly inside the mapping provided in +the `ggplot()` function. This will be seen by any geom layers and the +mapping will be determined by the x- and y-axis set up in `aes()`. + +```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) +``` + +Finally, we could also add a diagonal line with the `geom_abline()` +function: + +```{r adding-diag, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +Notice that we can change the geom layer from `geom_point` to +`geom_jitter` and colors will still be determined by `gene_biotype`. 
+ +```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_jitter(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +```{r, echo=FALSE, message=FALSE} +library("hexbin") +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Scatter plots can be useful exploratory tools for small datasets. For +data sets with large numbers of observations, such as the `rna_fc` +data set, overplotting of points can be a limitation of scatter plots. +One strategy for handling such settings is to use hexagonal binning of +observations. The plot space is tessellated into hexagons. Each +hexagon is assigned a color based on the number of observations that +fall within its boundaries. + +- To use hexagonal binning in `ggplot2`, first install the R package + `hexbin` from CRAN and load it. + +- Then use the `geom_hex()` function to produce the hexbin figure. + +- What are the relative strengths and weaknesses of a hexagonal bin + plot compared to a scatter plot? Examine the above scatter plot + and compare it with the hexagonal bin plot that you created. + +::::::::::::::: solution + +## Solution + +```{r, eval=FALSE, purl=TRUE} +install.packages("hexbin") +``` + +```{r, purl=TRUE} +library("hexbin") + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_hex() + + geom_abline(intercept = 0) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a scatter plot of `expression_log` +over `sample` from the `rna` dataset with the time showing in +different colors. Is this a good way to show this type of data? 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Boxplot + +We can use boxplots to visualize the distribution of gene expressions +within each sample: + +```{r boxplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot() +``` + +By adding points to boxplot, we can have a better idea of the number of +measurements and of their distribution: + +```{r boxplot-with-points, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Note how the boxplot layer is in front of the jitter layer? What do +you need to change in the code to put the boxplot below the points? + +::::::::::::::: solution + +## Solution + +We should switch the order of these two geoms: + +```{r boxplot-with-points2, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot(alpha = 0) + + geom_jitter(alpha = 0.2, color = "tomato") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You may notice that the values on the x-axis are still not properly +readable. Let's change the orientation of the labels and adjust them +vertically and horizontally so they don't overlap. 
You can use a +90-degree angle, or experiment to find the appropriate angle for +diagonally oriented labels: + +```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Add color to the data points on your boxplot according to the duration +of the infection (`time`). + +_Hint:_ Check the class for `time`. Consider changing the class of +`time` from integer to factor directly in the ggplot mapping. Why does +this change how R makes the graph? + +::::::::::::::: solution + +## Solution + +```{r boxplot-color-time, cache=FALSE, purl=TRUE} +# time as integer +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = time)) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + +# time as factor +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Boxplots are useful summaries, but hide the _shape_ of the +distribution. For example, if the distribution is bimodal, we would +not see it in a boxplot. An alternative to the boxplot is the violin +plot, where the shape (of the density of points) is drawn. + +- Replace the box plot with a violin plot; see `geom_violin()`. Fill + in the violins according to the time with the argument `fill`. 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +- Modify the violin plot to fill in the violins by `sex`. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Line plots + +Let's calculate the mean expression per duration of the infection for +the 10 genes having the highest log fold changes comparing time 8 versus +time 0. First, we need to select the genes and create a subset of `rna` +called `sub_rna` containing the 10 selected genes, then we need to group +the data and calculate the mean gene expression within each group: + +```{r, purl=TRUE} +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) + +genes_selected <- rna_fc$gene[1:10] + +sub_rna <- rna %>% + filter(gene %in% genes_selected) + +mean_exp_by_time <- sub_rna %>% + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time +``` + +We can build the line plot with duration of the infection on the x-axis +and the mean expression on the y-axis: + +```{r first-time-series, purl=TRUE} +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + + geom_line() +``` + +Unfortunately, this does not work because we plotted data for all the +genes together. 
We need to tell ggplot to draw a line for each gene by +modifying the aesthetic function to include `group = gene`: + +```{r time-series-by-gene, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + + geom_line() +``` + +We will be able to distinguish genes in the plot if we add colors (using +`color` also automatically groups the data): + +```{r time-series-with-colors, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() +``` + +## Faceting + +`ggplot2` has a special technique called _faceting_ that allows the user +to split one plot into multiple (sub) plots based on a factor included +in the dataset. These different subplots inherit the same properties +(axes limits, ticks, ...) to facilitate their direct comparison. We will +use it to make a line plot across time for each gene: + +```{r first-facet, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) +``` + +Here both x- and y-axis have the same scale for all the subplots. You +can change this default behavior by modifying `scales` in order to allow +a free scale for the y-axis: + +```{r first-facet-scales, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Now we would like to split the line in each plot by the sex of the mice. 
+To do that we need to calculate the mean expression in the data frame +grouped by `gene`, `time`, and `sex`: + +```{r data-facet-by-gene-and-sex, purl=TRUE} +mean_exp_by_time_sex <- sub_rna %>% + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time_sex +``` + +We can now make the faceted plot by splitting further by sex using +`color` (within a single plot): + +```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Usually plots with white background look more readable when printed. We +can set the background to white using the function `theme_bw()`. +Additionally, we can remove the grid: + +```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a plot that depicts how the +average expression of each chromosome changes through the duration of +infection. + +::::::::::::::: solution + +## Solution + +```{r mean-exp-chromosome-time-series, purl=TRUE} +mean_exp_by_chromosome <- rna %>% + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + + geom_line() + + facet_wrap(~ chromosome_name, scales = "free_y") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The `facet_wrap` geometry extracts plots into an arbitrary number of +dimensions to allow them to cleanly fit on one page. 
On the other hand, +the `facet_grid` function allows you to explicitly specify how you want +your plots to be arranged via formula notation (`rows ~ columns`; a `.` +can be used as a placeholder that indicates only one row or column). + +Let's modify the previous plot to compare how the mean gene expression +of males and females has changed through time: + +```{r mean-exp-time-facet-sex-rows, purl=TRUE} +# One column, facet by rows +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(sex ~ .) +``` + +```{r mean-exp-time-facet-sex-columns, purl=TRUE} +# One row, facet by column +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(. ~ sex) +``` + +## `ggplot2` themes + +In addition to `theme_bw()`, which changes the plot background to white, +`ggplot2` comes with several other themes which can be useful to quickly +change the look of your visualization. The complete list of themes is +available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` and `theme_light()` are popular, and `theme_void()` +can be useful as a starting point to create a new hand-crafted theme. + +The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +package provides a wide variety of options (including an Excel 2003 +theme). The ggplot2 extensions website provides a list of +packages that extend the capabilities of `ggplot2`, including additional +themes. + +## Customisation + +Let's come back to the faceted plot of mean expression by time and gene, +colored by sex. + +Take a look at the ggplot2 cheat sheet, +and think of ways you could improve the plot.
+ +Now, we can change names of axes to something more informative than +'time' and 'mean_exp', and add a title to the figure: + +```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") +``` + +The axes have more informative names, but their readability can be +improved by increasing the font size: + +```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16)) +``` + +Note that it is also possible to change the fonts of your plots. If you +are on Windows, you may have to install the . + +We can further customize the color of x- and y-axis text, the color of +the grid, etc. We can also for example move the legend to the top by +setting `legend.position` to `"top"`. 
+ +```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +``` + +If you like the changes you created better than the default theme, you +can save them as an object to be able to easily apply them to other +plots you may create. Here is an example with the histogram we have +previously created. + +```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", + size = 12), + axis.text.y = element_text(colour = "royalblue4", + size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1")) + +ggplot(rna, aes(x = expression_log)) + + geom_histogram(bins = 20) + + blue_theme +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +With all of this information in hand, please take another five minutes +to either improve one of the plots generated in this exercise or +create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet +for inspiration. Here are some ideas: + +- See if you can change the thickness of the lines. +- Can you find a way to change the name of the legend? What about + its labels?
(hint: look for a ggplot function starting with + `scale_`) +- Try using a different color palette or manually specifying the + colors for the lines (see + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + +::::::::::::::: solution + +## Solution + +For example, based on this plot: + +```{r, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +We can customize it the following ways: + +```{r, purl=TRUE} +# change the thickness of the lines +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line(size=1.5) + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + +# change the name of the legend and the labels +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_discrete(name = "Gender", labels = c("F", "M")) + +# using a different color palette +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + +# manually specifying the colors +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Composing plots + +Faceting is a 
great tool for splitting one plot into multiple subplots, +but sometimes you may want to produce a single figure that contains +multiple independent plots, i.e. plots that are based on different +variables or even different data frames. + +Let's start by creating the two plots that we want to arrange next to +each other: + +The first graph counts the number of unique genes per chromosome. We +first need to reorder the levels of `chromosome_name` and filter the +unique genes per chromosome. We also change the scale of the y-axis to a +log10 scale for better readability. + +```{r sub1, purl=TRUE} +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + + scale_y_log10() + +count_gene_chromosome +``` + +Below, we also remove the legend altogether by setting the +`legend.position` to `"none"`. + +```{r sub2, purl=TRUE} +exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), + color=sex)) + + geom_boxplot(alpha = 0) + + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") + +exp_boxplot_sex +``` + +The [**patchwork**](https://github.com/thomasp85/patchwork) package +provides an elegant approach to combining figures using the `+` to +arrange figures (typically side by side). More specifically the `|` +explicitly arranges them side by side and `/` stacks them on top of each +other. 
+ +```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("patchwork") +``` + +```{r patchworkplot1, purl=TRUE} +library("patchwork") +count_gene_chromosome + exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex +``` + +```{r patchwork2, purl=TRUE} +count_gene_chromosome / exp_boxplot_sex +``` + +We can further control the layout of the final composition with +`plot_layout()` to create more complex layouts: + +```{r patchwork3, purl=TRUE} +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +``` + +```{r patchwork4, purl=TRUE} +count_gene_chromosome + + (count_gene_chromosome + exp_boxplot_sex) + + exp_boxplot_sex + + plot_layout(ncol = 1) +``` + +The last plot can also be created using the `|` and `/` composers: + +```{r patchwork5, purl=TRUE} +count_gene_chromosome / + (count_gene_chromosome | exp_boxplot_sex) / + exp_boxplot_sex +``` + +Learn more about `patchwork` on its +[webpage](https://patchwork.data-imaginist.com/) or in this +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). + +Another option is the **`gridExtra`** package, which allows you to combine +separate ggplots into a single figure using `grid.arrange()`: + +```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("gridExtra") +``` + +```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +``` + +In addition to the `ncol` and `nrow` arguments, used to make simple +arrangements, there are tools for constructing more complex +layouts. + +## Exporting plots + +After creating your plot, you can save it to a file in your favorite +format. The Export tab in the **Plot** pane in RStudio will save your +plots at low resolution, which will not be accepted by many journals and +will not scale well for posters.
+ +Instead, use the `ggsave()` function, which allows you to easily change the +dimensions and resolution of your plot by adjusting the appropriate +arguments (`width`, `height` and `dpi`). + +Make sure you have the `fig_output/` folder in your working directory. + +```{r ggsave-example, eval=FALSE, purl=TRUE} +my_plot <- ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + + theme_bw() + + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, + height = 10) + +# This also works for grid.arrange() plots +combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) +``` + +Note: The parameters `width` and `height` also determine the font size +in the saved plot. + +```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} +### Final plotting challenge: +## With all of this information in hand, please take another five +## minutes to either improve one of the plots generated in this +## exercise or create a beautiful graph of your own. Use the RStudio +## ggplot2 cheat sheet for inspiration: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +``` + +## Other packages for visualisation + +`ggplot2` is a very powerful package that fits very nicely in our _tidy +data_ and _tidy tools_ pipeline. There are other visualization packages +in R that shouldn't be ignored.
+ +### Base graphics + +The default graphics system that comes with R, often called _base R +graphics_, is simple and fast. It is based on the _painter's or canvas +model_, where different outputs are directly overlaid on top of each +other (see figure @ref(fig:paintermodel)). This is a fundamental +difference from `ggplot2` (and from `lattice`, described below), which +return dedicated objects that are rendered on screen or in a file, and +that can even be updated. + +```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} +par(mfrow = c(1, 3)) +plot(1:20, main = "First layer, produced with plot(1:20)") + +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +abline(h = 10, col = "red") + +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +abline(h = 10, col = "red") +rect(5, 5, 15, 15, lwd = 3) +``` + +Another main difference is that base graphics' plotting functions try to +do _the right thing_ based on their input type, i.e. they will adapt +their behaviour based on the class of their input. This is again very +different from what we have in `ggplot2`, which only accepts dataframes +as input, and which requires plots to be constructed bit by bit. + +```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) of vectors (left) or matrices (right)."} +par(mfrow = c(2, 2)) +boxplot(rnorm(100), + main = "Boxplot of rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot of matrix(rnorm(100), ncol = 10)") +hist(rnorm(100)) +hist(matrix(rnorm(100), ncol = 10)) +``` + +The out-of-the-box approach in base graphics can be very efficient for +simple, standard figures, that can be produced very quickly with a +single line of code and a single function such as `plot`, or `hist`, or +`boxplot`, ... 
The defaults are however not always the most appealing +and tuning of figures, especially when they become more complex (for +example to produce facets), can become lengthy and cumbersome. + +### The lattice package + +The **`lattice`** package is similar to `ggplot2` in that it uses +dataframes as input, returns graphical objects and supports faceting. +`lattice` however isn't based on the grammar of graphics and has a more +convoluted interface. + +A good reference for the `lattice` package is @latticebook. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 2fd432cdea91f53fed69784ba3e1cb6179777989 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:13 +0900 Subject: [PATCH 031/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 1103 +++++++++++++++++++++++ 1 file changed, 1103 insertions(+) create mode 100644 locale/zh/episodes/40-visualization.Rmd diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd new file mode 100644 index 000000000..b1ab2920c --- /dev/null +++ b/locale/zh/episodes/40-visualization.Rmd @@ -0,0 +1,1103 @@ +--- +source: Rmd +title: Data visualization +teaching: 60 +exercises: 60 +--- + +```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Produce scatter plots, boxplots, line plots, etc. using ggplot. +- Set universal plot settings. +- Describe what faceting is and apply faceting in ggplot. +- Modify the aesthetics of an existing ggplot plot (including axis labels and color). +- Build complex and customized plots from data in a data frame. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r vis_setup, echo=FALSE} +rna <- read.csv("data/rnaseq.csv") +``` + +## Data Visualization + +We start by loading the required packages. **`ggplot2`** is included in +the **`tidyverse`** package. + +```{r load-package, message=FALSE, purl=TRUE} +library("tidyverse") +``` + +If it is no longer in the workspace, load the data we saved in the previous +lesson. + +```{r load-data, eval=FALSE, purl=TRUE} +rna <- read.csv("data/rnaseq.csv") +``` + +The Data Visualization Cheat +Sheet +covers the basics and more advanced features of `ggplot2` and, in +addition to serving as a reminder, helps you get an overview of the +many data representations available in the package. The following video +tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen +are also very instructive. + +## Plotting with `ggplot2` + +`ggplot2` is a plotting package that makes it simple to create complex +plots from data in a data frame. It provides a more programmatic +interface for specifying what variables to plot, how they are displayed, +and general visual properties. The theoretical foundation that supports +`ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this +approach, we only need minimal changes if the underlying data change or +if we decide to change from a bar plot to a scatterplot. This helps in +creating publication quality plots with minimal amounts of adjustments +and tweaking. + +There is a book about `ggplot2` (@ggplot2book) that provides a good +overview, but it is outdated. The 3rd edition is in preparation and will +be [freely available online](https://ggplot2-book.org/). 
The `ggplot2` +webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. + +`ggplot2` functions work best with data in the 'long' format, i.e., a column for +every variable, and a row for every observation. Well-structured data +will save you lots of time when making figures with `ggplot2`. + +ggplot graphics are built step by step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and +customization of plots. + +> The idea behind the Grammar of Graphics is that you can build every +> graph from the same 3 components: (1) a data set, (2) a coordinate system, +> and (3) geoms, i.e. visual marks that represent data points [^three_comp_ggplot2] + +[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +To build a ggplot, we will use the following basic template that can be +used for different types of plots: + +``` +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +``` + +- use the `ggplot()` function and bind the plot to a specific **data + frame** using the `data` argument + +```{r, eval=FALSE} +ggplot(data = rna) +``` + +- define a **mapping** (using the aesthetic (`aes`) function), by + selecting the variables to be plotted and specifying how to present + them in the graph, e.g. as x/y positions or characteristics such as + size, shape, color, etc. + +```{r, eval=FALSE} +ggplot(data = rna, mapping = aes(x = expression)) +``` + +- add '**geoms**' - geometries, or graphical representations of the + data in the plot (points, lines, bars). `ggplot2` offers many + different geoms; we will use some common ones today, including: + + ``` + * `geom_point()` for scatter plots, dot plots, etc. + * `geom_histogram()` for histograms + * `geom_boxplot()` for, well, boxplots! + * `geom_line()` for trend lines, time series, etc. 
+ ``` + +To add a geom(etry) to the plot use the `+` operator. Let's use +`geom_histogram()` first: + +```{r first-ggplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, mapping = aes(x = expression)) + + geom_histogram() +``` + +The `+` in the `ggplot2` package is particularly useful because it +allows you to modify existing `ggplot` objects. This means you can +easily set up plot templates and conveniently explore different types of +plots, so the above plot can also be generated with code like this: + +```{r, eval=FALSE, purl=TRUE} +# Assign plot to a variable +rna_plot <- ggplot(data = rna, + mapping = aes(x = expression)) + +# Draw the plot +rna_plot + geom_histogram() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +You have probably noticed an automatic message that appears when +drawing the histogram: + +```{r, echo=FALSE, fig.show="hide"} +ggplot(rna, aes(x = expression)) + + geom_histogram() +``` + +Change the arguments `bins` or `binwidth` of `geom_histogram()` to +change the number or width of the bins. + +::::::::::::::: solution + +## Solution + +```{r, purl=TRUE} +# change bins +ggplot(rna, aes(x = expression)) + + geom_histogram(bins = 15) + +# change binwidth +ggplot(rna, aes(x = expression)) + + geom_histogram(binwidth = 2000) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can observe here that the data are skewed to the right. We can apply +log2 transformation to have a more symmetric distribution. Note that we +add here a small constant value (`+1`) to avoid having `-Inf` values +returned for expression values equal to 0. + +```{r log-transfo, cache=FALSE, purl=TRUE} +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) +``` + +If we now draw the histogram of the log2-transformed expressions, the +distribution is indeed closer to a normal distribution. 
+ +```{r second-ggplot, cache=FALSE, purl=TRUE} +ggplot(rna, aes(x = expression_log)) + geom_histogram() +``` + +From now on we will work on the log-transformed expression values. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Another way to visualize this transformation is to consider the scale +of the observations. For example, it may be worth changing the scale +of the axis to better distribute the observations in the space of the +plot. Changing the scale of the axes is done similarly to +adding/modifying other components (i.e., by incrementally adding +commands). Try making this modification: + +- Represent the un-transformed expression on the log10 scale; see + `scale_x_log10()`. Compare it with the previous graph. Why do you + now have warning messages appearing? + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE, echo=TRUE} +ggplot(data = rna,mapping = aes(x = expression))+ + geom_histogram() + + scale_x_log10() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +**Notes** + +- Anything you put in the `ggplot()` function can be seen by any geom + layers that you add (i.e., these are global plot settings). This + includes the x- and y-axis mapping you set up in `aes()`. +- You can also specify mappings for a given geom independently of the + mappings defined globally in the `ggplot()` function. +- The `+` sign used to add new layers must be placed at the end of the + line containing the _previous_ layer. If, instead, the `+` sign is + added at the beginning of the line containing the new layer, + `ggplot2` will not add the new layer and will return an error + message. 
+ +```{r, eval=FALSE} +# This is the correct syntax for adding layers +rna_plot + + geom_histogram() + +# This will not add the new layer and will return an error message +rna_plot + + geom_histogram() +``` + +## Building your plots iteratively + +We will now draw a scatter plot with two continuous variables and the +`geom_point()` function. This graph will represent the log2 fold changes +of expression comparing time 8 versus time 0, and time 4 versus time 0. +To this end, we first need to compute the means of the log-transformed +expression values by gene and time, then the log fold changes by +subtracting the mean log expressions between time 8 and time 0 and +between time 4 and time 0. Note that we also include here the gene +biotype that we will use later on to represent the genes. We will save +the fold changes in a new data frame called `rna_fc`. + +```{r rna_fc, cache=FALSE, purl=TRUE} +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) + +``` + +We can then build a ggplot with the newly created dataset `rna_fc`. +Building plots with `ggplot2` is typically an iterative process. We +start by defining the dataset we'll use, lay out the axes, and choose a +geom: + +```{r create-ggplot-object, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point() +``` + +Then, we start modifying this plot to extract more information from it. 
+For instance, we can add transparency (`alpha`) to avoid overplotting: + +```{r adding-transparency, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3) +``` + +We can also add colors for all the points: + +```{r adding-colors, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") +``` + +Or to color each gene in the plot differently, you could use a vector as +an input to the argument **color**. `ggplot2` will provide a different +color corresponding to different values in the vector. Here is an +example where we color with `gene_biotype`: + +```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) + +``` + +We can also specify the colors directly inside the mapping provided in +the `ggplot()` function. This will be seen by any geom layers and the +mapping will be determined by the x- and y-axis set up in `aes()`. + +```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) +``` + +Finally, we could also add a diagonal line with the `geom_abline()` +function: + +```{r adding-diag, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_point(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +Notice that we can change the geom layer from `geom_point` to +`geom_jitter` and colors will still be determined by `gene_biotype`. 
+ +```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + + geom_jitter(alpha = 0.3) + + geom_abline(intercept = 0) +``` + +```{r, echo=FALSE, message=FALSE} +library("hexbin") +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Scatter plots can be useful exploratory tools for small datasets. For +data sets with large numbers of observations, such as the `rna_fc` +data set, overplotting of points can be a limitation of scatter plots. +One strategy for handling such settings is to use hexagonal binning of +observations. The plot space is tessellated into hexagons. Each +hexagon is assigned a color based on the number of observations that +fall within its boundaries. + +- To use hexagonal binning in `ggplot2`, first install the R package + `hexbin` from CRAN and load it. + +- Then use the `geom_hex()` function to produce the hexbin figure. + +- What are the relative strengths and weaknesses of a hexagonal bin + plot compared to a scatter plot? Examine the above scatter plot + and compare it with the hexagonal bin plot that you created. + +::::::::::::::: solution + +## Solution + +```{r, eval=FALSE, purl=TRUE} +install.packages("hexbin") +``` + +```{r, purl=TRUE} +library("hexbin") + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_hex() + + geom_abline(intercept = 0) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a scatter plot of `expression_log` +over `sample` from the `rna` dataset with the time showing in +different colors. Is this a good way to show this type of data? 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, purl=TRUE} +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Boxplot + +We can use boxplots to visualize the distribution of gene expressions +within each sample: + +```{r boxplot, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot() +``` + +By adding points to boxplot, we can have a better idea of the number of +measurements and of their distribution: + +```{r boxplot-with-points, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Note how the boxplot layer is in front of the jitter layer? What do +you need to change in the code to put the boxplot below the points? + +::::::::::::::: solution + +## Solution + +We should switch the order of these two geoms: + +```{r boxplot-with-points2, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_boxplot(alpha = 0) + + geom_jitter(alpha = 0.2, color = "tomato") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You may notice that the values on the x-axis are still not properly +readable. Let's change the orientation of the labels and adjust them +vertically and horizontally so they don't overlap. 
You can use a +90-degree angle, or experiment to find the appropriate angle for +diagonally oriented labels: + +```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Add color to the data points on your boxplot according to the duration +of the infection (`time`). + +_Hint:_ Check the class for `time`. Consider changing the class of +`time` from integer to factor directly in the ggplot mapping. Why does +this change how R makes the graph? + +::::::::::::::: solution + +## Solution + +```{r boxplot-color-time, cache=FALSE, purl=TRUE} +# time as integer +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = time)) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + +# time as factor +ggplot(data = rna, + mapping = aes(y = expression_log, + x = sample)) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Boxplots are useful summaries, but hide the _shape_ of the +distribution. For example, if the distribution is bimodal, we would +not see it in a boxplot. An alternative to the boxplot is the violin +plot, where the shape (of the density of points) is drawn. + +- Replace the box plot with a violin plot; see `geom_violin()`. Fill + in the violins according to the time with the argument `fill`. 
+ +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +- Modify the violin plot to fill in the violins by `sex`. + +::::::::::::::: solution + +## Solution + +```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Line plots + +Let's calculate the mean expression per duration of the infection for +the 10 genes having the highest log fold changes comparing time 8 versus +time 0. First, we need to select the genes and create a subset of `rna` +called `sub_rna` containing the 10 selected genes, then we need to group +the data and calculate the mean gene expression within each group: + +```{r, purl=TRUE} +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) + +genes_selected <- rna_fc$gene[1:10] + +sub_rna <- rna %>% + filter(gene %in% genes_selected) + +mean_exp_by_time <- sub_rna %>% + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time +``` + +We can build the line plot with duration of the infection on the x-axis +and the mean expression on the y-axis: + +```{r first-time-series, purl=TRUE} +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + + geom_line() +``` + +Unfortunately, this does not work because we plotted data for all the +genes together. 
We need to tell ggplot to draw a line for each gene by +modifying the aesthetic function to include `group = gene`: + +```{r time-series-by-gene, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + + geom_line() +``` + +We will be able to distinguish genes in the plot if we add colors (using +`color` also automatically groups the data): + +```{r time-series-with-colors, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() +``` + +## Faceting + +`ggplot2` has a special technique called _faceting_ that allows the user +to split one plot into multiple (sub) plots based on a factor included +in the dataset. These different subplots inherit the same properties +(axes limits, ticks, ...) to facilitate their direct comparison. We will +use it to make a line plot across time for each gene: + +```{r first-facet, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) +``` + +Here both x- and y-axis have the same scale for all the subplots. You +can change this default behavior by modifying `scales` in order to allow +a free scale for the y-axis: + +```{r first-facet-scales, purl=TRUE} +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Now we would like to split the line in each plot by the sex of the mice. 
+To do that we need to calculate the mean expression in the data frame +grouped by `gene`, `time`, and `sex`: + +```{r data-facet-by-gene-and-sex, purl=TRUE} +mean_exp_by_time_sex <- sub_rna %>% + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) + +mean_exp_by_time_sex +``` + +We can now make the faceted plot by splitting further by sex using +`color` (within a single plot): + +```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") +``` + +Usually plots with white background look more readable when printed. We +can set the background to white using the function `theme_bw()`. +Additionally, we can remove the grid: + +```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Use what you just learned to create a plot that depicts how the +average expression of each chromosome changes through the duration of +infection. + +::::::::::::::: solution + +## Solution + +```{r mean-exp-chromosome-time-series, purl=TRUE} +mean_exp_by_chromosome <- rna %>% + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + + geom_line() + + facet_wrap(~ chromosome_name, scales = "free_y") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The `facet_wrap` geometry extracts plots into an arbitrary number of +dimensions to allow them to cleanly fit on one page. 
On the other hand, +the `facet_grid` geometry allows you to explicitly specify how you want +your plots to be arranged via formula notation (`rows ~ columns`; a `.` +can be used as a placeholder that indicates only one row or column). + +Let's modify the previous plot to compare how the mean gene expression +of males and females has changed through time: + +```{r mean-exp-time-facet-sex-rows, purl=TRUE} +# One column, facet by rows +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(sex ~ .) +``` + +```{r mean-exp-time-facet-sex-columns, purl=TRUE} +# One row, facet by column +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + + geom_line() + + facet_grid(. ~ sex) +``` + +## `ggplot2` themes + +In addition to `theme_bw()`, which changes the plot background to white, +`ggplot2` comes with several other themes which can be useful to quickly +change the look of your visualization. The complete list of themes is +available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` and `theme_light()` are popular, and `theme_void()` +can be useful as a starting point to create a new hand-crafted theme. + +The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +package provides a wide variety of options (including an Excel 2003 +theme). The ggplot2 provides a list of +packages that extend the capabilities of `ggplot2`, including additional +themes. + +## Customisation + +Let's come back to the faceted plot of mean expression by time and gene, +colored by sex. + +Take a look at the ggplot2, +and think of ways you could improve the plot. 
+ +Now, we can change names of axes to something more informative than +'time' and 'mean_exp', and add a title to the figure: + +```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") +``` + +The axes have more informative names, but their readability can be +improved by increasing the font size: + +```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16)) +``` + +Note that it is also possible to change the fonts of your plots. If you +are on Windows, you may have to install the . + +We can further customize the color of x- and y-axis text, the color of +the grid, etc. We can also for example move the legend to the top by +setting `legend.position` to `"top"`. 
+ +```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + theme(text = element_text(size = 16), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +``` + +If you like the changes you created better than the default theme, you +can save them as an object to be able to easily apply them to other +plots you may create. Here is an example with the histogram we have +previously created. + +```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", + size = 12), + axis.text.y = element_text(colour = "royalblue4", + size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1")) + +ggplot(rna, aes(x = expression_log)) + + geom_histogram(bins = 20) + + blue_theme +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +With all of this information in hand, please take another five minutes +to either improve one of the plots generated in this exercise or +create a beautiful graph of your own. Use the RStudio ggplot2 +for inspiration. Here are some ideas: + +- See if you can change the thickness of the lines. +- Can you find a way to change the name of the legend? What about + its labels? 
(hint: look for a ggplot function starting with + `scale_`) +- Try using a different color palette or manually specifying the + colors for the lines (see + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + +::::::::::::::: solution + +## Solution + +For example, based on this plot: + +```{r, purl=TRUE} +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) +``` + +We can customize it the following ways: + +```{r, purl=TRUE} +# change the thickness of the lines +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line(size=1.5) + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + +# change the name of the legend and the labels +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_discrete(name = "Gender", labels = c("F", "M")) + +# using a different color palette +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + +# manually specifying the colors +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + theme_bw() + + theme(panel.grid = element_blank()) + + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Composing plots + +Faceting is a 
great tool for splitting one plot into multiple subplots, +but sometimes you may want to produce a single figure that contains +multiple independent plots, i.e. plots that are based on different +variables or even different data frames. + +Let's start by creating the two plots that we want to arrange next to +each other: + +The first graph counts the number of unique genes per chromosome. We +first need to reorder the levels of `chromosome_name` and filter the +unique genes per chromosome. We also change the scale of the y-axis to a +log10 scale for better readability. + +```{r sub1, purl=TRUE} +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + + scale_y_log10() + +count_gene_chromosome +``` + +Below, we also remove the legend altogether by setting the +`legend.position` to `"none"`. + +```{r sub2, purl=TRUE} +exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), + color=sex)) + + geom_boxplot(alpha = 0) + + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") + +exp_boxplot_sex +``` + +The [**patchwork**](https://github.com/thomasp85/patchwork) package +provides an elegant approach to combining figures using the `+` to +arrange figures (typically side by side). More specifically the `|` +explicitly arranges them side by side and `/` stacks them on top of each +other. 
+ +```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("patchwork") +``` + +```{r patchworkplot1, purl=TRUE} +library("patchwork") +count_gene_chromosome + exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex +``` + +```{r patchwork2, purl=TRUE} +count_gene_chromosome / exp_boxplot_sex +``` + +We can further control the layout of the final composition with +`plot_layout` to create more complex layouts: + +```{r patchwork3, purl=TRUE} +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +``` + +```{r patchwork4, purl=TRUE} +count_gene_chromosome + + (count_gene_chromosome + exp_boxplot_sex) + + exp_boxplot_sex + + plot_layout(ncol = 1) +``` + +The last plot can also be created using the `|` and `/` composers: + +```{r patchwork5, purl=TRUE} +count_gene_chromosome / + (count_gene_chromosome | exp_boxplot_sex) / + exp_boxplot_sex +``` + +Learn more about `patchwork` on its +[webpage](https://patchwork.data-imaginist.com/) or in this +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). + +Another option is the **`gridExtra`** package, which allows us to combine +separate ggplots into a single figure using `grid.arrange()`: + +```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} +install.packages("gridExtra") +``` + +```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +``` + +In addition to the `ncol` and `nrow` arguments, used to make simple +arrangements, there are tools for constructing more complex +layouts. + +## Exporting plots + +After creating your plot, you can save it to a file in your favorite +format. The Export tab in the **Plot** pane in RStudio will save your +plots at low resolution, which will not be accepted by many journals and +will not scale well for posters.
+ +Instead, use the `ggsave()` function, which allows you to easily change the +dimensions and resolution of your plot by adjusting the appropriate +arguments (`width`, `height` and `dpi`). + +Make sure you have the `fig_output/` folder in your working directory. + +```{r ggsave-example, eval=FALSE, purl=TRUE} +my_plot <- ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + + geom_line() + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + + theme_bw() + + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), + text = element_text(size = 16), + panel.grid = element_line(colour="lightsteelblue1"), + legend.position = "top") +ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, + height = 10) + +# This also works for grid.arrange() plots +combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) +``` + +Note: Because text sizes are fixed in points, the parameters `width` and +`height` also determine how large the text appears relative to the saved plot. + +```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} +### Final plotting challenge: +## With all of this information in hand, please take another five +## minutes to either improve one of the plots generated in this +## exercise or create a beautiful graph of your own. Use the RStudio +## ggplot2 cheat sheet for inspiration: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +``` + +## Other packages for visualisation + +`ggplot2` is a very powerful package that fits very nicely in our _tidy +data_ and _tidy tools_ pipeline. There are other visualization packages +in R that shouldn't be ignored.
+ +### Base graphics + +The default graphics system that comes with R, often called _base R +graphics_, is simple and fast. It is based on the _painter's or canvas +model_, where different outputs are directly overlaid on top of each +other (see figure @ref(fig:paintermodel)). This is a fundamental +difference from `ggplot2` (and from `lattice`, described below), which +return dedicated objects that are rendered on screen or in a file, and +that can even be updated. + +```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} +par(mfrow = c(1, 3)) +plot(1:20, main = "First layer, produced with plot(1:20)") + +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +abline(h = 10, col = "red") + +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +abline(h = 10, col = "red") +rect(5, 5, 15, 15, lwd = 3) +``` + +Another main difference is that base graphics' plotting functions try to +do the _right thing_ based on their input type, i.e. they will adapt +their behaviour based on the class of their input. This is again very +different from what we have in `ggplot2`, which only accepts data frames +as input and requires plots to be constructed bit by bit. + +```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) of vectors (left) or matrices (right)."} +par(mfrow = c(2, 2)) +boxplot(rnorm(100), + main = "Boxplot of rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot of matrix(rnorm(100), ncol = 10)") +hist(rnorm(100)) +hist(matrix(rnorm(100), ncol = 10)) +``` + +The out-of-the-box approach in base graphics can be very efficient for +simple, standard figures, which can be produced very quickly with a +single line of code and a single function such as `plot`, or `hist`, or +`boxplot`, ...
The defaults are however not always the most appealing +and tuning of figures, especially when they become more complex (for +example to produce facets), can become lengthy and cumbersome. + +### The lattice package + +The **`lattice`** package is similar to `ggplot2` in that it uses +data frames as input, returns graphical objects and supports faceting. +`lattice` however isn't based on the grammar of graphics and has a more +convoluted interface. + +A good reference for the `lattice` package is @latticebook. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Visualization in R + +:::::::::::::::::::::::::::::::::::::::::::::::::: From a3bb4ba2dd2a429da4e7b932620de5b976639726 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:15 +0900 Subject: [PATCH 032/334] New translations 60-next-steps.md (French) --- locale/fr/episodes/60-next-steps.Rmd | 464 +++++++++++++++++++++++++++ 1 file changed, 464 insertions(+) create mode 100644 locale/fr/episodes/60-next-steps.Rmd diff --git a/locale/fr/episodes/60-next-steps.Rmd b/locale/fr/episodes/60-next-steps.Rmd new file mode 100644 index 000000000..77da2a8ad --- /dev/null +++ b/locale/fr/episodes/60-next-steps.Rmd @@ -0,0 +1,464 @@ +--- +source: Rmd +title: Next steps +teaching: 45 +exercises: 45 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Introduce the Bioconductor project. +- Introduce the notion of data containers. +- Give an overview of the `SummarizedExperiment`, extensively used in + omics analyses. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What is a `SummarizedExperiment`? +- What is Bioconductor? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Next steps + +```{r, echo=FALSE, message=FALSE} +library("tidyverse") +``` + +Data in bioinformatics is often complex.
To deal with this, +developers define specialised data containers (termed classes) that +match the properties of the data they need to handle. + +This aspect is central to the **Bioconductor**[^Bioconductor] project, +which uses the same **core data infrastructure** across packages. This +certainly contributed to Bioconductor's success. Bioconductor package +developers are advised to make use of existing infrastructure to +provide coherence, interoperability, and stability to the project as a +whole. + +[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) project was + initiated by Robert Gentleman, one of the two creators of the R + language. Bioconductor provides tools dedicated to omics data + analysis. Bioconductor uses the R statistical programming language + and is open source and open development. + +To illustrate such an omics data container, we'll present the +`SummarizedExperiment` class. + +## SummarizedExperiment + +The figure below represents the anatomy of the SummarizedExperiment class. + +```{r SE, echo=FALSE, out.width="80%"} +knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") +``` + +Objects of the class SummarizedExperiment contain: + +- **One (or more) assay(s)** containing the quantitative omics data + (expression data), stored as a matrix-like object. Features (genes, + transcripts, proteins, ...) are defined along the rows, and samples + along the columns. + +- A **sample metadata** slot containing sample co-variates, stored as a + data frame. Rows from this table represent samples (rows match exactly the + columns of the expression data). + +- A **feature metadata** slot containing feature co-variates, stored as + a data frame. The rows of this data frame match exactly the rows of the + expression data.
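
As a minimal sketch of the three components listed above, the toy object below is built from made-up data. The names `toy_counts`, `toy_samples` and `toy_features` are illustrative assumptions and are not part of the lesson data; the sketch also assumes that the `SummarizedExperiment` package is already installed.

```r
library("SummarizedExperiment")

## Toy assay: 4 features (rows) x 3 samples (columns); all names are made up
toy_counts <- matrix(1:12, nrow = 4,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("sample", 1:3)))

## Sample metadata: one row per column of the assay
toy_samples <- data.frame(group = c("a", "b", "a"),
                          row.names = colnames(toy_counts))

## Feature metadata: one row per row of the assay
toy_features <- data.frame(biotype = c("x", "x", "y", "y"),
                           row.names = rownames(toy_counts))

toy_se <- SummarizedExperiment(assays = list(counts = toy_counts),
                               colData = toy_samples,
                               rowData = toy_features)

## The dimensions of the three components agree
dim(toy_se)
nrow(colData(toy_se)) == ncol(assay(toy_se))
nrow(rowData(toy_se)) == nrow(assay(toy_se))

## Dropping the first sample also drops its row in the sample metadata
toy_sub <- toy_se[, -1]
dim(toy_sub)
rownames(colData(toy_sub))
```

The sketch only illustrates the layout; the lesson's real object is assembled from the CSV files in the next section.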
+ +The coordinated nature of the `SummarizedExperiment` guarantees that +during data manipulation, the dimensions of the different slots will +always match (i.e. the columns in the expression data and the rows in +the sample metadata, as well as the rows in the expression data and +feature metadata). For example, if we had to +exclude one sample from the assay, it would be automatically removed +from the sample metadata in the same operation. + +The metadata slots can grow additional co-variates +(columns) without affecting the other structures. + +### Creating a SummarizedExperiment + +In order to create a `SummarizedExperiment`, we will create the +individual components, i.e. the count matrix, the sample and gene +metadata from CSV files. This is typically how RNA-Seq data are +provided (after raw data have been processed). + +```{r, echo=FALSE, message=FALSE} +rna <- read_csv("data/rnaseq.csv") + +## count matrix +counts <- rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) + +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() +rownames(count_matrix) <- counts$gene + +## sample annotation +sample_metadata <- rna %>% + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + +## remove redundancy +sample_metadata <- unique(sample_metadata) + +## gene annotation +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) + +# remove redundancy +gene_metadata <- unique(gene_metadata) + +## write to csv +write.csv(count_matrix, file = "data/count_matrix.csv") +write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) +write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) +``` + +- **An expression matrix**: we load the count matrix, specifying
that + the first column contains row/gene names, and convert the + `data.frame` to a `matrix`. You can download it + [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). + +```{r} +count_matrix <- read.csv("data/count_matrix.csv", + row.names = 1) %>% + as.matrix() + +count_matrix[1:5, ] +dim(count_matrix) +``` + +- **A table describing the samples**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). + +```{r} +sample_metadata <- read.csv("data/sample_metadata.csv") +sample_metadata +dim(sample_metadata) +``` + +- **A table describing the genes**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). + +```{r} +gene_metadata <- read.csv("data/gene_metadata.csv") +gene_metadata[1:10, 1:4] +dim(gene_metadata) +``` + +We will create a `SummarizedExperiment` from these tables: + +- The count matrix will be used as the **`assay`** + +- The table describing the samples will be used as the **sample + metadata** slot + +- The table describing the genes will be used as the **feature + metadata** slot + +To do this we can put the different parts together using the +`SummarizedExperiment` constructor: + +```{r, message=FALSE, warning=FALSE} +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") +``` + +First, we make sure that the samples are in the same order in the +count matrix and the sample annotation, and the same for the genes in +the count matrix and the gene annotation.
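
If the orders did not agree, one possible fix (a sketch on made-up data, not a step from the lesson) is to reorder the metadata with base R's `match()` so that it follows the column names of the matrix. The objects `toy_counts` and `toy_samples` below are hypothetical.

```r
## Toy count matrix with samples in the order s1, s2, s3 (made-up data)
toy_counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

## Toy sample table listed in a different order
toy_samples <- data.frame(sample = c("s3", "s1", "s2"),
                          group = c("a", "b", "a"))

## Position of each matrix column name in the sample table
idx <- match(colnames(toy_counts), toy_samples$sample)

## Every matrix column must have a matching row in the table
stopifnot(!anyNA(idx))

## Reorder the table so that it follows the matrix columns
toy_samples <- toy_samples[idx, ]

## The same kind of check as used in the lesson now passes
stopifnot(colnames(toy_counts) == toy_samples$sample)
toy_samples
```

The same idea would apply to the rows of the matrix and a gene table.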
+ +```{r} +stopifnot(rownames(count_matrix) == gene_metadata$gene) +stopifnot(colnames(count_matrix) == sample_metadata$sample) +``` + +```{r} +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) +se +``` + +### Saving data + +Exporting data to a spreadsheet, as we did in a previous episode, has +several limitations, such as those described in the first chapter +(possible inconsistencies with `,` and `.` for decimal separators and +lack of variable type definitions). Furthermore, exporting data to a +spreadsheet is only relevant for rectangular data such as dataframes +and matrices. + +A more general way to save data, that is specific to R and is +guaranteed to work on any operating system, is to use the `saveRDS` +function. Saving objects like this will generate a binary +representation on disk (using the `rds` file extension here), which +can be loaded back into R using the `readRDS` function. + +```{r, eval=FALSE} +saveRDS(se, file = "data_output/se.rds") +rm(se) +se <- readRDS("data_output/se.rds") +head(se) +``` + +To conclude, when it comes to saving data from R that will be loaded +again in R, saving and loading with `saveRDS` and `readRDS` is the +preferred approach. If tabular data need to be shared with somebody +that is not using R, then exporting to a text-based spreadsheet is a +good alternative. + +Using this data structure, we can access the expression matrix with +the `assay` function: + +```{r} +head(assay(se)) +dim(assay(se)) +``` + +We can access the sample metadata using the `colData` function: + +```{r} +colData(se) +dim(colData(se)) +``` + +We can also access the feature metadata using the `rowData` function: + +```{r} +head(rowData(se)) +dim(rowData(se)) +``` + +### Subsetting a SummarizedExperiment + +SummarizedExperiment objects can be subset just like data frames, with +numerics, characters or logicals.
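
The three kinds of index can be sketched on a small toy object. Everything below uses made-up data (the object `toy_se` and its feature and sample names are illustrative assumptions), and the sketch assumes that the `SummarizedExperiment` package is installed.

```r
library("SummarizedExperiment")

## Toy object with 4 features and 3 samples (made-up data)
toy_se <- SummarizedExperiment(
    assays = list(counts = matrix(1:12, nrow = 4,
                                  dimnames = list(paste0("gene", 1:4),
                                                  paste0("sample", 1:3)))))

## Three ways of selecting the first two features
by_number  <- toy_se[1:2, ]                          # numeric index
by_name    <- toy_se[c("gene1", "gene2"), ]          # character index
by_logical <- toy_se[c(TRUE, TRUE, FALSE, FALSE), ]  # logical index

## All three selections hold the same assay values
all(assay(by_number) == assay(by_name))
all(assay(by_number) == assay(by_logical))
```

Columns (samples) can be selected in the same three ways through the second index.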
+ +Below, we create a new instance of class SummarizedExperiment that +contains only the first 5 features for the first 3 samples. + +```{r} +se1 <- se[1:5, 1:3] +se1 +``` + +```{r} +colData(se1) +rowData(se1) +``` + +We can also use the `colData()` function to subset on something from +the sample metadata or the `rowData()` to subset on something from the +feature metadata. For example, here we keep only miRNAs and the +non-infected samples: + +```{r} +se1 <- se[rowData(se)$gene_biotype == "miRNA", + colData(se)$infection == "NonInfected"] +se1 +assay(se1) +colData(se1) +rowData(se1) +``` + +<!--For the following exercise, you should download the SE.rda object +(that contains the `se` object), and open the file using the 'load()' +function.--> + +<!-- ```{r, eval = FALSE, echo = FALSE} --> + +<!-- download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda", --> + +<!-- destfile = "data/SE.rda") --> + +<!-- load("data/SE.rda") --> + +<!-- ``` --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Extract the gene expression levels of the first 3 genes in samples +at time 0 and at time 8. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +assay(se)[1:3, colData(se)$time != 4] + +# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Verify that you get the same values using the long `rna` table. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +rna |> + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) |> + filter(time != 4) |> select(expression) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The long table and the `SummarizedExperiment` contain the same +information, but are simply structured differently.
Each approach has its +own advantages: the former is a good fit for the `tidyverse` packages, +while the latter is the preferred structure for many bioinformatics and +statistical processing steps, such as a typical RNA-Seq analysis using +the `DESeq2` package. + +#### Adding variables to metadata + +We can also add information to the metadata. +Suppose that you want to add the center where the samples were collected... + +```{r} +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se) +``` + +This illustrates that the metadata slots can grow indefinitely without +affecting the other structures! + +### tidySummarizedExperiment + +You may be wondering, can we use tidyverse commands to interact with +`SummarizedExperiment` objects? The answer is yes, we can with the +`tidySummarizedExperiment` package. + +Remember what our SummarizedExperiment object looks like: + +```{r, message=FALSE} +se +``` + +Load `tidySummarizedExperiment` and then take a look at the `se` object +again. + +```{r, message=FALSE} +#BiocManager::install("tidySummarizedExperiment") +library("tidySummarizedExperiment") + +se +``` + +It's still a `SummarizedExperiment` object, so it maintains the efficient +structure, but now we can view it as a tibble. Note that the first line of +the output says this: it's a `SummarizedExperiment`-`tibble` +abstraction. We can also see in the second line of the output the +number of transcripts and samples. + +If we want to revert to the standard `SummarizedExperiment` view, we +can do that. + +```{r} +options("restore_SummarizedExperiment_show" = TRUE) +se +``` + +But here we use the tibble view. + +```{r} +options("restore_SummarizedExperiment_show" = FALSE) +se +``` + +We can now use tidyverse commands to interact with the +`SummarizedExperiment` object. + +We can use `filter` to filter for rows using a condition e.g. to view +all rows for one sample.
+ +```{r} +se %>% filter(.sample == "GSM2545336") +``` + +We can use `select` to specify columns we want to view. + +```{r} +se %>% select(.sample) +``` + +We can use `mutate` to add metadata info. + +```{r} +se %>% mutate(center = "Heidelberg University") +``` + +We can also combine commands with the tidyverse pipe `%>%`. For +example, we could combine `group_by` and `summarise` to get the total +counts for each sample. + +```{r} +se %>% + group_by(.sample) %>% + summarise(total_counts=sum(counts)) +``` + +We can treat the tidy SummarizedExperiment object as a normal tibble +for plotting. + +Here we plot the distribution of counts per sample. + +```{r tidySE-plot} +se %>% + ggplot(aes(counts + 1, group=.sample, color=infection)) + + geom_density() + + scale_x_log10() + + theme_bw() +``` + +For more information on tidySummarizedExperiment, see the package +website +[here](https://stemangiola.github.io/tidySummarizedExperiment/). + +**Take-home message** + +- `SummarizedExperiment` represents an efficient way to store and + handle omics data. + +- They are used in many Bioconductor packages. + +If you follow the next training focused on RNA sequencing analysis, +you will learn to use the Bioconductor `DESeq2` package to do some +differential expression analyses. The whole analysis of the `DESeq2` +package is handled in a `SummarizedExperiment`. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Bioconductor is a project that provides support and packages for the + comprehension of high-throughput biology data. +- A `SummarizedExperiment` is a type of object useful to store and + manage high-throughput omics data.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From eff5a41b4681382c36470f560c3c4aa615ce02d0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:16 +0900 Subject: [PATCH 033/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 464 +++++++++++++++++++++++++++ 1 file changed, 464 insertions(+) create mode 100644 locale/es/episodes/60-next-steps.Rmd diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd new file mode 100644 index 000000000..742cc9c2b --- /dev/null +++ b/locale/es/episodes/60-next-steps.Rmd @@ -0,0 +1,464 @@ +--- +source: Rmd +title: Next steps +teaching: 45 +exercises: 45 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objetivos + +- Introduce the Bioconductor project. +- Introduce the notion of data containers. +- Give an overview of the `SummarizedExperiment`, extensively used in + omics analyses. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What is a `SummarizedExperiment`? +- What is Bioconductor? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Next steps + +```{r, echo=FALSE, message=FALSE} +library("tidyverse") +``` + +Data in bioinformatics is often complex. To deal with this, +developers define specialised data containers (termed classes) that +match the properties of the data they need to handle. + +This aspect is central to the **Bioconductor**[^Bioconductor] project, +which uses the same **core data infrastructure** across packages. This +certainly contributed to Bioconductor's success. Bioconductor package +developers are advised to make use of existing infrastructure to +provide coherence, interoperability, and stability to the project as a +whole. + +[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) project was + initiated by Robert Gentleman, one of the two creators of the R + language.
Bioconductor provides tools dedicated to omics data + analysis. Bioconductor uses the R statistical programming language + and is open source and open development. + +To illustrate such an omics data container, we'll present the +`SummarizedExperiment` class. + +## SummarizedExperiment + +The figure below represents the anatomy of the SummarizedExperiment class. + +```{r SE, echo=FALSE, out.width="80%"} +knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") +``` + +Objects of the class SummarizedExperiment contain: + +- **One (or more) assay(s)** containing the quantitative omics data + (expression data), stored as a matrix-like object. Features (genes, + transcripts, proteins, ...) are defined along the rows, and samples + along the columns. + +- A **sample metadata** slot containing sample co-variates, stored as a + data frame. Rows from this table represent samples (rows match exactly the + columns of the expression data). + +- A **feature metadata** slot containing feature co-variates, stored as + a data frame. The rows of this data frame match exactly the rows of the + expression data. + +The coordinated nature of the `SummarizedExperiment` guarantees that +during data manipulation, the dimensions of the different slots will +always match (i.e. the columns in the expression data and the rows in +the sample metadata, as well as the rows in the expression data and +feature metadata). For example, if we had to +exclude one sample from the assay, it would be automatically removed +from the sample metadata in the same operation. + +The metadata slots can grow additional co-variates +(columns) without affecting the other structures. + +### Creating a SummarizedExperiment + +In order to create a `SummarizedExperiment`, we will create the +individual components, i.e. the count matrix, the sample and gene +metadata from CSV files.
This is typically how RNA-Seq data are +provided (after raw data have been processed). + +```{r, echo=FALSE, message=FALSE} +rna <- read_csv("data/rnaseq.csv") + +## count matrix +counts <- rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) + +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() +rownames(count_matrix) <- counts$gene + +## sample annotation +sample_metadata <- rna %>% + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + +## remove redundancy +sample_metadata <- unique(sample_metadata) + +## gene annotation +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) + +# remove redundancy +gene_metadata <- unique(gene_metadata) + +## write to csv +write.csv(count_matrix, file = "data/count_matrix.csv") +write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) +write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) +``` + +- **An expression matrix**: we load the count matrix, specifying that + the first column contains row/gene names, and convert the + `data.frame` to a `matrix`. You can download it + [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). + +```{r} +count_matrix <- read.csv("data/count_matrix.csv", + row.names = 1) %>% + as.matrix() + +count_matrix[1:5, ] +dim(count_matrix) +``` + +- **A table describing the samples**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). + +```{r} +sample_metadata <- read.csv("data/sample_metadata.csv") +sample_metadata +dim(sample_metadata) +``` + +- **A table describing the genes**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv).
+ +```{r} +gene_metadata <- read.csv("data/gene_metadata.csv") +gene_metadata[1:10, 1:4] +dim(gene_metadata) +``` + +We will create a `SummarizedExperiment` from these tables: + +- The count matrix that will be used as the **`assay`** + +- The table describing the samples will be used as the **sample + metadata** slot + +- The table describing the genes will be used as the **features + metadata** slot + +To do this we can put the different parts together using the +`SummarizedExperiment` constructor: + +```{r, message=FALSE, warning=FALSE} +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") +``` + +First, we make sure that the samples are in the same order in the +count matrix and the sample annotation, and the same for the genes in +the count matrix and the gene annotation. + +```{r} +stopifnot(rownames(count_matrix) == gene_metadata$gene) +stopifnot(colnames(count_matrix) == sample_metadata$sample) +``` + +```{r} +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) +se +``` + +### Saving data + +Exporting data to a spreadsheet, as we did in a previous episode, has +several limitations, such as those described in the first chapter +(possible inconsistencies with `,` and `.` for decimal separators and +lack of variable type definitions). Furthermore, exporting data to a +spreadsheet is only relevant for rectangular data such as dataframes +and matrices. + +A more general way to save data, that is specific to R and is +guaranteed to work on any operating system, is to use the `saveRDS` +function. Saving objects like this will generate a binary +representation on disk (using the `rds` file extension here), which +can be loaded back into R using the `readRDS` function. 
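
A minimal sketch of such a round trip, using a temporary file and a small made-up list rather than the lesson data (the names `toy_object` and `rds_file` are illustrative assumptions):

```r
## A small made-up object; any R object could be used here
toy_object <- list(id = "toy", values = c(1.5, 2.5), flags = c(TRUE, FALSE))

## Write to a temporary file so that the working directory is left untouched
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy_object, file = rds_file)

## Read the object back and check that nothing was lost or altered
restored <- readRDS(rds_file)
identical(toy_object, restored)

## Clean up the temporary file
unlink(rds_file)
```

Unlike a text export, the restored object keeps its structure and variable types.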
+ +```{r, eval=FALSE} +saveRDS(se, file = "data_output/se.rds") +rm(se) +se <- readRDS("data_output/se.rds") +head(se) +``` + +To conclude, when it comes to saving data from R that will be loaded +again in R, saving and loading with `saveRDS` and `readRDS` is the +preferred approach. If tabular data need to be shared with somebody +that is not using R, then exporting to a text-based spreadsheet is a +good alternative. + +Using this data structure, we can access the expression matrix with +the `assay` function: + +```{r} +head(assay(se)) +dim(assay(se)) +``` + +We can access the sample metadata using the `colData` function: + +```{r} +colData(se) +dim(colData(se)) +``` + +We can also access the feature metadata using the `rowData` function: + +```{r} +head(rowData(se)) +dim(rowData(se)) +``` + +### Subsetting a SummarizedExperiment + +SummarizedExperiment objects can be subset just like data frames, with +numerics, characters or logicals. + +Below, we create a new instance of class SummarizedExperiment that +contains only the first 5 features for the first 3 samples. + +```{r} +se1 <- se[1:5, 1:3] +se1 +``` + +```{r} +colData(se1) +rowData(se1) +``` + +We can also use the `colData()` function to subset on something from +the sample metadata or the `rowData()` to subset on something from the +feature metadata.
For example, here we keep only miRNAs and the +non-infected samples: + +```{r} +se1 <- se[rowData(se)$gene_biotype == "miRNA", + colData(se)$infection == "NonInfected"] +se1 +assay(se1) +colData(se1) +rowData(se1) +``` + +<!--For the following exercise, you should download the SE.rda object +(that contains the `se` object), and open the file using the 'load()' +function.--> + +<!-- ```{r, eval = FALSE, echo = FALSE} --> + +<!-- download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda", --> + +<!-- destfile = "data/SE.rda") --> + +<!-- load("data/SE.rda") --> + +<!-- ``` --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Extract the gene expression levels of the first 3 genes in samples +at time 0 and at time 8. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +assay(se)[1:3, colData(se)$time != 4] + +# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Verify that you get the same values using the long `rna` table. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +rna |> + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) |> + filter(time != 4) |> select(expression) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The long table and the `SummarizedExperiment` contain the same +information, but are simply structured differently. Each approach has its +own advantages: the former is a good fit for the `tidyverse` packages, +while the latter is the preferred structure for many bioinformatics and +statistical processing steps, such as a typical RNA-Seq analysis using +the `DESeq2` package. + +#### Adding variables to metadata + +We can also add information to the metadata.
+Suppose that you want to add the center where the samples were collected... + +```{r} +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se) +``` + +This illustrates that the metadata slots can grow indefinitely without +affecting the other structures! + +### tidySummarizedExperiment + +You may be wondering, can we use tidyverse commands to interact with +`SummarizedExperiment` objects? The answer is yes, we can with the +`tidySummarizedExperiment` package. + +Remember what our SummarizedExperiment object looks like: + +```{r, message=FALSE} +se +``` + +Load `tidySummarizedExperiment` and then take a look at the se object +again. + +```{r, message=FALSE} +#BiocManager::install("tidySummarizedExperiment") +library("tidySummarizedExperiment") + +se +``` + +It's still a `SummarizedExperiment` object, so maintains the efficient +structure, but now we can view it as a tibble. Note the first line of +the output says this, it's a `SummarizedExperiment`-`tibble` +abstraction. We can also see in the second line of the output the +number of transcripts and samples. + +If we want to revert to the standard `SummarizedExperiment` view, we +can do that. + +```{r} +options("restore_SummarizedExperiment_show" = TRUE) +se +``` + +But here we use the tibble view. + +```{r} +options("restore_SummarizedExperiment_show" = FALSE) +se +``` + +We can now use tidyverse commands to interact with the +`SummarizedExperiment` object. + +We can use `filter` to filter for rows using a condition e.g. to view +all rows for one sample. + +```{r} +se %>% filter(.sample == "GSM2545336") +``` + +We can use `select` to specify columns we want to view. + +```{r} +se %>% select(.sample) +``` + +We can use `mutate` to add metadata info. + +```{r} +se %>% mutate(center = "Heidelberg University") +``` + +We can also combine commands with the tidyverse pipe `%>%`. For +example, we could combine `group_by` and `summarise` to get the total +counts for each sample. 
+ +```{r} +se %>% + group_by(.sample) %>% + summarise(total_counts=sum(counts)) +``` + +We can treat the tidy SummarizedExperiment object as a normal tibble +for plotting. + +Here we plot the distribution of counts per sample. + +```{r tidySE-plot} +se %>% + ggplot(aes(counts + 1, group=.sample, color=infection)) + + geom_density() + + scale_x_log10() + + theme_bw() +``` + +For more information on tidySummarizedExperiment, see the package +website +[here](https://stemangiola.github.io/tidySummarizedExperiment/). + +**Take-home message** + +- `SummarizedExperiment` represents an efficient way to store and + handle omics data. + +- It is used in many Bioconductor packages. + +If you follow the next training focused on RNA sequencing analysis, +you will learn to use the Bioconductor `DESeq2` package to do some +differential expression analyses. The whole analysis of the `DESeq2` +package is handled in a `SummarizedExperiment`. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Bioconductor is a project that provides support and packages for the + comprehension of high-throughput biology data. +- A `SummarizedExperiment` is a type of object useful to store and + manage high-throughput omics data.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From f2e5701dd203fe8d8f89180e7736019789c2393c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:17 +0900 Subject: [PATCH 034/334] New translations 60-next-steps.md (Japanese) --- locale/ja/episodes/60-next-steps.Rmd | 464 +++++++++++++++++++++++++++ 1 file changed, 464 insertions(+) create mode 100644 locale/ja/episodes/60-next-steps.Rmd diff --git a/locale/ja/episodes/60-next-steps.Rmd b/locale/ja/episodes/60-next-steps.Rmd new file mode 100644 index 000000000..aa91aaaf6 --- /dev/null +++ b/locale/ja/episodes/60-next-steps.Rmd @@ -0,0 +1,464 @@ +--- +source: Rmd +title: Next steps +teaching: 45 +exercises: 45 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Introduce the Bioconductor project. +- Introduce the notion of data containers. +- Give an overview of the `SummarizedExperiment`, extensively used in + omics analyses. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What is a `SummarizedExperiment`? +- What is Bioconductor? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Next steps + +```{r, echo=FALSE, message=FALSE} +library("tidyverse") +``` + +Data in bioinformatics is often complex. To deal with this, +developers define specialised data containers (termed classes) that +match the properties of the data they need to handle. + +This aspect is central to the **Bioconductor**[^Bioconductor] project +which uses the same **core data infrastructure** across packages. This +certainly contributed to Bioconductor's success. Bioconductor package +developers are advised to make use of existing infrastructure to +provide coherence, interoperability, and stability to the project as a +whole. + +[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) project was + initiated by Robert Gentleman, one of the two creators of the R + language.
Bioconductor provides tools dedicated to omics data + analysis. Bioconductor uses the R statistical programming language + and is open source and open development. + +To illustrate such an omics data container, we'll present the +`SummarizedExperiment` class. + +## SummarizedExperiment + +The figure below represents the anatomy of the SummarizedExperiment class. + +```{r SE, echo=FALSE, out.width="80%"} +knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") +``` + +Objects of the class SummarizedExperiment contain : + +- **One (or more) assay(s)** containing the quantitative omics data + (expression data), stored as a matrix-like object. Features (genes, + transcripts, proteins, ...) are defined along the rows, and samples + along the columns. + +- A **sample metadata** slot containing sample co-variates, stored as a + data frame. Rows from this table represent samples (rows match exactly the + columns of the expression data). + +- A **feature metadata** slot containing feature co-variates, stored as + a data frame. The rows of this data frame match exactly the rows of the + expression data. + +The coordinated nature of the `SummarizedExperiment` guarantees that +during data manipulation, the dimensions of the different slots will +always match (i.e the columns in the expression data and then rows in +the sample metadata, as well as the rows in the expression data and +feature metadata) during data manipulation. For example, if we had to +exclude one sample from the assay, it would be automatically removed +from the sample metadata in the same operation. + +The metadata slots can grow additional co-variates +(columns) without affecting the other structures. + +### Creating a SummarizedExperiment + +In order to create a `SummarizedExperiment`, we will create the +individual components, i.e the count matrix, the sample and gene +metadata from csv files. 
These are typically how RNA-Seq data are +provided (after raw data have been processed). + +```{r, echo=FALSE, message=FALSE} +rna <- read_csv("data/rnaseq.csv") + +## count matrix +counts <- rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) + +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() +rownames(count_matrix) <- counts$gene + +## sample annotation +sample_metadata <- rna %>% + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + +## remove redundancy +sample_metadata <- unique(sample_metadata) + +## gene annotation +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) + +# remove redundancy +gene_metadata <- unique(gene_metadata) + +## write to csv +write.csv(count_matrix, file = "data/count_matrix.csv") +write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) +write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) +``` + +- **An expression matrix**: we load the count matrix, specifying that + the first columns contains row/gene names, and convert the + `data.frame` to a `matrix`. You can download it + [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). + +```{r} +count_matrix <- read.csv("data/count_matrix.csv", + row.names = 1) %>% + as.matrix() + +count_matrix[1:5, ] +dim(count_matrix) +``` + +- **A table describing the samples**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). + +```{r} +sample_metadata <- read.csv("data/sample_metadata.csv") +sample_metadata +dim(sample_metadata) +``` + +- **A table describing the genes**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). 
+ +```{r} +gene_metadata <- read.csv("data/gene_metadata.csv") +gene_metadata[1:10, 1:4] +dim(gene_metadata) +``` + +We will create a `SummarizedExperiment` from these tables: + +- The count matrix that will be used as the **`assay`** + +- The table describing the samples will be used as the **sample + metadata** slot + +- The table describing the genes will be used as the **features + metadata** slot + +To do this we can put the different parts together using the +`SummarizedExperiment` constructor: + +```{r, message=FALSE, warning=FALSE} +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") +``` + +First, we make sure that the samples are in the same order in the +count matrix and the sample annotation, and the same for the genes in +the count matrix and the gene annotation. + +```{r} +stopifnot(rownames(count_matrix) == gene_metadata$gene) +stopifnot(colnames(count_matrix) == sample_metadata$sample) +``` + +```{r} +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) +se +``` + +### Saving data + +Exporting data to a spreadsheet, as we did in a previous episode, has +several limitations, such as those described in the first chapter +(possible inconsistencies with `,` and `.` for decimal separators and +lack of variable type definitions). Furthermore, exporting data to a +spreadsheet is only relevant for rectangular data such as dataframes +and matrices. + +A more general way to save data, that is specific to R and is +guaranteed to work on any operating system, is to use the `saveRDS` +function. Saving objects like this will generate a binary +representation on disk (using the `rds` file extension here), which +can be loaded back into R using the `readRDS` function. 
+ +```{r, eval=FALSE} +saveRDS(se, file = "data_output/se.rds") +rm(se) +se <- readRDS("data_output/se.rds") +head(se) +``` + +To conclude, when it comes to saving data from R that will be loaded +again in R, saving and loading with `saveRDS` and `readRDS` is the +preferred approach. If tabular data need to be shared with somebody +that is not using R, then exporting to a text-based spreadsheet is a +good alternative. + +Using this data structure, we can access the expression matrix with +the `assay` function: + +```{r} +head(assay(se)) +dim(assay(se)) +``` + +We can access the sample metadata using the `colData` function: + +```{r} +colData(se) +dim(colData(se)) +``` + +We can also access the feature metadata using the `rowData` function: + +```{r} +head(rowData(se)) +dim(rowData(se)) +``` + +### Subsetting a SummarizedExperiment + +A SummarizedExperiment can be subset just like a data frame, with +numerics, characters, or logicals. + +Below, we create a new instance of class SummarizedExperiment that +contains only the first 5 features for the first 3 samples. + +```{r} +se1 <- se[1:5, 1:3] +se1 +``` + +```{r} +colData(se1) +rowData(se1) +``` + +We can also use the `colData()` function to subset on something from +the sample metadata or the `rowData()` to subset on something from the +feature metadata.
For example, here we keep only miRNAs and the +non-infected samples: + +```{r} +se1 <- se[rowData(se)$gene_biotype == "miRNA", + colData(se)$infection == "NonInfected"] +se1 +assay(se1) +colData(se1) +rowData(se1) +``` + +<!--For the following exercise, you should download the SE.rda object +(that contains the `se` object), and open the file using the 'load()' +function.--> + +<!-- ```{r, eval = FALSE, echo = FALSE} --> + +<!-- download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda", --> + +<!-- destfile = "data/SE.rda") --> + +<!-- load("data/SE.rda") --> + +<!-- ``` --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Extract the gene expression levels of the first 3 genes in samples +at time 0 and at time 8. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +assay(se)[1:3, colData(se)$time != 4] + +# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Verify that you get the same values using the long `rna` table. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +rna |> + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) |> + filter(time != 4) |> select(expression) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The long table and the `SummarizedExperiment` contain the same +information, but are simply structured differently. Each approach has its +own advantages: the former is a good fit for the `tidyverse` packages, +while the latter is the preferred structure for many bioinformatics and +statistical processing steps. One example is a typical RNA-Seq analysis using +the `DESeq2` package. + +#### Adding variables to metadata + +We can also add information to the metadata.
+Suppose that you want to add the center where the samples were collected... + +```{r} +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se) +``` + +This illustrates that the metadata slots can grow indefinitely without +affecting the other structures! + +### tidySummarizedExperiment + +You may be wondering, can we use tidyverse commands to interact with +`SummarizedExperiment` objects? The answer is yes, we can with the +`tidySummarizedExperiment` package. + +Remember what our SummarizedExperiment object looks like: + +```{r, message=FALSE} +se +``` + +Load `tidySummarizedExperiment` and then take a look at the se object +again. + +```{r, message=FALSE} +#BiocManager::install("tidySummarizedExperiment") +library("tidySummarizedExperiment") + +se +``` + +It's still a `SummarizedExperiment` object, so maintains the efficient +structure, but now we can view it as a tibble. Note the first line of +the output says this, it's a `SummarizedExperiment`-`tibble` +abstraction. We can also see in the second line of the output the +number of transcripts and samples. + +If we want to revert to the standard `SummarizedExperiment` view, we +can do that. + +```{r} +options("restore_SummarizedExperiment_show" = TRUE) +se +``` + +But here we use the tibble view. + +```{r} +options("restore_SummarizedExperiment_show" = FALSE) +se +``` + +We can now use tidyverse commands to interact with the +`SummarizedExperiment` object. + +We can use `filter` to filter for rows using a condition e.g. to view +all rows for one sample. + +```{r} +se %>% filter(.sample == "GSM2545336") +``` + +We can use `select` to specify columns we want to view. + +```{r} +se %>% select(.sample) +``` + +We can use `mutate` to add metadata info. + +```{r} +se %>% mutate(center = "Heidelberg University") +``` + +We can also combine commands with the tidyverse pipe `%>%`. For +example, we could combine `group_by` and `summarise` to get the total +counts for each sample. 
+ +```{r} +se %>% + group_by(.sample) %>% + summarise(total_counts=sum(counts)) +``` + +We can treat the tidy SummarizedExperiment object as a normal tibble +for plotting. + +Here we plot the distribution of counts per sample. + +```{r tidySE-plot} +se %>% + ggplot(aes(counts + 1, group=.sample, color=infection)) + + geom_density() + + scale_x_log10() + + theme_bw() +``` + +For more information on tidySummarizedExperiment, see the package +website +[here](https://stemangiola.github.io/tidySummarizedExperiment/). + +**Take-home message** + +- `SummarizedExperiment` represents an efficient way to store and + handle omics data. + +- It is used in many Bioconductor packages. + +If you follow the next training focused on RNA sequencing analysis, +you will learn to use the Bioconductor `DESeq2` package to do some +differential expression analyses. The whole analysis of the `DESeq2` +package is handled in a `SummarizedExperiment`. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Bioconductor is a project that provides support and packages for the + comprehension of high-throughput biology data. +- A `SummarizedExperiment` is a type of object useful to store and + manage high-throughput omics data.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From a9f9a8b21670d8d99e3e5e60a17090b0c2ebe029 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:19 +0900 Subject: [PATCH 035/334] New translations 60-next-steps.md (Portuguese) --- locale/pt/episodes/60-next-steps.Rmd | 449 +++++++++++++++++++++++++++ 1 file changed, 449 insertions(+) create mode 100644 locale/pt/episodes/60-next-steps.Rmd diff --git a/locale/pt/episodes/60-next-steps.Rmd b/locale/pt/episodes/60-next-steps.Rmd new file mode 100644 index 000000000..3ecb0f797 --- /dev/null +++ b/locale/pt/episodes/60-next-steps.Rmd @@ -0,0 +1,449 @@ +--- +source: Rmd +title: Next steps +teaching: 45 +exercises: 45 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Introduce the Bioconductor project. +- Introduce the notion of data containers. +- Give an overview of the `SummarizedExperiment`, extensively used in + omics analyses. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What is a `SummarizedExperiment`? +- What is Bioconductor? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Next steps + +```{r, echo=FALSE, message=FALSE} +library("tidyverse") +``` + +Data in bioinformatics is often complex. Para lidar com este problema, os programadores de +definem contentores de dados especializados (denominados de classes) que correspondem às propriedades dos dados que necessitam ser manipulados. + +Este aspeto é fundamental para o projeto do **Bioconductor**[^Bioconductor] que utiliza a mesma **infraestrutura de dados central** em todos os pacotes. Esta estrutura contribuiu certamente para o sucesso do Bioconductor. Os programadores de pacotes do Bioconductor +são aconselhados a utilizar a infraestrutura existente para +proporcionar coerência, interoperabilidade e estabilidade ao projeto como um todo. 
+ +[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was + initiated by Robert Gentleman, one of the two creators of the R + language. O Bioconductor fornece ferramentas dedicadas a análise de dados ômicos. O Bioconductor utiliza a linguagem de programação estatística R e tem o código e o desenvolvimento aberto. + +Para ilustrar um contêiner de dados ômicos, apresentaremos a classe +`SummarizedExperiment`. + +## SummarizedExperiment + +The figure below represents the anatomy of the SummarizedExperiment class. + +```{r SE, echo=FALSE, out.width="80%"} +knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") +``` + +Objects of the class SummarizedExperiment contain : + +- **One (or more) assay(s)** containing the quantitative omics data + (expression data), stored as a matrix-like object. Features (genes, + transcripts, proteins, ...) are defined along the rows, and samples + along the columns. + +- A **sample metadata** slot containing sample co-variates, stored as a + data frame. As linhas desta tabela representam amostras (as linhas correspondem exatamente às colunas + dos dados de expressão gênica). + +- A **feature metadata** slot containing feature co-variates, stored as + a data frame. As linhas desta estrutura de dados correspondem exatamente às linhas dos dados da expressão. + +A natureza coordenada do `SummarizedExperiment` garante que +durante a manipulação dos dados, as dimensões dos diferentes compartimentos serão +sempre correspondentes (por exemplo as colunas nos dados de expressão e, em seguida, as linhas nos metadados da amostra, bem como as linhas nos dados de expressão e +metadados das variáveis) durante a manipulação dos dados. Por exemplo, se tivéssemos que +excluir uma amostra do ensaio, esta seria automaticamente removida +dos metadados da amostra na mesma operação. + +Os compartimentos de metadados podem aumentar as co-variáveis adicionais +(colunas) sem afetar as outras estruturas. 
+ +### Creating a SummarizedExperiment + +In order to create a `SummarizedExperiment`, we will create the +individual components, i.e the count matrix, the sample and gene +metadata from csv files. Normalmente, é assim que os dados de RNA-Seq são +fornecidos (depois dos dados brutos terem sido processados). + +```{r, echo=FALSE, message=FALSE} +rna <- read_csv("data/rnaseq.csv") + +## count matrix +counts <- rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) + +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() +rownames(count_matrix) <- counts$gene + +## sample annotation +sample_metadata <- rna %>% + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + +## remove redundancy +sample_metadata <- unique(sample_metadata) + +## gene annotation +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) + +# remove redundancy +gene_metadata <- unique(gene_metadata) + +## write to csv +write.csv(count_matrix, file = "data/count_matrix.csv") +write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) +write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) +``` + +- **An expression matrix**: we load the count matrix, specifying that + the first columns contains row/gene names, and convert the + `data.frame` to a `matrix`. You can download it + [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). + +```{r} +count_matrix <- read.csv("data/count_matrix.csv", + row.names = 1) %>% + as.matrix() + +count_matrix[1:5, ] +dim(count_matrix) +``` + +- **A table describing the samples**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). 
+ +```{r} +sample_metadata <- read.csv("data/sample_metadata.csv") +sample_metadata +dim(sample_metadata) +``` + +- **A table describing the genes**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). + +```{r} +gene_metadata <- read.csv("data/gene_metadata.csv") +gene_metadata[1:10, 1:4] +dim(gene_metadata) +``` + +We will create a `SummarizedExperiment` from these tables: + +- The count matrix that will be used as the **`assay`** + +- The table describing the samples will be used as the **sample + metadata** slot + +- The table describing the genes will be used as the **features + metadata** slot + +To do this we can put the different parts together using the +`SummarizedExperiment` constructor: + +```{r, message=FALSE, warning=FALSE} +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") +``` + +First, we make sure that the samples are in the same order in the +count matrix and the sample annotation, and the same for the genes in +the count matrix and the gene annotation. + +```{r} +stopifnot(rownames(count_matrix) == gene_metadata$gene) +stopifnot(colnames(count_matrix) == sample_metadata$sample) +``` + +```{r} +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) +se +``` + +### Saving data + +Exporting data to a spreadsheet, as we did in a previous episode, has +several limitations, such as those described in the first chapter +(possible inconsistencies with `,` and `.` for decimal separators and +lack of variable type definitions). Além disso, a exportação dos dados para uma spreadsheet +só é relevante para dados retangulares, tais como data.frames +e matrizes. + +Uma forma mais geral de guardar dados, que é específica do R e é +garantida para funcionar em qualquer sistema operativo, é utilizar a função `saveRDS`. 
Guardar objetos como este irá gerar uma representação binária +no disco (usando a extensão de arquivo `rds` aqui), que +pode ser carregada de volta para o R usando a função `readRDS`. + +```{r, eval=FALSE} +saveRDS(se, file = "data_output/se.rds") +rm(se) +se <- readRDS("data_output/se.rds") +head(se) +``` + +To conclude, when it comes to saving data from R that will be loaded +again in R, saving and loading with `saveRDS` and `readRDS` is the +preferred approach. Se os dados tabulares tiverem de ser partilhados com alguém +que não utilize o R, então a exportação para spreadsheet baseada em texto é uma +boa alternativa. + +Using this data structure, we can access the expression matrix with +the `assay` function: + +```{r} +head(assay(se)) +dim(assay(se)) +``` + +We can access the sample metadata using the `colData` function: + +```{r} +colData(se) +dim(colData(se)) +``` + +We can also access the feature metadata using the `rowData` function: + +```{r} +head(rowData(se)) +dim(rowData(se)) +``` + +### Subsetting a SummarizedExperiment + +SummarizedExperiment can be subset just like with data frames, with +numerics or with characters of logicals. + +Abaixo, criamos uma nova instância da classe SummarizedExperiment que contém apenas as 5 primeiras variáveis para as 3 primeiras amostras. + +```{r} +se1 <- se[1:5, 1:3] +se1 +``` + +```{r} +colData(se1) +rowData(se1) +``` + +We can also use the `colData()` function to subset on something from +the sample metadata or the `rowData()` to subset on something from the +feature metadata. 
For example, here we keep only miRNAs and the non +infected samples: + +```{r} +se1 <- se[rowData(se)$gene_biotype == "miRNA", + colData(se)$infection == "NonInfected"] +se1 +assay(se1) +colData(se1) +rowData(se1) +``` + +<!--For the following exercise, you should download the SE.rda object +(that contains the `se` object), and open the file using the 'load()' +function.--> + +<!-- ```{r, eval = FALSE, echo = FALSE} --> + +<!-- download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda", --> + +<!-- destfile = "data/SE.rda") --> + +<!-- load("data/SE.rda") --> + +<!-- ``` --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Extract the gene expression levels of the 3 first genes in samples +at time 0 and at time 8. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +assay(se)[1:3, colData(se)$time != 4] + +# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Verify that you get the same values using the long `rna` table. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +rna |> + filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> + filter(time != 4) |> select(expression) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The long table and the `SummarizedExperiment` contain the same +information, but are simply structured differently. Cada abordagem tem as suas +próprias vantagens: a primeira adequa-se bem aos pacotes `tidyverse`, +enquanto a segunda é a estrutura preferida para muitas etapas de processamento bioinformático e +estatístico. Por exemplo, uma análise típica de RNA-Seq utilizando +o pacote `DESeq2`. + +#### Adding variables to metadata + +We can also add information to the metadata. 
+Suponhamos que se pretende acrescentar o local onde as amostras foram recolhidas... + +```{r} +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se) +``` + +This illustrates that the metadata slots can grow indefinitely without +affecting the other structures! + +### tidySummarizedExperiment + +You may be wondering, can we use tidyverse commands to interact with +`SummarizedExperiment` objects? A resposta é sim, podemos fazê-lo com o pacote +`tidySummarizedExperiment`. + +Remember what our SummarizedExperiment object looks like: + +```{r, message=FALSE} +se +``` + +Load `tidySummarizedExperiment` and then take a look at the se object +again. + +```{r, message=FALSE} +#BiocManager::install("tidySummarizedExperiment") +library("tidySummarizedExperiment") + +se +``` + +It's still a `SummarizedExperiment` object, so maintains the efficient +structure, but now we can view it as a tibble. Repare que na primeira linha do output diz isto: +`SummarizedExperiment`-`tibble` +abstraction. Também podemos ver na segunda linha do output o +número de transcrições e amostras. + +Se quisermos, podemos reverter para a visualização padrão do `SummarizedExperiment`. + +```{r} +options("restore_SummarizedExperiment_show" = TRUE) +se +``` + +But here we use the tibble view. + +```{r} +options("restore_SummarizedExperiment_show" = FALSE) +se +``` + +We can now use tidyverse commands to interact with the +`SummarizedExperiment` object. + +Podemos utilizar `filter` para filtrar as linhas utilizando uma condição, por exemplo, para visualizar +todas as linhas de uma amostra. + +```{r} +se %>% filter(.sample == "GSM2545336") +``` + +We can use `select` to specify columns we want to view. + +```{r} +se %>% select(.sample) +``` + +We can use `mutate` to add metadata info. + +```{r} +se %>% mutate(center = "Heidelberg University") +``` + +We can also combine commands with the tidyverse pipe `%>%`. 
Por exemplo, poderíamos combinar `group_by` e `summarise` para obter o total de contagens para cada amostra. + +```{r} +se %>% + group_by(.sample) %>% + summarise(total_counts=sum(counts)) +``` + +We can treat the tidy SummarizedExperiment object as a normal tibble +for plotting. + +Aqui traçamos a distribuição das contagens por amostra. + +```{r tidySE-plot} +se %>% + ggplot(aes(counts + 1, group=.sample, color=infection)) + + geom_density() + + scale_x_log10() + + theme_bw() +``` + +For more information on tidySummarizedExperiment, see the package +website +[here](https://stemangiola.github.io/tidySummarizedExperiment/). + +**Take-home message** + +- `SummarizedExperiment` represents an efficient way to store and + handle omics data. + +- They are used in many Bioconductor packages. + +Se seguir a próxima formação centrada na análise de sequências de RNA, aprenderá a utilizar o pacote Bioconductor `DESeq2` para efetuar algumas análises de expressão diferencial. Toda a análise do pacote `DESeq2` +é tratada num `SummarizedExperiment`. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Bioconductor is a project provide support and packages for the + comprehension of high high-throughput biology data. +- A `SummarizedExperiment` is a type of object useful to store and + manage high-throughput omics data. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From d582c752ce7e0ab21dfd14d12eecbe88e5de7c0f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:20 +0900 Subject: [PATCH 036/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 464 +++++++++++++++++++++++++++ 1 file changed, 464 insertions(+) create mode 100644 locale/zh/episodes/60-next-steps.Rmd diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd new file mode 100644 index 000000000..77da2a8ad --- /dev/null +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -0,0 +1,464 @@ +--- +source: Rmd +title: Next steps +teaching: 45 +exercises: 45 +--- + +```{r, include=FALSE} +``` + +::::::::::::::::::::::::::::::::::::::: objectives + +- Introduce the Bioconductor project. +- Introduce the notion of data containers. +- Give an overview of the `SummarizedExperiment`, extensively used in + omics analyses. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What is a `SummarizedExperiment`? +- What is Bioconductor? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Next steps + +```{r, echo=FALSE, message=FALSE} +library("tidyverse") +``` + +Data in bioinformatics is often complex. To deal with this, +developers define specialised data containers (termed classes) that +match the properties of the data they need to handle. + +This aspect is central to the **Bioconductor**[^Bioconductor] project +which uses the same **core data infrastructure** across packages. This +certainly contributed to Bioconductor's success. Bioconductor package +developers are advised to make use of existing infrastructure to +provide coherence, interoperability, and stability to the project as a +whole. + +[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was + initiated by Robert Gentleman, one of the two creators of the R + language. 
Bioconductor provides tools dedicated to omics data + analysis. Bioconductor uses the R statistical programming language + and is open source and open development. + +To illustrate such an omics data container, we'll present the +`SummarizedExperiment` class. + +## SummarizedExperiment + +The figure below represents the anatomy of the SummarizedExperiment class. + +```{r SE, echo=FALSE, out.width="80%"} +knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") +``` + +Objects of the class SummarizedExperiment contain: + +- **One (or more) assay(s)** containing the quantitative omics data + (expression data), stored as a matrix-like object. Features (genes, + transcripts, proteins, ...) are defined along the rows, and samples + along the columns. + +- A **sample metadata** slot containing sample co-variates, stored as a + data frame. Rows from this table represent samples (rows match exactly the + columns of the expression data). + +- A **feature metadata** slot containing feature co-variates, stored as + a data frame. The rows of this data frame match exactly the rows of the + expression data. + +The coordinated nature of the `SummarizedExperiment` guarantees that +during data manipulation, the dimensions of the different slots will +always match (i.e. the columns in the expression data and the rows in +the sample metadata, as well as the rows in the expression data and +feature metadata). For example, if we had to +exclude one sample from the assay, it would be automatically removed +from the sample metadata in the same operation. + +The metadata slots can grow additional co-variates +(columns) without affecting the other structures. + +### Creating a SummarizedExperiment + +In order to create a `SummarizedExperiment`, we will create the +individual components, i.e. the count matrix and the sample and gene +metadata, from CSV files. 
This is typically how RNA-Seq data are +provided (after raw data have been processed). + +```{r, echo=FALSE, message=FALSE} +rna <- read_csv("data/rnaseq.csv") + +## count matrix +counts <- rna %>% + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) + +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() +rownames(count_matrix) <- counts$gene + +## sample annotation +sample_metadata <- rna %>% + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + +## remove redundancy +sample_metadata <- unique(sample_metadata) + +## gene annotation +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) + +## remove redundancy +gene_metadata <- unique(gene_metadata) + +## write to csv +write.csv(count_matrix, file = "data/count_matrix.csv") +write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) +write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) +``` + +- **An expression matrix**: we load the count matrix, specifying that + the first column contains row/gene names, and convert the + `data.frame` to a `matrix`. You can download it + [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). + +```{r} +count_matrix <- read.csv("data/count_matrix.csv", + row.names = 1) %>% + as.matrix() + +count_matrix[1:5, ] +dim(count_matrix) +``` + +- **A table describing the samples**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). + +```{r} +sample_metadata <- read.csv("data/sample_metadata.csv") +sample_metadata +dim(sample_metadata) +``` + +- **A table describing the genes**, available + [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). 
+ +```{r} +gene_metadata <- read.csv("data/gene_metadata.csv") +gene_metadata[1:10, 1:4] +dim(gene_metadata) +``` + +We will create a `SummarizedExperiment` from these tables: + +- The count matrix that will be used as the **`assay`** + +- The table describing the samples will be used as the **sample + metadata** slot + +- The table describing the genes will be used as the **features + metadata** slot + +To do this we can put the different parts together using the +`SummarizedExperiment` constructor: + +```{r, message=FALSE, warning=FALSE} +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") +``` + +First, we make sure that the samples are in the same order in the +count matrix and the sample annotation, and the same for the genes in +the count matrix and the gene annotation. + +```{r} +stopifnot(rownames(count_matrix) == gene_metadata$gene) +stopifnot(colnames(count_matrix) == sample_metadata$sample) +``` + +```{r} +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) +se +``` + +### Saving data + +Exporting data to a spreadsheet, as we did in a previous episode, has +several limitations, such as those described in the first chapter +(possible inconsistencies with `,` and `.` for decimal separators and +lack of variable type definitions). Furthermore, exporting data to a +spreadsheet is only relevant for rectangular data such as dataframes +and matrices. + +A more general way to save data, that is specific to R and is +guaranteed to work on any operating system, is to use the `saveRDS` +function. Saving objects like this will generate a binary +representation on disk (using the `rds` file extension here), which +can be loaded back into R using the `readRDS` function. 
+ +```{r, eval=FALSE} +saveRDS(se, file = "data_output/se.rds") +rm(se) +se <- readRDS("data_output/se.rds") +head(se) +``` + +To conclude, when it comes to saving data from R that will be loaded +again in R, saving and loading with `saveRDS` and `readRDS` is the +preferred approach. If tabular data need to be shared with somebody +who is not using R, then exporting to a text-based spreadsheet is a +good alternative. + +Using this data structure, we can access the expression matrix with +the `assay` function: + +```{r} +head(assay(se)) +dim(assay(se)) +``` + +We can access the sample metadata using the `colData` function: + +```{r} +colData(se) +dim(colData(se)) +``` + +We can also access the feature metadata using the `rowData` function: + +```{r} +head(rowData(se)) +dim(rowData(se)) +``` + +### Subsetting a SummarizedExperiment + +A SummarizedExperiment can be subset just like a data frame, with +numerics, characters, or logicals. + +Below, we create a new instance of class SummarizedExperiment that +contains only the first 5 features for the first 3 samples. + +```{r} +se1 <- se[1:5, 1:3] +se1 +``` + +```{r} +colData(se1) +rowData(se1) +``` + +We can also use the `colData()` function to subset on something from +the sample metadata or the `rowData()` to subset on something from the +feature metadata. 
For example, here we keep only miRNAs and the +non-infected samples: + +```{r} +se1 <- se[rowData(se)$gene_biotype == "miRNA", + colData(se)$infection == "NonInfected"] +se1 +assay(se1) +colData(se1) +rowData(se1) +``` + +<!--For the following exercise, you should download the SE.rda object +(that contains the `se` object), and open the file using the 'load()' +function.--> + +<!-- ```{r, eval = FALSE, echo = FALSE} --> + +<!-- download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda", --> + +<!-- destfile = "data/SE.rda") --> + +<!-- load("data/SE.rda") --> + +<!-- ``` --> + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Extract the gene expression levels of the first 3 genes in samples +at time 0 and at time 8. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +assay(se)[1:3, colData(se)$time != 4] + +# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge + +Verify that you get the same values using the long `rna` table. + +::::::::::::::: solution + +## Solution + +```{r, purl=FALSE} +rna |> + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) |> + filter(time != 4) |> select(expression) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The long table and the `SummarizedExperiment` contain the same +information, but are simply structured differently. Each approach has its +own advantages: the former is a good fit for the `tidyverse` packages, +while the latter is the preferred structure for many bioinformatics and +statistical processing steps, for example a typical RNA-Seq analysis using +the `DESeq2` package. + +#### Adding variables to metadata + +We can also add information to the metadata. 
+Suppose that you want to add the center where the samples were collected... + +```{r} +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se) +``` + +This illustrates that the metadata slots can grow indefinitely without +affecting the other structures! + +### tidySummarizedExperiment + +You may be wondering, can we use tidyverse commands to interact with +`SummarizedExperiment` objects? The answer is yes, we can with the +`tidySummarizedExperiment` package. + +Remember what our SummarizedExperiment object looks like: + +```{r, message=FALSE} +se +``` + +Load `tidySummarizedExperiment` and then take a look at the se object +again. + +```{r, message=FALSE} +#BiocManager::install("tidySummarizedExperiment") +library("tidySummarizedExperiment") + +se +``` + +It's still a `SummarizedExperiment` object, so maintains the efficient +structure, but now we can view it as a tibble. Note the first line of +the output says this, it's a `SummarizedExperiment`-`tibble` +abstraction. We can also see in the second line of the output the +number of transcripts and samples. + +If we want to revert to the standard `SummarizedExperiment` view, we +can do that. + +```{r} +options("restore_SummarizedExperiment_show" = TRUE) +se +``` + +But here we use the tibble view. + +```{r} +options("restore_SummarizedExperiment_show" = FALSE) +se +``` + +We can now use tidyverse commands to interact with the +`SummarizedExperiment` object. + +We can use `filter` to filter for rows using a condition e.g. to view +all rows for one sample. + +```{r} +se %>% filter(.sample == "GSM2545336") +``` + +We can use `select` to specify columns we want to view. + +```{r} +se %>% select(.sample) +``` + +We can use `mutate` to add metadata info. + +```{r} +se %>% mutate(center = "Heidelberg University") +``` + +We can also combine commands with the tidyverse pipe `%>%`. For +example, we could combine `group_by` and `summarise` to get the total +counts for each sample. 
+ +```{r} +se %>% + group_by(.sample) %>% + summarise(total_counts=sum(counts)) +``` + +We can treat the tidy SummarizedExperiment object as a normal tibble +for plotting. + +Here we plot the distribution of counts per sample. + +```{r tidySE-plot} +se %>% + ggplot(aes(counts + 1, group=.sample, color=infection)) + + geom_density() + + scale_x_log10() + + theme_bw() +``` + +For more information on tidySummarizedExperiment, see the package +website +[here](https://stemangiola.github.io/tidySummarizedExperiment/). + +**Take-home message** + +- `SummarizedExperiment` objects are an efficient way to store and + handle omics data. + +- They are used in many Bioconductor packages. + +If you follow the next training focused on RNA sequencing analysis, +you will learn to use the Bioconductor `DESeq2` package to do some +differential expression analyses. The whole analysis of the `DESeq2` +package is handled in a `SummarizedExperiment`. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Bioconductor is a project that provides support and packages for the + comprehension of high-throughput biology data. +- A `SummarizedExperiment` is a type of object useful to store and + manage high-throughput omics data. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From cce1405387c49e86c8c0e53ce2036f88d40d25c3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:22 +0900 Subject: [PATCH 037/334] New translations instructor-notes.md (French) --- locale/fr/instructors/instructor-notes.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/fr/instructors/instructor-notes.md diff --git a/locale/fr/instructors/instructor-notes.md b/locale/fr/instructors/instructor-notes.md new file mode 100644 index 000000000..a5ec5a2dc --- /dev/null +++ b/locale/fr/instructors/instructor-notes.md @@ -0,0 +1,5 @@ +--- +title: Instructor Notes +--- + +FIXME From a274c553ae5cc60fc7209d3db066dd1ec6045b6c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:22 +0900 Subject: [PATCH 038/334] New translations instructor-notes.md (Spanish) --- locale/es/instructors/instructor-notes.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/es/instructors/instructor-notes.md diff --git a/locale/es/instructors/instructor-notes.md b/locale/es/instructors/instructor-notes.md new file mode 100644 index 000000000..a5ec5a2dc --- /dev/null +++ b/locale/es/instructors/instructor-notes.md @@ -0,0 +1,5 @@ +--- +title: Instructor Notes +--- + +FIXME From 8ea1e4d3f25d0b5b50e087a6ce77b9e6c9e04acc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:24 +0900 Subject: [PATCH 039/334] New translations instructor-notes.md (Japanese) --- locale/ja/instructors/instructor-notes.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/ja/instructors/instructor-notes.md diff --git a/locale/ja/instructors/instructor-notes.md b/locale/ja/instructors/instructor-notes.md new file mode 100644 index 000000000..a5ec5a2dc --- /dev/null +++ b/locale/ja/instructors/instructor-notes.md @@ -0,0 +1,5 @@ +--- +title: Instructor Notes +--- + +FIXME From 
73d4358b6583324374477b6851dcca7878c123bd Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:25 +0900 Subject: [PATCH 040/334] New translations instructor-notes.md (Portuguese) --- locale/pt/instructors/instructor-notes.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/pt/instructors/instructor-notes.md diff --git a/locale/pt/instructors/instructor-notes.md b/locale/pt/instructors/instructor-notes.md new file mode 100644 index 000000000..a5ec5a2dc --- /dev/null +++ b/locale/pt/instructors/instructor-notes.md @@ -0,0 +1,5 @@ +--- +title: Instructor Notes +--- + +FIXME From d96e8cf14f96c85fea8590496d3045c7cbd45190 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:25 +0900 Subject: [PATCH 041/334] New translations instructor-notes.md (Chinese Simplified) --- locale/zh/instructors/instructor-notes.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/zh/instructors/instructor-notes.md diff --git a/locale/zh/instructors/instructor-notes.md b/locale/zh/instructors/instructor-notes.md new file mode 100644 index 000000000..a5ec5a2dc --- /dev/null +++ b/locale/zh/instructors/instructor-notes.md @@ -0,0 +1,5 @@ +--- +title: Instructor Notes +--- + +FIXME From 7b70cb9d1277ccb02ce43ce94f944b6a3e257907 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:26 +0900 Subject: [PATCH 042/334] New translations discuss.md (French) --- locale/fr/learners/discuss.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/fr/learners/discuss.md diff --git a/locale/fr/learners/discuss.md b/locale/fr/learners/discuss.md new file mode 100644 index 000000000..405883d41 --- /dev/null +++ b/locale/fr/learners/discuss.md @@ -0,0 +1,5 @@ +--- +title: Discussion +--- + +FIXME From 86fa95be8ac9c2555babcd9fec334f3c207b92f0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 
06:11:27 +0900 Subject: [PATCH 043/334] New translations discuss.md (Spanish) --- locale/es/learners/discuss.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/es/learners/discuss.md diff --git a/locale/es/learners/discuss.md b/locale/es/learners/discuss.md new file mode 100644 index 000000000..405883d41 --- /dev/null +++ b/locale/es/learners/discuss.md @@ -0,0 +1,5 @@ +--- +title: Discussion +--- + +FIXME From 9bf869a6ef2b6159293b7c88ad5d14edb310e2c5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:28 +0900 Subject: [PATCH 044/334] New translations discuss.md (Japanese) --- locale/ja/learners/discuss.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/ja/learners/discuss.md diff --git a/locale/ja/learners/discuss.md b/locale/ja/learners/discuss.md new file mode 100644 index 000000000..405883d41 --- /dev/null +++ b/locale/ja/learners/discuss.md @@ -0,0 +1,5 @@ +--- +title: Discussion +--- + +FIXME From 66c2ef141659e8159885071d3c0c95e8d513d6a2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:30 +0900 Subject: [PATCH 045/334] New translations discuss.md (Portuguese) --- locale/pt/learners/discuss.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/pt/learners/discuss.md diff --git a/locale/pt/learners/discuss.md b/locale/pt/learners/discuss.md new file mode 100644 index 000000000..405883d41 --- /dev/null +++ b/locale/pt/learners/discuss.md @@ -0,0 +1,5 @@ +--- +title: Discussion +--- + +FIXME From 6d8e6e5ef42d76238d34b5549b15a5d71c82e487 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:31 +0900 Subject: [PATCH 046/334] New translations discuss.md (Chinese Simplified) --- locale/zh/learners/discuss.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/zh/learners/discuss.md diff --git a/locale/zh/learners/discuss.md b/locale/zh/learners/discuss.md 
new file mode 100644 index 000000000..405883d41 --- /dev/null +++ b/locale/zh/learners/discuss.md @@ -0,0 +1,5 @@ +--- +title: Discussion +--- + +FIXME From 9f8dc429612fdbc107b5ee022094adce56ec7d41 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:32 +0900 Subject: [PATCH 047/334] New translations reference.md (French) --- locale/fr/learners/reference.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 locale/fr/learners/reference.md diff --git a/locale/fr/learners/reference.md b/locale/fr/learners/reference.md new file mode 100644 index 000000000..91bab9733 --- /dev/null +++ b/locale/fr/learners/reference.md @@ -0,0 +1,7 @@ +--- +{} +--- + +## Glossary + +FIXME From c6379556feb1df0c6c0a5077d53221b66bc4c553 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:33 +0900 Subject: [PATCH 048/334] New translations reference.md (Spanish) --- locale/es/learners/reference.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 locale/es/learners/reference.md diff --git a/locale/es/learners/reference.md b/locale/es/learners/reference.md new file mode 100644 index 000000000..91bab9733 --- /dev/null +++ b/locale/es/learners/reference.md @@ -0,0 +1,7 @@ +--- +{} +--- + +## Glossary + +FIXME From 7924aefd32f894047ecf1b7984892372ebe3fb8d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:34 +0900 Subject: [PATCH 049/334] New translations reference.md (Japanese) --- locale/ja/learners/reference.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 locale/ja/learners/reference.md diff --git a/locale/ja/learners/reference.md b/locale/ja/learners/reference.md new file mode 100644 index 000000000..91bab9733 --- /dev/null +++ b/locale/ja/learners/reference.md @@ -0,0 +1,7 @@ +--- +{} +--- + +## Glossary + +FIXME From 680a0d9436b6948e72e222e48981ac507f000f4a Mon Sep 17 00:00:00 2001 From: Kozo Nishida 
<kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:35 +0900 Subject: [PATCH 050/334] New translations reference.md (Portuguese) --- locale/pt/learners/reference.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 locale/pt/learners/reference.md diff --git a/locale/pt/learners/reference.md b/locale/pt/learners/reference.md new file mode 100644 index 000000000..91bab9733 --- /dev/null +++ b/locale/pt/learners/reference.md @@ -0,0 +1,7 @@ +--- +{} +--- + +## Glossary + +FIXME From 3a4698db19d341c008fde41c29a6baebb10bfff7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:36 +0900 Subject: [PATCH 051/334] New translations reference.md (Chinese Simplified) --- locale/zh/learners/reference.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 locale/zh/learners/reference.md diff --git a/locale/zh/learners/reference.md b/locale/zh/learners/reference.md new file mode 100644 index 000000000..91bab9733 --- /dev/null +++ b/locale/zh/learners/reference.md @@ -0,0 +1,7 @@ +--- +{} +--- + +## Glossary + +FIXME From e81cdab887ac90051f0d7df7c83b80b57651d864 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:37 +0900 Subject: [PATCH 052/334] New translations setup.md (French) --- locale/fr/learners/setup.md | 158 ++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 locale/fr/learners/setup.md diff --git a/locale/fr/learners/setup.md b/locale/fr/learners/setup.md new file mode 100644 index 000000000..2c33990f5 --- /dev/null +++ b/locale/fr/learners/setup.md @@ -0,0 +1,158 @@ +--- +title: Setup +--- + +- Please make sure you have a spreadsheet editor at hand, such as + LibreOffice, Microsoft Excel or Google Sheets. + +- Install R, RStudio and packages (see below). + +### R and RStudio + +- R and RStudio are separate downloads and installations. 
R is the + underlying statistical computing environment, but using R alone is + no fun. RStudio is a graphical integrated development environment + (IDE) that makes using R much easier and more interactive. You need + to install R before you install RStudio. After installing both + programs, you will need to install some specific R packages within + RStudio. Follow the instructions below for your operating system, + and then follow the instructions to install packages. + +### You are running Windows + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check which version of R you are using, start RStudio and the first thing + that appears in the console indicates the version of R you are + running. Alternatively, you can type `sessionInfo()`, which will also display + which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check + whether a more recent version is available. If so, please download and install + it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for + more information on how to remove old versions from your system if you wish to do so. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). 
+ +- Run the `.exe` file that was just downloaded + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) + +- Double click the file to install it + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running macOS + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check the version of R you are using, start RStudio and the first thing + that appears on the terminal indicates the version of R you are running. Alternatively, you can type `sessionInfo()`, which will + also display which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/macosx/) and check + whether a more recent version is available. If so, please download and install + it. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/macosx/). 
+ +- Select the `.pkg` file for the latest R version + +- Double click on the downloaded file to install R + +- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed + by some packages) + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) + +- Double click the file to install RStudio + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running Linux + +<br> + +::::::::::::::: solution + +## Install R using your package manager and RStudio + +- Follow the instructions for your distribution + from [CRAN](https://cloud.r-project.org/bin/linux), they provide information + to get the most recent version of R for common distributions. For most + distributions, you could use your package manager (e.g., for Debian/Ubuntu run + `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we + don't recommend this approach as the versions provided by this are + usually out of date. In any case, make sure you have at least R 4.2.0. +- Go to the RStudio download + page +- Under _All Installers_ select the version that matches your distribution, and + install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. +- Follow the steps in the [instructions for everyone](#for-everyone) + +::::::::::::::::::::::::: + +### For everyone + +After installing R and RStudio, you need to install a couple of +packages that will be used during the workshop. 
We will also learn +about package installation during the course to explain the following +commands. For now, simply follow the instructions below: + +- Start RStudio by double-clicking the icon and then type: + +```r +install.packages(c("BiocManager", "remotes")) +BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", + "patchwork", "gridExtra", "lubridate")) +``` From eaff3b9a87fbf5095b77a87904ad601b3574ede8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:38 +0900 Subject: [PATCH 053/334] New translations setup.md (Spanish) --- locale/es/learners/setup.md | 158 ++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 locale/es/learners/setup.md diff --git a/locale/es/learners/setup.md b/locale/es/learners/setup.md new file mode 100644 index 000000000..2c33990f5 --- /dev/null +++ b/locale/es/learners/setup.md @@ -0,0 +1,158 @@ +--- +title: Setup +--- + +- Please make sure you have a spreadsheet editor at hand, such as + LibreOffice, Microsoft Excel or Google Sheets. + +- Install R, RStudio and packages (see below). + +### R and RStudio + +- R and RStudio are separate downloads and installations. R is the + underlying statistical computing environment, but using R alone is + no fun. RStudio is a graphical integrated development environment + (IDE) that makes using R much easier and more interactive. You need + to install R before you install RStudio. After installing both + programs, you will need to install some specific R packages within + RStudio. Follow the instructions below for your operating system, + and then follow the instructions to install packages. + +### You are running Windows + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. 
+ +- To check which version of R you are using, start RStudio and the first thing + that appears in the console indicates the version of R you are + running. Alternatively, you can type `sessionInfo()`, which will also display + which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check + whether a more recent version is available. If so, please download and install + it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for + more information on how to remove old versions from your system if you wish to do so. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). + +- Run the `.exe` file that was just downloaded + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) + +- Double click the file to install it + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running macOS + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check the version of R you are using, start RStudio and the first thing + that appears on the terminal indicates the version of R you are running. 
Alternatively, you can type `sessionInfo()`, which will + also display which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/macosx/) and check + whether a more recent version is available. If so, please download and install + it. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/macosx/). + +- Select the `.pkg` file for the latest R version + +- Double click on the downloaded file to install R + +- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed + by some packages) + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) + +- Double click the file to install RStudio + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running Linux + +<br> + +::::::::::::::: solution + +## Install R using your package manager and RStudio + +- Follow the instructions for your distribution + from [CRAN](https://cloud.r-project.org/bin/linux), they provide information + to get the most recent version of R for common distributions. For most + distributions, you could use your package manager (e.g., for Debian/Ubuntu run + `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we + don't recommend this approach as the versions provided by this are + usually out of date. In any case, make sure you have at least R 4.2.0. 
+- Go to the RStudio download + page +- Under _All Installers_ select the version that matches your distribution, and + install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. +- Follow the steps in the [instructions for everyone](#for-everyone) + +::::::::::::::::::::::::: + +### For everyone + +After installing R and RStudio, you need to install a couple of +packages that will be used during the workshop. We will also learn +about package installation during the course to explain the following +commands. For now, simply follow the instructions below: + +- Start RStudio by double-clicking the icon and then type: + +```r +install.packages(c("BiocManager", "remotes")) +BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", + "patchwork", "gridExtra", "lubridate")) +``` From 149a734c1a7fbd8467319c9707364573849e6a13 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:39 +0900 Subject: [PATCH 054/334] New translations setup.md (Japanese) --- locale/ja/learners/setup.md | 158 ++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 locale/ja/learners/setup.md diff --git a/locale/ja/learners/setup.md b/locale/ja/learners/setup.md new file mode 100644 index 000000000..f962956ed --- /dev/null +++ b/locale/ja/learners/setup.md @@ -0,0 +1,158 @@ +--- +title: Setup +--- + +- Please make sure you have a spreadsheet editor at hand, such as + LibreOffice, Microsoft Excel or Google Sheets. + +- Install R, RStudio and packages (see below). + +### R and RStudio + +- RとRStudioは別々にダウンロード、インストールする。 R is the + underlying statistical computing environment, but using R alone is + no fun. RStudio is a graphical integrated development environment + (IDE) that makes using R much easier and more interactive. 
You need + to install R before you install RStudio. After installing both + programs, you will need to install some specific R packages within + RStudio. Follow the instructions below for your operating system, + and then follow the instructions to install packages. + +### You are running Windows + +<br> + +::::::::::::::: solution + +## すでにRとRStudioがインストールされている場合 + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check which version of R you are using, start RStudio and the first thing + that appears in the console indicates the version of R you are + running. Alternatively, you can type `sessionInfo()`, which will also display + which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check + whether a more recent version is available. If so, please download and install + it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for + more information on how to remove old versions from your system if you wish to do so. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). + +- Run the `.exe` file that was just downloaded + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) + +- ファイルをダブルクリックしてインストールする + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. 
+ +::::::::::::::::::::::::: + +### You are running macOS + +<br> + +::::::::::::::: solution + +## すでにRとRStudioがインストールされている場合 + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check the version of R you are using, start RStudio and the first thing + that appears on the terminal indicates the version of R you are running. Alternatively, you can type `sessionInfo()`, which will + also display which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/macosx/) and check + whether a more recent version is available. If so, please download and install + it. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/macosx/). + +- Select the `.pkg` file for the latest R version + +- Double click on the downloaded file to install R + +- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed + by some packages) + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) + +- Double click the file to install RStudio + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. 
+ +::::::::::::::::::::::::: + +### You are running Linux + +<br> + +::::::::::::::: solution + +## Install R using your package manager and RStudio + +- Follow the instructions for your distribution + from [CRAN](https://cloud.r-project.org/bin/linux), they provide information + to get the most recent version of R for common distributions. For most + distributions, you could use your package manager (e.g., for Debian/Ubuntu run + `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we + don't recommend this approach as the versions provided by this are + usually out of date. In any case, make sure you have at least R 4.2.0. +- Go to the RStudio download + page +- Under _All Installers_ select the version that matches your distribution, and + install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. +- Follow the steps in the [instructions for everyone](#for-everyone) + +::::::::::::::::::::::::: + +### For everyone + +After installing R and RStudio, you need to install a couple of +packages that will be used during the workshop. We will also learn +about package installation during the course to explain the following +commands. 
For now, simply follow the instructions below: + +- Start RStudio by double-clicking the icon and then type: + +```r +install.packages(c("BiocManager", "remotes")) +BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", + "patchwork", "gridExtra", "lubridate")) +``` From 84b24ef36e4fde116231b5f89f24e7a57dd37181 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:41 +0900 Subject: [PATCH 055/334] New translations setup.md (Portuguese) --- locale/pt/learners/setup.md | 158 ++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 locale/pt/learners/setup.md diff --git a/locale/pt/learners/setup.md b/locale/pt/learners/setup.md new file mode 100644 index 000000000..2c33990f5 --- /dev/null +++ b/locale/pt/learners/setup.md @@ -0,0 +1,158 @@ +--- +title: Setup +--- + +- Please make sure you have a spreadsheet editor at hand, such as + LibreOffice, Microsoft Excel or Google Sheets. + +- Install R, RStudio and packages (see below). + +### R and RStudio + +- R and RStudio are separate downloads and installations. R is the + underlying statistical computing environment, but using R alone is + no fun. RStudio is a graphical integrated development environment + (IDE) that makes using R much easier and more interactive. You need + to install R before you install RStudio. After installing both + programs, you will need to install some specific R packages within + RStudio. Follow the instructions below for your operating system, + and then follow the instructions to install packages. + +### You are running Windows + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. 
+ +- To check which version of R you are using, start RStudio and the first thing + that appears in the console indicates the version of R you are + running. Alternatively, you can type `sessionInfo()`, which will also display + which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check + whether a more recent version is available. If so, please download and install + it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for + more information on how to remove old versions from your system if you wish to do so. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). + +- Run the `.exe` file that was just downloaded + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) + +- Double click the file to install it + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running macOS + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check the version of R you are using, start RStudio and the first thing + that appears on the terminal indicates the version of R you are running. 
Alternatively, you can type `sessionInfo()`, which will + also display which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/macosx/) and check + whether a more recent version is available. If so, please download and install + it. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/macosx/). + +- Select the `.pkg` file for the latest R version + +- Double click on the downloaded file to install R + +- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed + by some packages) + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) + +- Double click the file to install RStudio + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running Linux + +<br> + +::::::::::::::: solution + +## Install R using your package manager and RStudio + +- Follow the instructions for your distribution + from [CRAN](https://cloud.r-project.org/bin/linux), they provide information + to get the most recent version of R for common distributions. For most + distributions, you could use your package manager (e.g., for Debian/Ubuntu run + `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we + don't recommend this approach as the versions provided by this are + usually out of date. In any case, make sure you have at least R 4.2.0. 
+- Go to the RStudio download + page +- Under _All Installers_ select the version that matches your distribution, and + install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. +- Follow the steps in the [instructions for everyone](#for-everyone) + +::::::::::::::::::::::::: + +### For everyone + +After installing R and RStudio, you need to install a couple of +packages that will be used during the workshop. We will also learn +about package installation during the course to explain the following +commands. For now, simply follow the instructions below: + +- Start RStudio by double-clicking the icon and then type: + +```r +install.packages(c("BiocManager", "remotes")) +BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", + "patchwork", "gridExtra", "lubridate")) +``` From 1fffdfb108cb19e1dc249f952ccb4dab67504cf9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:42 +0900 Subject: [PATCH 056/334] New translations setup.md (Chinese Simplified) --- locale/zh/learners/setup.md | 158 ++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 locale/zh/learners/setup.md diff --git a/locale/zh/learners/setup.md b/locale/zh/learners/setup.md new file mode 100644 index 000000000..2c33990f5 --- /dev/null +++ b/locale/zh/learners/setup.md @@ -0,0 +1,158 @@ +--- +title: Setup +--- + +- Please make sure you have a spreadsheet editor at hand, such as + LibreOffice, Microsoft Excel or Google Sheets. + +- Install R, RStudio and packages (see below). + +### R and RStudio + +- R and RStudio are separate downloads and installations. R is the + underlying statistical computing environment, but using R alone is + no fun. 
RStudio is a graphical integrated development environment + (IDE) that makes using R much easier and more interactive. You need + to install R before you install RStudio. After installing both + programs, you will need to install some specific R packages within + RStudio. Follow the instructions below for your operating system, + and then follow the instructions to install packages. + +### You are running Windows + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check which version of R you are using, start RStudio and the first thing + that appears in the console indicates the version of R you are + running. Alternatively, you can type `sessionInfo()`, which will also display + which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check + whether a more recent version is available. If so, please download and install + it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for + more information on how to remove old versions from your system if you wish to do so. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). 
+ +- Run the `.exe` file that was just downloaded + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) + +- Double click the file to install it + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running macOS + +<br> + +::::::::::::::: solution + +## If you already have R and RStudio installed + +- Open RStudio, and click on "Help" > "Check for updates". If a new version is + available, quit RStudio, and download the latest version for RStudio. + +- To check the version of R you are using, start RStudio and the first thing + that appears on the terminal indicates the version of R you are running. Alternatively, you can type `sessionInfo()`, which will + also display which version of R you are running. Go on + the [CRAN website](https://cran.r-project.org/bin/macosx/) and check + whether a more recent version is available. If so, please download and install + it. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +## If you don't have R and RStudio installed + +- Download R from + the [CRAN website](https://cran.r-project.org/bin/macosx/). 
+ +- Select the `.pkg` file for the latest R version + +- Double click on the downloaded file to install R + +- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed + by some packages) + +- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) + +- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) + +- Double click the file to install RStudio + +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. + +- Follow the steps in the instructions [for everyone](#for-everyone) at the + bottom of this page. + +::::::::::::::::::::::::: + +### You are running Linux + +<br> + +::::::::::::::: solution + +## Install R using your package manager and RStudio + +- Follow the instructions for your distribution + from [CRAN](https://cloud.r-project.org/bin/linux), they provide information + to get the most recent version of R for common distributions. For most + distributions, you could use your package manager (e.g., for Debian/Ubuntu run + `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we + don't recommend this approach as the versions provided by this are + usually out of date. In any case, make sure you have at least R 4.2.0. +- Go to the RStudio download + page +- Under _All Installers_ select the version that matches your distribution, and + install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). +- Once it's installed, open RStudio to make sure it works and you don't get any + error messages. +- Follow the steps in the [instructions for everyone](#for-everyone) + +::::::::::::::::::::::::: + +### For everyone + +After installing R and RStudio, you need to install a couple of +packages that will be used during the workshop. 
We will also learn +about package installation during the course to explain the following +commands. For now, simply follow the instructions below: + +- Start RStudio by double-clicking the icon and then type: + +```r +install.packages(c("BiocManager", "remotes")) +BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", + "patchwork", "gridExtra", "lubridate")) +``` From 492b2adffb4ed94e39c65ec6385f9662517eef3c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:43 +0900 Subject: [PATCH 057/334] New translations learner-profiles.md (French) --- locale/fr/profiles/learner-profiles.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/fr/profiles/learner-profiles.md diff --git a/locale/fr/profiles/learner-profiles.md b/locale/fr/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/fr/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. From 6c3a489240be6b7ba64585063db4c15f2d4c4a28 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:44 +0900 Subject: [PATCH 058/334] New translations learner-profiles.md (Spanish) --- locale/es/profiles/learner-profiles.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/es/profiles/learner-profiles.md diff --git a/locale/es/profiles/learner-profiles.md b/locale/es/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/es/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. 
From 438e85e8761574f01e989183180d8679ae87bdfe Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:45 +0900 Subject: [PATCH 059/334] New translations learner-profiles.md (Japanese) --- locale/ja/profiles/learner-profiles.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/ja/profiles/learner-profiles.md diff --git a/locale/ja/profiles/learner-profiles.md b/locale/ja/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/ja/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. From cecb486fc797e1e4ad9ed0759d7740352df18665 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:46 +0900 Subject: [PATCH 060/334] New translations learner-profiles.md (Portuguese) --- locale/pt/profiles/learner-profiles.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/pt/profiles/learner-profiles.md diff --git a/locale/pt/profiles/learner-profiles.md b/locale/pt/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/pt/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. From 98732821e93f82f68807dc37a7515b4373a102e5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:47 +0900 Subject: [PATCH 061/334] New translations learner-profiles.md (Chinese Simplified) --- locale/zh/profiles/learner-profiles.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 locale/zh/profiles/learner-profiles.md diff --git a/locale/zh/profiles/learner-profiles.md b/locale/zh/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/zh/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. 
Please add content here. From c5318ad3bb7e3ad9799db7db5121f4db03575c2b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:48 +0900 Subject: [PATCH 062/334] New translations code_of_conduct.md (French) --- locale/fr/CODE_OF_CONDUCT.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 locale/fr/CODE_OF_CONDUCT.md diff --git a/locale/fr/CODE_OF_CONDUCT.md b/locale/fr/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..11895988e --- /dev/null +++ b/locale/fr/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html + +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From d26028c18c60f97155f24b4af8412f696bc0d933 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:49 +0900 Subject: [PATCH 063/334] New translations code_of_conduct.md (Spanish) --- locale/es/CODE_OF_CONDUCT.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 locale/es/CODE_OF_CONDUCT.md diff --git a/locale/es/CODE_OF_CONDUCT.md b/locale/es/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..11895988e --- /dev/null +++ b/locale/es/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. 
+ +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html + +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 9d7f5e379a435f5c3007e4ce0790646e11634561 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:50 +0900 Subject: [PATCH 064/334] New translations code_of_conduct.md (Japanese) --- locale/ja/CODE_OF_CONDUCT.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 locale/ja/CODE_OF_CONDUCT.md diff --git a/locale/ja/CODE_OF_CONDUCT.md b/locale/ja/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..11895988e --- /dev/null +++ b/locale/ja/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html + +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 863ea134b226b2618346d20818cca8344d85782e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:51 +0900 Subject: [PATCH 065/334] New translations code_of_conduct.md (Portuguese) --- locale/pt/CODE_OF_CONDUCT.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 locale/pt/CODE_OF_CONDUCT.md diff --git a/locale/pt/CODE_OF_CONDUCT.md b/locale/pt/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..11895988e --- /dev/null +++ b/locale/pt/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. 
+ +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html + +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 33116ff72da1a7297b30cfa87a10f12c1f8c6192 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:52 +0900 Subject: [PATCH 066/334] New translations code_of_conduct.md (Chinese Simplified) --- locale/zh/CODE_OF_CONDUCT.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 locale/zh/CODE_OF_CONDUCT.md diff --git a/locale/zh/CODE_OF_CONDUCT.md b/locale/zh/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..11895988e --- /dev/null +++ b/locale/zh/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html + +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 081e0ba66fdae51a9dd559deeb2cfb84950e9bc1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:54 +0900 Subject: [PATCH 067/334] New translations config.yaml (French) --- locale/fr/config.yaml | 61 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 locale/fr/config.yaml diff --git a/locale/fr/config.yaml b/locale/fr/config.yaml new file mode 100644 index 000000000..204cb59c5 --- /dev/null +++ b/locale/fr/config.yaml @@ -0,0 +1,61 @@ +#------------------------------------------------------------ +#Values for this lesson. 
+#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'incubator' +#Overall title for pages. +title: 'Introduction to data analysis with R and Bioconductor' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2020-09-14' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/bioc-intro' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'laurent.gatto@uclouvain.be' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 10-data-organisation.Rmd + - 20-r-rstudio.Rmd + - 23-starting-with-r.Rmd + - 25-starting-with-data.Rmd + - 30-dplyr.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://carpentries-incubator.github.io/bioc-intro' From 8cec0b62ec42b66381d61e7f17b50623eeabaf09 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:55 +0900 Subject: [PATCH 068/334] New translations config.yaml (Spanish) --- locale/es/config.yaml | 61 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 locale/es/config.yaml diff --git a/locale/es/config.yaml b/locale/es/config.yaml new file mode 100644 index 000000000..204cb59c5 --- /dev/null +++ b/locale/es/config.yaml @@ -0,0 +1,61 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'incubator' +#Overall title for pages. +title: 'Introduction to data analysis with R and Bioconductor' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2020-09-14' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/bioc-intro' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'laurent.gatto@uclouvain.be' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. 
+#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 10-data-organisation.Rmd + - 20-r-rstudio.Rmd + - 23-starting-with-r.Rmd + - 25-starting-with-data.Rmd + - 30-dplyr.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. pinning +#sandpaper and varnish versions) should live +url: 'https://carpentries-incubator.github.io/bioc-intro' From 5cb42a0a8b8ae4c8e995e7026bde565c8e9c110a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:56 +0900 Subject: [PATCH 069/334] New translations config.yaml (Japanese) --- locale/ja/config.yaml | 61 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 locale/ja/config.yaml diff --git a/locale/ja/config.yaml b/locale/ja/config.yaml new file mode 100644 index 000000000..204cb59c5 --- /dev/null +++ b/locale/ja/config.yaml @@ -0,0 +1,61 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'incubator' +#Overall title for pages. 
+title: 'Introduction to data analysis with R and Bioconductor' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2020-09-14' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/bioc-intro' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'laurent.gatto@uclouvain.be' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 10-data-organisation.Rmd + - 20-r-rstudio.Rmd + - 23-starting-with-r.Rmd + - 25-starting-with-data.Rmd + - 30-dplyr.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://carpentries-incubator.github.io/bioc-intro' From 4677a459b3b06aa5a3a2a92718d574462b402a3c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:57 +0900 Subject: [PATCH 070/334] New translations config.yaml (Portuguese) --- locale/pt/config.yaml | 61 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 locale/pt/config.yaml diff --git a/locale/pt/config.yaml b/locale/pt/config.yaml new file mode 100644 index 000000000..204cb59c5 --- /dev/null +++ b/locale/pt/config.yaml @@ -0,0 +1,61 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'incubator' +#Overall title for pages. +title: 'Introduction to data analysis with R and Bioconductor' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2020-09-14' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/bioc-intro' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'laurent.gatto@uclouvain.be' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. 
+#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 10-data-organisation.Rmd + - 20-r-rstudio.Rmd + - 23-starting-with-r.Rmd + - 25-starting-with-data.Rmd + - 30-dplyr.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. pinning +#sandpaper and varnish versions) should live +url: 'https://carpentries-incubator.github.io/bioc-intro' From 188475df27b70d2c6955f392d7156a66895bac05 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:58 +0900 Subject: [PATCH 071/334] New translations config.yaml (Chinese Simplified) --- locale/zh/config.yaml | 61 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 locale/zh/config.yaml diff --git a/locale/zh/config.yaml b/locale/zh/config.yaml new file mode 100644 index 000000000..204cb59c5 --- /dev/null +++ b/locale/zh/config.yaml @@ -0,0 +1,61 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'incubator' +#Overall title for pages. 
+title: 'Introduction to data analysis with R and Bioconductor' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2020-09-14' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/bioc-intro' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'laurent.gatto@uclouvain.be' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 10-data-organisation.Rmd + - 20-r-rstudio.Rmd + - 23-starting-with-r.Rmd + - 25-starting-with-data.Rmd + - 30-dplyr.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://carpentries-incubator.github.io/bioc-intro' From cf4a691510f7ce57221640045ecd8363ef7bf195 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:11:59 +0900 Subject: [PATCH 072/334] New translations contributing.md (French) --- locale/fr/CONTRIBUTING.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 locale/fr/CONTRIBUTING.md diff --git a/locale/fr/CONTRIBUTING.md b/locale/fr/CONTRIBUTING.md new file mode 100644 index 000000000..e80f40421 --- /dev/null +++ b/locale/fr/CONTRIBUTING.md @@ -0,0 +1,164 @@ +# Contributing + +[Software Carpentry][swc-site] and [Data Carpentry][dc-site] are open source projects, +and we welcome contributions of all kinds: +new lessons, +fixes to existing material, +bug reports, +and reviews of proposed changes are all welcome. + +## Contributor Agreement + +By contributing, +you agree that we may redistribute your work under [our license](LICENSE.md). +In exchange, +we will address your issues and/or assess your change proposal as promptly as we can, +and help you become a member of our community. +Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +agrees to abide by our [code of conduct](CONDUCT.md). + +## How to Contribute + +The easiest way to get started is to file an issue +to tell us about a spelling mistake, +some awkward wording, +or a factual error. +This is a good way to introduce yourself +and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, + you can [send us comments by email][contact]. + However, + we will be able to respond more quickly if you use one of the other methods described below. + +2. If you have a [GitHub][github] account, + or are willing to [create one][github-join], + but do not know how to use Git, + you can report problems or suggest improvements by [creating an issue][issues]. 
+ This allows us to assign the item to someone + and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, + and would like to add or change material, + you can submit a pull request (PR). + Instructions for doing this are [included below](#using-github). + +## Where to Contribute + +1. If you wish to change this lesson, + please work in https\://github.com/swcarpentry/shell-novice, + which can be viewed at https\://swcarpentry.github.io/shell-novice. + +2. If you wish to change the example lesson, + please work in https\://github.com/carpentries/lesson-example, + which documents the format of our lessons + and can be viewed at https\://carpentries.github.io/lesson-example. + +3. If you wish to change the template used for workshop websites, + please work in https\://github.com/carpentries/workshop-template. + The home page of that repository explains how to set up workshop websites, + while the extra pages in https\://carpentries.github.io/workshop-template + provide more background on our design choices. + +4. If you wish to change CSS style files, tools, + or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, + please work in https\://github.com/carpentries/styles. + +## What to Contribute + +There are many ways to contribute, +from writing new exercises and improving existing ones +to updating or filling in the documentation +and submitting [bug reports][issues] +about things that don't work, aren't clear, or are missing. +If you are looking for ideas, +please see [the list of issues for this repository][issues], +or the issues for [Data Carpentry][dc-issues] +and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: +we are smarter together than we are on our own. 
+Reviews from novices and newcomers are particularly valuable: +it's easy for people who have been using these lessons for a while +to forget how impenetrable some of this material can be, +so fresh eyes are always welcome. + +## What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical workshop, +so we are usually _not_ looking for more concepts or tools to add to them. +As a rule, +if you want to introduce a new idea, +you must (a) estimate how long it will take to teach +and (b) explain what you would take out to make room for it. +The first encourages contributors to be honest about requirements; +the second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one platform. +Our workshops typically contain a mixture of Windows, macOS, and Linux users; +in order to be usable, +our lessons must run equally well on all three. + +## Using GitHub + +If you choose to contribute via GitHub, +you may want to look at +[How to Contribute to an Open Source Project on GitHub][how-contribute]. +In brief: + +1. The published copy of the lesson is in the `gh-pages` branch of the repository + (so that GitHub will regenerate it automatically). + Please create all branches from that, + and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch + before starting work. + Please do _not_ work directly in your `gh-pages` branch, + since that will make it difficult for you to work on other contributions. + +2. We use [GitHub flow][github-flow] to manage changes: + 1. Create a new branch in your desktop copy of this repository for each significant change. + 2. Commit the change in that branch. + 3. Push that branch to your fork of this repository on GitHub. + 4. Submit a pull request from that branch to the [master repository][repo]. + 5. 
If you receive feedback, + make changes on your desktop and push to your branch on GitHub: + the pull request will update automatically. + +Each lesson has two maintainers who review issues and pull requests +or encourage others to do so. +The maintainers are community volunteers, +and have final say over what gets merged into the lesson. + +## Other Resources + +General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +happens on the [discussion mailing list][discuss-list], +which everyone is welcome to join. +You can also [reach us by email][contact]. + +[contact]: mailto:admin@software-carpentry.org + +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry + +[dc-lessons]: http://datacarpentry.org/lessons/ + +[dc-site]: http://datacarpentry.org/ + +[discuss-list]: http://lists.software-carpentry.org/listinfo/discuss + +[github]: http://github.com + +[github-flow]: https://guides.github.com/introduction/flow/ + +[github-join]: https://github.com/join + +[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github + +[issues]: https://github.com/swcarpentry/shell-novice/issues/ + +[repo]: https://github.com/swcarpentry/shell-novice/ + +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry + +[swc-lessons]: http://software-carpentry.org/lessons/ + +[swc-site]: http://software-carpentry.org/ From 96f9e06238af68839eff4ad098951e4514fbd942 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:00 +0900 Subject: [PATCH 073/334] New translations contributing.md (Spanish) --- locale/es/CONTRIBUTING.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 locale/es/CONTRIBUTING.md diff --git a/locale/es/CONTRIBUTING.md b/locale/es/CONTRIBUTING.md new file mode 100644 index 000000000..e80f40421 --- /dev/null +++ b/locale/es/CONTRIBUTING.md @@ -0,0 +1,164 @@ +# Contributing + +[Software Carpentry][swc-site] and 
[Data Carpentry][dc-site] are open source projects, +and we welcome contributions of all kinds: +new lessons, +fixes to existing material, +bug reports, +and reviews of proposed changes are all welcome. + +## Contributor Agreement + +By contributing, +you agree that we may redistribute your work under [our license](LICENSE.md). +In exchange, +we will address your issues and/or assess your change proposal as promptly as we can, +and help you become a member of our community. +Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +agrees to abide by our [code of conduct](CONDUCT.md). + +## How to Contribute + +The easiest way to get started is to file an issue +to tell us about a spelling mistake, +some awkward wording, +or a factual error. +This is a good way to introduce yourself +and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, + you can [send us comments by email][contact]. + However, + we will be able to respond more quickly if you use one of the other methods described below. + +2. If you have a [GitHub][github] account, + or are willing to [create one][github-join], + but do not know how to use Git, + you can report problems or suggest improvements by [creating an issue][issues]. + This allows us to assign the item to someone + and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, + and would like to add or change material, + you can submit a pull request (PR). + Instructions for doing this are [included below](#using-github). + +## Where to Contribute + +1. If you wish to change this lesson, + please work in https\://github.com/swcarpentry/shell-novice, + which can be viewed at https\://swcarpentry.github.io/shell-novice. + +2. If you wish to change the example lesson, + please work in https\://github.com/carpentries/lesson-example, + which documents the format of our lessons + and can be viewed at https\://carpentries.github.io/lesson-example. + +3. 
If you wish to change the template used for workshop websites, + please work in https\://github.com/carpentries/workshop-template. + The home page of that repository explains how to set up workshop websites, + while the extra pages in https\://carpentries.github.io/workshop-template + provide more background on our design choices. + +4. If you wish to change CSS style files, tools, + or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, + please work in https\://github.com/carpentries/styles. + +## What to Contribute + +There are many ways to contribute, +from writing new exercises and improving existing ones +to updating or filling in the documentation +and submitting [bug reports][issues] +about things that don't work, aren't clear, or are missing. +If you are looking for ideas, +please see [the list of issues for this repository][issues], +or the issues for [Data Carpentry][dc-issues] +and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: +we are smarter together than we are on our own. +Reviews from novices and newcomers are particularly valuable: +it's easy for people who have been using these lessons for a while +to forget how impenetrable some of this material can be, +so fresh eyes are always welcome. + +## What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical workshop, +so we are usually _not_ looking for more concepts or tools to add to them. +As a rule, +if you want to introduce a new idea, +you must (a) estimate how long it will take to teach +and (b) explain what you would take out to make room for it. +The first encourages contributors to be honest about requirements; +the second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one platform. 
+Our workshops typically contain a mixture of Windows, macOS, and Linux users; +in order to be usable, +our lessons must run equally well on all three. + +## Using GitHub + +If you choose to contribute via GitHub, +you may want to look at +[How to Contribute to an Open Source Project on GitHub][how-contribute]. +In brief: + +1. The published copy of the lesson is in the `gh-pages` branch of the repository + (so that GitHub will regenerate it automatically). + Please create all branches from that, + and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch + before starting work. + Please do _not_ work directly in your `gh-pages` branch, + since that will make it difficult for you to work on other contributions. + +2. We use [GitHub flow][github-flow] to manage changes: + 1. Create a new branch in your desktop copy of this repository for each significant change. + 2. Commit the change in that branch. + 3. Push that branch to your fork of this repository on GitHub. + 4. Submit a pull request from that branch to the [master repository][repo]. + 5. If you receive feedback, + make changes on your desktop and push to your branch on GitHub: + the pull request will update automatically. + +Each lesson has two maintainers who review issues and pull requests +or encourage others to do so. +The maintainers are community volunteers, +and have final say over what gets merged into the lesson. + +## Other Resources + +General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +happens on the [discussion mailing list][discuss-list], +which everyone is welcome to join. +You can also [reach us by email][contact]. 
+ +[contact]: mailto:admin@software-carpentry.org + +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry + +[dc-lessons]: http://datacarpentry.org/lessons/ + +[dc-site]: http://datacarpentry.org/ + +[discuss-list]: http://lists.software-carpentry.org/listinfo/discuss + +[github]: http://github.com + +[github-flow]: https://guides.github.com/introduction/flow/ + +[github-join]: https://github.com/join + +[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github + +[issues]: https://github.com/swcarpentry/shell-novice/issues/ + +[repo]: https://github.com/swcarpentry/shell-novice/ + +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry + +[swc-lessons]: http://software-carpentry.org/lessons/ + +[swc-site]: http://software-carpentry.org/ From 10e559035b9c3c71020c2309ca785179713dddd7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:01 +0900 Subject: [PATCH 074/334] New translations contributing.md (Japanese) --- locale/ja/CONTRIBUTING.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 locale/ja/CONTRIBUTING.md diff --git a/locale/ja/CONTRIBUTING.md b/locale/ja/CONTRIBUTING.md new file mode 100644 index 000000000..e80f40421 --- /dev/null +++ b/locale/ja/CONTRIBUTING.md @@ -0,0 +1,164 @@ +# Contributing + +[Software Carpentry][swc-site] and [Data Carpentry][dc-site] are open source projects, +and we welcome contributions of all kinds: +new lessons, +fixes to existing material, +bug reports, +and reviews of proposed changes are all welcome. + +## Contributor Agreement + +By contributing, +you agree that we may redistribute your work under [our license](LICENSE.md). +In exchange, +we will address your issues and/or assess your change proposal as promptly as we can, +and help you become a member of our community. 
+Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +agrees to abide by our [code of conduct](CONDUCT.md). + +## How to Contribute + +The easiest way to get started is to file an issue +to tell us about a spelling mistake, +some awkward wording, +or a factual error. +This is a good way to introduce yourself +and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, + you can [send us comments by email][contact]. + However, + we will be able to respond more quickly if you use one of the other methods described below. + +2. If you have a [GitHub][github] account, + or are willing to [create one][github-join], + but do not know how to use Git, + you can report problems or suggest improvements by [creating an issue][issues]. + This allows us to assign the item to someone + and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, + and would like to add or change material, + you can submit a pull request (PR). + Instructions for doing this are [included below](#using-github). + +## Where to Contribute + +1. If you wish to change this lesson, + please work in https\://github.com/swcarpentry/shell-novice, + which can be viewed at https\://swcarpentry.github.io/shell-novice. + +2. If you wish to change the example lesson, + please work in https\://github.com/carpentries/lesson-example, + which documents the format of our lessons + and can be viewed at https\://carpentries.github.io/lesson-example. + +3. If you wish to change the template used for workshop websites, + please work in https\://github.com/carpentries/workshop-template. + The home page of that repository explains how to set up workshop websites, + while the extra pages in https\://carpentries.github.io/workshop-template + provide more background on our design choices. + +4. 
If you wish to change CSS style files, tools, + or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, + please work in https\://github.com/carpentries/styles. + +## What to Contribute + +There are many ways to contribute, +from writing new exercises and improving existing ones +to updating or filling in the documentation +and submitting [bug reports][issues] +about things that don't work, aren't clear, or are missing. +If you are looking for ideas, +please see [the list of issues for this repository][issues], +or the issues for [Data Carpentry][dc-issues] +and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: +we are smarter together than we are on our own. +Reviews from novices and newcomers are particularly valuable: +it's easy for people who have been using these lessons for a while +to forget how impenetrable some of this material can be, +so fresh eyes are always welcome. + +## What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical workshop, +so we are usually _not_ looking for more concepts or tools to add to them. +As a rule, +if you want to introduce a new idea, +you must (a) estimate how long it will take to teach +and (b) explain what you would take out to make room for it. +The first encourages contributors to be honest about requirements; +the second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one platform. +Our workshops typically contain a mixture of Windows, macOS, and Linux users; +in order to be usable, +our lessons must run equally well on all three. + +## Using GitHub + +If you choose to contribute via GitHub, +you may want to look at +[How to Contribute to an Open Source Project on GitHub][how-contribute]. +In brief: + +1. 
The published copy of the lesson is in the `gh-pages` branch of the repository + (so that GitHub will regenerate it automatically). + Please create all branches from that, + and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch + before starting work. + Please do _not_ work directly in your `gh-pages` branch, + since that will make it difficult for you to work on other contributions. + +2. We use [GitHub flow][github-flow] to manage changes: + 1. Create a new branch in your desktop copy of this repository for each significant change. + 2. Commit the change in that branch. + 3. Push that branch to your fork of this repository on GitHub. + 4. Submit a pull request from that branch to the [master repository][repo]. + 5. If you receive feedback, + make changes on your desktop and push to your branch on GitHub: + the pull request will update automatically. + +Each lesson has two maintainers who review issues and pull requests +or encourage others to do so. +The maintainers are community volunteers, +and have final say over what gets merged into the lesson. + +## Other Resources + +General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +happens on the [discussion mailing list][discuss-list], +which everyone is welcome to join. +You can also [reach us by email][contact]. 
+ +[contact]: mailto:admin@software-carpentry.org + +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry + +[dc-lessons]: http://datacarpentry.org/lessons/ + +[dc-site]: http://datacarpentry.org/ + +[discuss-list]: http://lists.software-carpentry.org/listinfo/discuss + +[github]: http://github.com + +[github-flow]: https://guides.github.com/introduction/flow/ + +[github-join]: https://github.com/join + +[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github + +[issues]: https://github.com/swcarpentry/shell-novice/issues/ + +[repo]: https://github.com/swcarpentry/shell-novice/ + +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry + +[swc-lessons]: http://software-carpentry.org/lessons/ + +[swc-site]: http://software-carpentry.org/ From 4d20e1117698e0a4dbdadd0f0ada3bce91ec2dd5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:02 +0900 Subject: [PATCH 075/334] New translations contributing.md (Portuguese) --- locale/pt/CONTRIBUTING.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 locale/pt/CONTRIBUTING.md diff --git a/locale/pt/CONTRIBUTING.md b/locale/pt/CONTRIBUTING.md new file mode 100644 index 000000000..e80f40421 --- /dev/null +++ b/locale/pt/CONTRIBUTING.md @@ -0,0 +1,164 @@ +# Contributing + +[Software Carpentry][swc-site] and [Data Carpentry][dc-site] are open source projects, +and we welcome contributions of all kinds: +new lessons, +fixes to existing material, +bug reports, +and reviews of proposed changes are all welcome. + +## Contributor Agreement + +By contributing, +you agree that we may redistribute your work under [our license](LICENSE.md). +In exchange, +we will address your issues and/or assess your change proposal as promptly as we can, +and help you become a member of our community. 
+Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +agrees to abide by our [code of conduct](CONDUCT.md). + +## How to Contribute + +The easiest way to get started is to file an issue +to tell us about a spelling mistake, +some awkward wording, +or a factual error. +This is a good way to introduce yourself +and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, + you can [send us comments by email][contact]. + However, + we will be able to respond more quickly if you use one of the other methods described below. + +2. If you have a [GitHub][github] account, + or are willing to [create one][github-join], + but do not know how to use Git, + you can report problems or suggest improvements by [creating an issue][issues]. + This allows us to assign the item to someone + and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, + and would like to add or change material, + you can submit a pull request (PR). + Instructions for doing this are [included below](#using-github). + +## Where to Contribute + +1. If you wish to change this lesson, + please work in https\://github.com/swcarpentry/shell-novice, + which can be viewed at https\://swcarpentry.github.io/shell-novice. + +2. If you wish to change the example lesson, + please work in https\://github.com/carpentries/lesson-example, + which documents the format of our lessons + and can be viewed at https\://carpentries.github.io/lesson-example. + +3. If you wish to change the template used for workshop websites, + please work in https\://github.com/carpentries/workshop-template. + The home page of that repository explains how to set up workshop websites, + while the extra pages in https\://carpentries.github.io/workshop-template + provide more background on our design choices. + +4. 
If you wish to change CSS style files, tools, + or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, + please work in https\://github.com/carpentries/styles. + +## What to Contribute + +There are many ways to contribute, +from writing new exercises and improving existing ones +to updating or filling in the documentation +and submitting [bug reports][issues] +about things that don't work, aren't clear, or are missing. +If you are looking for ideas, +please see [the list of issues for this repository][issues], +or the issues for [Data Carpentry][dc-issues] +and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: +we are smarter together than we are on our own. +Reviews from novices and newcomers are particularly valuable: +it's easy for people who have been using these lessons for a while +to forget how impenetrable some of this material can be, +so fresh eyes are always welcome. + +## What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical workshop, +so we are usually _not_ looking for more concepts or tools to add to them. +As a rule, +if you want to introduce a new idea, +you must (a) estimate how long it will take to teach +and (b) explain what you would take out to make room for it. +The first encourages contributors to be honest about requirements; +the second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one platform. +Our workshops typically contain a mixture of Windows, macOS, and Linux users; +in order to be usable, +our lessons must run equally well on all three. + +## Using GitHub + +If you choose to contribute via GitHub, +you may want to look at +[How to Contribute to an Open Source Project on GitHub][how-contribute]. +In brief: + +1. 
The published copy of the lesson is in the `gh-pages` branch of the repository + (so that GitHub will regenerate it automatically). + Please create all branches from that, + and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch + before starting work. + Please do _not_ work directly in your `gh-pages` branch, + since that will make it difficult for you to work on other contributions. + +2. We use [GitHub flow][github-flow] to manage changes: + 1. Create a new branch in your desktop copy of this repository for each significant change. + 2. Commit the change in that branch. + 3. Push that branch to your fork of this repository on GitHub. + 4. Submit a pull request from that branch to the [master repository][repo]. + 5. If you receive feedback, + make changes on your desktop and push to your branch on GitHub: + the pull request will update automatically. + +Each lesson has two maintainers who review issues and pull requests +or encourage others to do so. +The maintainers are community volunteers, +and have final say over what gets merged into the lesson. + +## Other Resources + +General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +happens on the [discussion mailing list][discuss-list], +which everyone is welcome to join. +You can also [reach us by email][contact]. 
+ +[contact]: mailto:admin@software-carpentry.org + +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry + +[dc-lessons]: http://datacarpentry.org/lessons/ + +[dc-site]: http://datacarpentry.org/ + +[discuss-list]: http://lists.software-carpentry.org/listinfo/discuss + +[github]: http://github.com + +[github-flow]: https://guides.github.com/introduction/flow/ + +[github-join]: https://github.com/join + +[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github + +[issues]: https://github.com/swcarpentry/shell-novice/issues/ + +[repo]: https://github.com/swcarpentry/shell-novice/ + +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry + +[swc-lessons]: http://software-carpentry.org/lessons/ + +[swc-site]: http://software-carpentry.org/ From 28e9e887ec20bc63141e55ffec0b0c94af4669c9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:04 +0900 Subject: [PATCH 076/334] New translations contributing.md (Chinese Simplified) --- locale/zh/CONTRIBUTING.md | 164 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 locale/zh/CONTRIBUTING.md diff --git a/locale/zh/CONTRIBUTING.md b/locale/zh/CONTRIBUTING.md new file mode 100644 index 000000000..e80f40421 --- /dev/null +++ b/locale/zh/CONTRIBUTING.md @@ -0,0 +1,164 @@ +# Contributing + +[Software Carpentry][swc-site] and [Data Carpentry][dc-site] are open source projects, +and we welcome contributions of all kinds: +new lessons, +fixes to existing material, +bug reports, +and reviews of proposed changes are all welcome. + +## Contributor Agreement + +By contributing, +you agree that we may redistribute your work under [our license](LICENSE.md). +In exchange, +we will address your issues and/or assess your change proposal as promptly as we can, +and help you become a member of our community. 
+Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +agrees to abide by our [code of conduct](CONDUCT.md). + +## How to Contribute + +The easiest way to get started is to file an issue +to tell us about a spelling mistake, +some awkward wording, +or a factual error. +This is a good way to introduce yourself +and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, + you can [send us comments by email][contact]. + However, + we will be able to respond more quickly if you use one of the other methods described below. + +2. If you have a [GitHub][github] account, + or are willing to [create one][github-join], + but do not know how to use Git, + you can report problems or suggest improvements by [creating an issue][issues]. + This allows us to assign the item to someone + and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, + and would like to add or change material, + you can submit a pull request (PR). + Instructions for doing this are [included below](#using-github). + +## Where to Contribute + +1. If you wish to change this lesson, + please work in https\://github.com/swcarpentry/shell-novice, + which can be viewed at https\://swcarpentry.github.io/shell-novice. + +2. If you wish to change the example lesson, + please work in https\://github.com/carpentries/lesson-example, + which documents the format of our lessons + and can be viewed at https\://carpentries.github.io/lesson-example. + +3. If you wish to change the template used for workshop websites, + please work in https\://github.com/carpentries/workshop-template. + The home page of that repository explains how to set up workshop websites, + while the extra pages in https\://carpentries.github.io/workshop-template + provide more background on our design choices. + +4. 
If you wish to change CSS style files, tools, + or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, + please work in https\://github.com/carpentries/styles. + +## What to Contribute + +There are many ways to contribute, +from writing new exercises and improving existing ones +to updating or filling in the documentation +and submitting [bug reports][issues] +about things that don't work, aren't clear, or are missing. +If you are looking for ideas, +please see [the list of issues for this repository][issues], +or the issues for [Data Carpentry][dc-issues] +and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: +we are smarter together than we are on our own. +Reviews from novices and newcomers are particularly valuable: +it's easy for people who have been using these lessons for a while +to forget how impenetrable some of this material can be, +so fresh eyes are always welcome. + +## What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical workshop, +so we are usually _not_ looking for more concepts or tools to add to them. +As a rule, +if you want to introduce a new idea, +you must (a) estimate how long it will take to teach +and (b) explain what you would take out to make room for it. +The first encourages contributors to be honest about requirements; +the second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one platform. +Our workshops typically contain a mixture of Windows, macOS, and Linux users; +in order to be usable, +our lessons must run equally well on all three. + +## Using GitHub + +If you choose to contribute via GitHub, +you may want to look at +[How to Contribute to an Open Source Project on GitHub][how-contribute]. +In brief: + +1. 
The published copy of the lesson is in the `gh-pages` branch of the repository + (so that GitHub will regenerate it automatically). + Please create all branches from that, + and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch + before starting work. + Please do _not_ work directly in your `gh-pages` branch, + since that will make it difficult for you to work on other contributions. + +2. We use [GitHub flow][github-flow] to manage changes: + 1. Create a new branch in your desktop copy of this repository for each significant change. + 2. Commit the change in that branch. + 3. Push that branch to your fork of this repository on GitHub. + 4. Submit a pull request from that branch to the [master repository][repo]. + 5. If you receive feedback, + make changes on your desktop and push to your branch on GitHub: + the pull request will update automatically. + +Each lesson has two maintainers who review issues and pull requests +or encourage others to do so. +The maintainers are community volunteers, +and have final say over what gets merged into the lesson. + +## Other Resources + +General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] +happens on the [discussion mailing list][discuss-list], +which everyone is welcome to join. +You can also [reach us by email][contact]. 
+ +[contact]: mailto:admin@software-carpentry.org + +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry + +[dc-lessons]: http://datacarpentry.org/lessons/ + +[dc-site]: http://datacarpentry.org/ + +[discuss-list]: http://lists.software-carpentry.org/listinfo/discuss + +[github]: http://github.com + +[github-flow]: https://guides.github.com/introduction/flow/ + +[github-join]: https://github.com/join + +[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github + +[issues]: https://github.com/swcarpentry/shell-novice/issues/ + +[repo]: https://github.com/swcarpentry/shell-novice/ + +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry + +[swc-lessons]: http://software-carpentry.org/lessons/ + +[swc-site]: http://software-carpentry.org/ From 0c19b3d4ef512cb2c768ce96950a8b1ddfc17e9c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:05 +0900 Subject: [PATCH 077/334] New translations license.md (French) --- locale/fr/LICENSE.md | 86 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 locale/fr/LICENSE.md diff --git a/locale/fr/LICENSE.md b/locale/fr/LICENSE.md new file mode 100644 index 000000000..bc98317a1 --- /dev/null +++ b/locale/fr/LICENSE.md @@ -0,0 +1,86 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is +made available under the [Creative Commons Attribution +license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the +license terms. 
+ +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that + your work is derived from work that is Copyright © Software + Carpentry and, where practical, linking to + http\://software-carpentry.org/), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do + so in any reasonable manner, but not in any way that suggests the + licensor endorses you or your use. + +**No additional restrictions**---You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the + material in the public domain or where your use is permitted by an + applicable exception or limitation. +- No warranties are given. The license may not give you all of the + permissions necessary for your intended use. For example, other + rights such as publicity, privacy, or moral rights may limit how you + use the material. + +## Software + +Except where otherwise noted, the example programs and other software +provided by Software Carpentry and Data Carpentry are made available under the +[OSI][osi]-approved +[MIT license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +## Trademark + +"Software Carpentry" and "Data Carpentry" and their respective logos +are registered trademarks of [Community Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ + +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode + +[mit-license]: https://opensource.org/licenses/mit-license.html + +[ci]: http://communityin.org/ + +[osi]: https://opensource.org From 2a423f84f5bbc6d1cf4af4741a887a66cf522074 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:06 +0900 Subject: [PATCH 078/334] New translations license.md (Spanish) --- locale/es/LICENSE.md | 86 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 locale/es/LICENSE.md diff --git a/locale/es/LICENSE.md b/locale/es/LICENSE.md new file mode 100644 index 000000000..bc98317a1 --- /dev/null +++ b/locale/es/LICENSE.md @@ -0,0 +1,86 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is +made available under the [Creative Commons Attribution +license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. 
+ +The licensor cannot revoke these freedoms as long as you follow the +license terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that + your work is derived from work that is Copyright © Software + Carpentry and, where practical, linking to + http\://software-carpentry.org/), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do + so in any reasonable manner, but not in any way that suggests the + licensor endorses you or your use. + +**No additional restrictions**---You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the + material in the public domain or where your use is permitted by an + applicable exception or limitation. +- No warranties are given. The license may not give you all of the + permissions necessary for your intended use. For example, other + rights such as publicity, privacy, or moral rights may limit how you + use the material. + +## Software + +Except where otherwise noted, the example programs and other software +provided by Software Carpentry and Data Carpentry are made available under the +[OSI][osi]-approved +[MIT license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +## Trademark + +"Software Carpentry" and "Data Carpentry" and their respective logos +are registered trademarks of [Community Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ + +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode + +[mit-license]: https://opensource.org/licenses/mit-license.html + +[ci]: http://communityin.org/ + +[osi]: https://opensource.org From 76d0254decfc395fa09a6802baa4487c671ca82d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:07 +0900 Subject: [PATCH 079/334] New translations license.md (Japanese) --- locale/ja/LICENSE.md | 86 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 locale/ja/LICENSE.md diff --git a/locale/ja/LICENSE.md b/locale/ja/LICENSE.md new file mode 100644 index 000000000..bc98317a1 --- /dev/null +++ b/locale/ja/LICENSE.md @@ -0,0 +1,86 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is +made available under the [Creative Commons Attribution +license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. 
+ +The licensor cannot revoke these freedoms as long as you follow the +license terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that + your work is derived from work that is Copyright © Software + Carpentry and, where practical, linking to + http\://software-carpentry.org/), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do + so in any reasonable manner, but not in any way that suggests the + licensor endorses you or your use. + +**No additional restrictions**---You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the + material in the public domain or where your use is permitted by an + applicable exception or limitation. +- No warranties are given. The license may not give you all of the + permissions necessary for your intended use. For example, other + rights such as publicity, privacy, or moral rights may limit how you + use the material. + +## Software + +Except where otherwise noted, the example programs and other software +provided by Software Carpentry and Data Carpentry are made available under the +[OSI][osi]-approved +[MIT license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +## Trademark + +"Software Carpentry" and "Data Carpentry" and their respective logos +are registered trademarks of [Community Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ + +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode + +[mit-license]: https://opensource.org/licenses/mit-license.html + +[ci]: http://communityin.org/ + +[osi]: https://opensource.org From b80afdca6a049925d47da533c821423b930697ce Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:08 +0900 Subject: [PATCH 080/334] New translations license.md (Portuguese) --- locale/pt/LICENSE.md | 86 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 locale/pt/LICENSE.md diff --git a/locale/pt/LICENSE.md b/locale/pt/LICENSE.md new file mode 100644 index 000000000..bc98317a1 --- /dev/null +++ b/locale/pt/LICENSE.md @@ -0,0 +1,86 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is +made available under the [Creative Commons Attribution +license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. 
+ +The licensor cannot revoke these freedoms as long as you follow the +license terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that + your work is derived from work that is Copyright © Software + Carpentry and, where practical, linking to + http\://software-carpentry.org/), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do + so in any reasonable manner, but not in any way that suggests the + licensor endorses you or your use. + +**No additional restrictions**---You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the + material in the public domain or where your use is permitted by an + applicable exception or limitation. +- No warranties are given. The license may not give you all of the + permissions necessary for your intended use. For example, other + rights such as publicity, privacy, or moral rights may limit how you + use the material. + +## Software + +Except where otherwise noted, the example programs and other software +provided by Software Carpentry and Data Carpentry are made available under the +[OSI][osi]-approved +[MIT license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +## Trademark + +"Software Carpentry" and "Data Carpentry" and their respective logos +are registered trademarks of [Community Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ + +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode + +[mit-license]: https://opensource.org/licenses/mit-license.html + +[ci]: http://communityin.org/ + +[osi]: https://opensource.org From 85f92e8964ed3e9bc3854f8a5fecc5e1c41c50e6 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:09 +0900 Subject: [PATCH 081/334] New translations license.md (Chinese Simplified) --- locale/zh/LICENSE.md | 86 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) create mode 100644 locale/zh/LICENSE.md diff --git a/locale/zh/LICENSE.md b/locale/zh/LICENSE.md new file mode 100644 index 000000000..bc98317a1 --- /dev/null +++ b/locale/zh/LICENSE.md @@ -0,0 +1,86 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is +made available under the [Creative Commons Attribution +license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. 
+ +The licensor cannot revoke these freedoms as long as you follow the +license terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that + your work is derived from work that is Copyright © Software + Carpentry and, where practical, linking to + http\://software-carpentry.org/), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do + so in any reasonable manner, but not in any way that suggests the + licensor endorses you or your use. + +**No additional restrictions**---You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the + material in the public domain or where your use is permitted by an + applicable exception or limitation. +- No warranties are given. The license may not give you all of the + permissions necessary for your intended use. For example, other + rights such as publicity, privacy, or moral rights may limit how you + use the material. + +## Software + +Except where otherwise noted, the example programs and other software +provided by Software Carpentry and Data Carpentry are made available under the +[OSI][osi]-approved +[MIT license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION +OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +## Trademark + +"Software Carpentry" and "Data Carpentry" and their respective logos +are registered trademarks of [Community Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ + +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode + +[mit-license]: https://opensource.org/licenses/mit-license.html + +[ci]: http://communityin.org/ + +[osi]: https://opensource.org From 3e88b7e1a3e386a5d60c7d6fa055e433e32c439a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:11 +0900 Subject: [PATCH 082/334] New translations readme.md (French) --- locale/fr/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 locale/fr/README.md diff --git a/locale/fr/README.md b/locale/fr/README.md new file mode 100644 index 000000000..8ab3d42f4 --- /dev/null +++ b/locale/fr/README.md @@ -0,0 +1,75 @@ +# Introduction to genomic data analysis with R and Bioconductor + +[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) + +## Contributing + +We welcome all contributions to improve the lesson! Maintainers will +do their best to help you if you have any questions, concerns, or +experience any difficulties along the way. 
+ +We'd like to ask you to familiarize yourself with our Contribution +Guide and have a look at the [more detailed +guidelines][lesson-example] on proper formatting, ways to render the +lesson locally, and even how to write new episodes. + +Please see the current list of [issues][FIXME] for ideas for +contributing to this repository. For making your contribution, we use +the GitHub flow, which is nicely explained in the chapter +Contributing to a +Project +in Pro Git by Scott Chacon. + +Look for the tag +![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +indicates that the maintainers will welcome a pull request fixing this +issue. + +## Useful links + +- If you're going to be developing lesson material for the first time + according to our design principles, consider reading the + [Carpentries Curriculum Development Handbook][cdh] +- Consult the [Lesson Example][lesson-example] website to find out more about + working with the lesson template + +## Lesson team + +This lesson has been developed and is currently maintained by + +- Laurent Gatto (maintainer) +- Charlotte Soneson +- Jenny Drnevich +- Robert Castelo +- Kevin Rue-Albert + +We would also like to acknowledge the contributions of: + +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. + +If you have contributed but we missed you, apologies, and feel free to add yourself with a PR.
+ +## Authors + +A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) + +## Citation + +To cite this lesson, please consult with [CITATION](CITATION) + +[lesson-example]: https://carpentries.github.io/lesson-example + +[cdh]: https://cdh.carpentries.org + +## Testing locally + +To test locally, run the following in the lessons directory: + +```r +sandpaper::serve() +``` + +For more details, see the [workbench installation +instructions](https://carpentries.github.io/workbench/#installation). From 843ecebc689e5dd89277ac2d6acec0c68d41db72 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:12 +0900 Subject: [PATCH 083/334] New translations readme.md (Spanish) --- locale/es/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 locale/es/README.md diff --git a/locale/es/README.md b/locale/es/README.md new file mode 100644 index 000000000..8ab3d42f4 --- /dev/null +++ b/locale/es/README.md @@ -0,0 +1,75 @@ +# Introduction to genomic data analysis with R and Bioconductor + +[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) + +## Contributing + +We welcome all contributions to improve the lesson! Maintainers will +do their best to help you if you have any questions, concerns, or +experience any difficulties along the way. + +We'd like to ask you to familiarize yourself with our Contribution +Guide and have a look at the [more detailed +guidelines][lesson-example] on proper formatting, ways to render the +lesson locally, and even how to write new episodes. + +Please see the current list of [issues][FIXME] for ideas for +contributing to this repository. For making your contribution, we use +the GitHub flow, which is nicely explained in the chapter +Contributing to a +Project +in Pro Git by Scott Chacon.
+ +Look for the tag +![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +indicates that the maintainers will welcome a pull request fixing this +issue. + +## Useful links + +- If you're going to be developing lesson material for the first time + according to our design principles, consider reading the + [Carpentries Curriculum Development Handbook][cdh] +- Consult the [Lesson Example][lesson-example] website to find out more about + working with the lesson template + +## Lesson team + +This lesson has been developed and is currently maintained by + +- Laurent Gatto (maintainer) +- Charlotte Soneson +- Jenny Drnevich +- Robert Castelo +- Kevin Rue-Albert + +We would also like to acknowledge the contributions of: + +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. + +If you have contributed but we missed you, apologies, and feel free to add yourself with a PR. + +## Authors + +A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) + +## Citation + +To cite this lesson, please consult with [CITATION](CITATION) + +[lesson-example]: https://carpentries.github.io/lesson-example + +[cdh]: https://cdh.carpentries.org + +## Testing locally + +To test locally, run the following in the lessons directory: + +```r +sandpaper::serve() +``` + +For more details, see the [workbench installation +instructions](https://carpentries.github.io/workbench/#installation).
From 5ce6b519747b7c2470245c5f3db637abc5f52763 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:13 +0900 Subject: [PATCH 084/334] New translations readme.md (Japanese) --- locale/ja/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 locale/ja/README.md diff --git a/locale/ja/README.md b/locale/ja/README.md new file mode 100644 index 000000000..8ab3d42f4 --- /dev/null +++ b/locale/ja/README.md @@ -0,0 +1,75 @@ +# Introduction to genomic data analysis with R and Bioconductor + +[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) + +## Contributing + +We welcome all contributions to improve the lesson! Maintainers will +do their best to help you if you have any questions, concerns, or +experience any difficulties along the way. + +We'd like to ask you to familiarize yourself with our Contribution +Guide and have a look at the [more detailed +guidelines][lesson-example] on proper formatting, ways to render the +lesson locally, and even how to write new episodes. + +Please see the current list of [issues][FIXME] for ideas for +contributing to this repository. For making your contribution, we use +the GitHub flow, which is nicely explained in the chapter +Contributing to a +Project +in Pro Git by Scott Chacon. + +Look for the tag +![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +indicates that the maintainers will welcome a pull request fixing this +issue. 
+ +## Useful links + +- If you're going to be developing lesson material for the first time + according to our design principles, consider reading the + [Carpentries Curriculum Development Handbook][cdh] +- Consult the [Lesson Example][lesson-example] website to find out more about + working with the lesson template + +## Lesson team + +This lesson has been developed and is currently maintained by + +- Laurent Gatto (maintainer) +- Charlotte Soneson +- Jenny Drnevich +- Robert Castelo +- Kevin Rue-Albert + +We would also like to acknowledge the contributions of: + +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. + +If you have contributed but we missed you, apologies, and feel free to add yourself with a PR. + +## Authors + +A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) + +## Citation + +To cite this lesson, please consult with [CITATION](CITATION) + +[lesson-example]: https://carpentries.github.io/lesson-example + +[cdh]: https://cdh.carpentries.org + +## Testing locally + +To test locally, run the following in the lessons directory: + +```r +sandpaper::serve() +``` + +For more details, see the [workbench installation +instructions](https://carpentries.github.io/workbench/#installation).
From 22c5ee0e064b11fd98496bc394c501ae1a7c525c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:14 +0900 Subject: [PATCH 085/334] New translations readme.md (Portuguese) --- locale/pt/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 locale/pt/README.md diff --git a/locale/pt/README.md b/locale/pt/README.md new file mode 100644 index 000000000..8ab3d42f4 --- /dev/null +++ b/locale/pt/README.md @@ -0,0 +1,75 @@ +# Introduction to genomic data analysis with R and Bioconductor + +[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) + +## Contributing + +We welcome all contributions to improve the lesson! Maintainers will +do their best to help you if you have any questions, concerns, or +experience any difficulties along the way. + +We'd like to ask you to familiarize yourself with our Contribution +Guide and have a look at the [more detailed +guidelines][lesson-example] on proper formatting, ways to render the +lesson locally, and even how to write new episodes. + +Please see the current list of [issues][FIXME] for ideas for +contributing to this repository. For making your contribution, we use +the GitHub flow, which is nicely explained in the chapter +Contributing to a +Project +in Pro Git by Scott Chacon. + +Look for the tag +![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +indicates that the maintainers will welcome a pull request fixing this +issue. 
+ +## Useful links + +- If you're going to be developing lesson material for the first time + according to our design principles, consider reading the + [Carpentries Curriculum Development Handbook][cdh] +- Consult the [Lesson Example][lesson-example] website to find out more about + working with the lesson template + +## Lesson team + +This lesson has been developed and is currently maintained by + +- Laurent Gatto (maintainer) +- Charlotte Soneson +- Jenny Drnevich +- Robert Castelo +- Kevin Rue-Albert + +We would also like to acknowledge the contributions of: + +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. + +If you have contributed but we missed you, apologies, and feel free to add yourself with a PR. + +## Authors + +A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) + +## Citation + +To cite this lesson, please consult with [CITATION](CITATION) + +[lesson-example]: https://carpentries.github.io/lesson-example + +[cdh]: https://cdh.carpentries.org + +## Testing locally + +To test locally, run the following in the lessons directory: + +```r +sandpaper::serve() +``` + +For more details, see the [workbench installation +instructions](https://carpentries.github.io/workbench/#installation).
From e67b26607dccf9ef2b4526731cd851f27c8d9494 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:16 +0900 Subject: [PATCH 086/334] New translations readme.md (Chinese Simplified) --- locale/zh/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 locale/zh/README.md diff --git a/locale/zh/README.md b/locale/zh/README.md new file mode 100644 index 000000000..8ab3d42f4 --- /dev/null +++ b/locale/zh/README.md @@ -0,0 +1,75 @@ +# Introduction to genomic data analysis with R and Bioconductor + +[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) + +## Contributing + +We welcome all contributions to improve the lesson! Maintainers will +do their best to help you if you have any questions, concerns, or +experience any difficulties along the way. + +We'd like to ask you to familiarize yourself with our Contribution +Guide and have a look at the [more detailed +guidelines][lesson-example] on proper formatting, ways to render the +lesson locally, and even how to write new episodes. + +Please see the current list of [issues][FIXME] for ideas for +contributing to this repository. For making your contribution, we use +the GitHub flow, which is nicely explained in the chapter +Contributing to a +Project +in Pro Git by Scott Chacon. + +Look for the tag +![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +indicates that the maintainers will welcome a pull request fixing this +issue. 
+ +## Useful links + +- If you're going to be developing lesson material for the first time + according to our design principles, consider reading the + [Carpentries Curriculum Development Handbook][cdh] +- Consult the [Lesson Example][lesson-example] website to find out more about + working with the lesson template + +## Lesson team + +This lesson has been developed and is currently maintained by + +- Laurent Gatto (maintainer) +- Charlotte Soneson +- Jenny Drnevich +- Robert Castelo +- Kevin Rue-Albert + +We would also like to acknowledge the contributions of: + +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. + +If you have contributed but we missed you, apologies, and feel free to add yourself with a PR. + +## Authors + +A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) + +## Citation + +To cite this lesson, please consult with [CITATION](CITATION) + +[lesson-example]: https://carpentries.github.io/lesson-example + +[cdh]: https://cdh.carpentries.org + +## Testing locally + +To test locally, run the following in the lessons directory: + +```r +sandpaper::serve() +``` + +For more details, see the [workbench installation +instructions](https://carpentries.github.io/workbench/#installation).
From 6bd7b4cb92d0b2adae3cf691d7069adfe2050bc0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:17 +0900 Subject: [PATCH 087/334] New translations index.md (French) --- locale/fr/index.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 locale/fr/index.md diff --git a/locale/fr/index.md b/locale/fr/index.md new file mode 100644 index 000000000..daa0cf39d --- /dev/null +++ b/locale/fr/index.md @@ -0,0 +1,14 @@ +--- +permalink: index.html +site: sandpaper::sandpaper_site +--- + +## About this course + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +- Familiarity with tabular data and spreadsheets. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 1a93a93750265adcc0864fe34a6ccd67d3cec54d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:18 +0900 Subject: [PATCH 088/334] New translations index.md (Spanish) --- locale/es/index.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 locale/es/index.md diff --git a/locale/es/index.md b/locale/es/index.md new file mode 100644 index 000000000..daa0cf39d --- /dev/null +++ b/locale/es/index.md @@ -0,0 +1,14 @@ +--- +permalink: index.html +site: sandpaper::sandpaper_site +--- + +## About this course + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +- Familiarity with tabular data and spreadsheets. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From f99fa30ee97b7b59de250b16193e6056880e4a78 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:19 +0900 Subject: [PATCH 089/334] New translations index.md (Japanese) --- locale/ja/index.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 locale/ja/index.md diff --git a/locale/ja/index.md b/locale/ja/index.md new file mode 100644 index 000000000..daa0cf39d --- /dev/null +++ b/locale/ja/index.md @@ -0,0 +1,14 @@ +--- +permalink: index.html +site: sandpaper::sandpaper_site +--- + +## About this course + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +- Familiarity with tabular data and spreadsheets. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From 45b5c45edda4c72f2562470bda9c0ee9e9efc80a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:20 +0900 Subject: [PATCH 090/334] New translations index.md (Portuguese) --- locale/pt/index.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 locale/pt/index.md diff --git a/locale/pt/index.md b/locale/pt/index.md new file mode 100644 index 000000000..daa0cf39d --- /dev/null +++ b/locale/pt/index.md @@ -0,0 +1,14 @@ +--- +permalink: index.html +site: sandpaper::sandpaper_site +--- + +## About this course + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +- Familiarity with tabular data and spreadsheets. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: From 6dfe8b04cb78678972d754b60ccc341c5bab8566 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 06:12:21 +0900 Subject: [PATCH 091/334] New translations index.md (Chinese Simplified) --- locale/zh/index.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 locale/zh/index.md diff --git a/locale/zh/index.md b/locale/zh/index.md new file mode 100644 index 000000000..daa0cf39d --- /dev/null +++ b/locale/zh/index.md @@ -0,0 +1,14 @@ +--- +permalink: index.html +site: sandpaper::sandpaper_site +--- + +## About this course + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +- Familiarity with tabular data and spreadsheets. + +:::::::::::::::::::::::::::::::::::::::::::::::::: From a5ea00b2d51573cc75b85aa4cf3064503827e672 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 07:14:07 +0900 Subject: [PATCH 092/334] New translations 20-r-rstudio.md (Portuguese) --- locale/pt/episodes/20-r-rstudio.Rmd | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/locale/pt/episodes/20-r-rstudio.Rmd b/locale/pt/episodes/20-r-rstudio.Rmd index 8bdebbd77..43348eb71 100644 --- a/locale/pt/episodes/20-r-rstudio.Rmd +++ b/locale/pt/episodes/20-r-rstudio.Rmd @@ -582,11 +582,10 @@ sessionInfo() - The [reprex](https://cran.rstudio.com/web/packages/reprex/) package is very helpful to create reproducible examples when asking for - help. The rOpenSci community call "How to ask questions so they get - answered" (Github - link and video - recording) includes a presentation of - the reprex package and of its philosophy. + help. A comunidade rOpenSci "How to ask questions so they get + answered" ([Github + link](https\://github. om/ropensci/commcalls/issues/14) e [gravação de vídeo](https://vimeo.com/208749032)) inclui uma apresentação de + o pacote reprex e sua filosofia. 
## R packages @@ -594,8 +593,8 @@ sessionInfo() As we have seen above, R packages play a fundamental role in R. The make use of a package's functionality, assuming it is installed, we -first need to load it to be able to use it. This is done with the -`library()` function. Below, we load `ggplot2`. +first need to load it to be able to use it. Isto é feito com a função +`library()`. Abaixo, carregamos o `ggplot2`. ```{r loadp, eval=FALSE, purl=TRUE} library("ggplot2") @@ -605,8 +604,8 @@ library("ggplot2") The default package repository is The _Comprehensive R Archive Network_ (CRAN), and any package that is available on CRAN can be -installed with the `install.packages()` function. Below, for example, -we install the `dplyr` package that we will learn about later. +installed with the `install.packages()` function. Abaixo, por exemplo, +instalamos o pacote `dplyr` que aprenderemos mais tarde. ```{r craninstall, eval=FALSE, purl=TRUE} install.packages("dplyr") @@ -615,7 +614,7 @@ install.packages("dplyr") This command will install the `dplyr` package as well as all its dependencies, i.e. all the packages that it relies on to function. -Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, +Outro repositório de pacotes principais do R é mantido pelo Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, namely `BiocManager`, that can be installed from CRAN with ```{r, eval=FALSE, purl=TRUE} @@ -631,7 +630,7 @@ BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` -By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? 
[a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. +By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. Se houver, mostrará a você e perguntará se você quer `Atualizar todos/alguém/nenhum? [a/s/n]:` e depois espera pela sua resposta. Você deve se esforçar para ter as versões mais atualizadas dos pacotes, no entanto, na prática, recomendamos atualizar pacotes apenas em uma sessão nova em R antes de quaisquer pacotes serem carregados. :::::::::::::::::::::::::::::::::::::::: keypoints From 7fe7ca91506f7ac191a1cc75fc37db93f2f7c8a1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 07:14:15 +0900 Subject: [PATCH 093/334] New translations 25-starting-with-data.md (Portuguese) --- locale/pt/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/pt/episodes/25-starting-with-data.Rmd b/locale/pt/episodes/25-starting-with-data.Rmd index bc29da0cd..cd371d4de 100644 --- a/locale/pt/episodes/25-starting-with-data.Rmd +++ b/locale/pt/episodes/25-starting-with-data.Rmd @@ -474,7 +474,7 @@ example? Check your guesses using `str(country_climate)`: -- Are they what you expected? Why? Why not? +- Are they what you expected? Por quê? Why not? - Try again by adding `stringsAsFactors = TRUE` after the last variable when creating the data frame. What is happening now? 
From e173aa03e5062cf25f47db31105942b986b4b434 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 07:14:24 +0900 Subject: [PATCH 094/334] New translations 40-visualization.md (Portuguese) --- locale/pt/episodes/40-visualization.Rmd | 51 ++++++++++++------------- 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/locale/pt/episodes/40-visualization.Rmd b/locale/pt/episodes/40-visualization.Rmd index b1ab2920c..a635340fc 100644 --- a/locale/pt/episodes/40-visualization.Rmd +++ b/locale/pt/episodes/40-visualization.Rmd @@ -33,8 +33,7 @@ rna <- read.csv("data/rnaseq.csv") ## Data Visualization -We start by loading the required packages. **`ggplot2`** is included in -the **`tidyverse`** package. +We start by loading the required packages. **`ggplot2`** está incluído no pacote **`tidyverse`**. ```{r load-package, message=FALSE, purl=TRUE} library("tidyverse") @@ -51,35 +50,33 @@ The Data Visualization Cheat Sheet will cover the basics and more advanced features of `ggplot2` and will help, in addition to serve as a reminder, getting an overview of the -many data representations available in the package. The following video -tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and -[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen -are also very instructive. +many data representations available in the package. Os seguintes tutoriais em vídeo +([parte 1](https://www.youtube.com/watch?v=h29g21z0a68) e +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) de Thomas Lin Pedersen +são também muito instrutivos. ## Plotting with `ggplot2` `ggplot2` is a plotting package that makes it simple to create complex -plots from data in a data frame. It provides a more programmatic -interface for specifying what variables to plot, how they are displayed, -and general visual properties. The theoretical foundation that supports -the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). 
Using this -approach, we only need minimal changes if the underlying data change or -if we decide to change from a bar plot to a scatterplot. This helps in -creating publication quality plots with minimal amounts of adjustments -and tweaking. - -There is a book about `ggplot2` (@ggplot2book) that provides a good -overview, but it is outdated. The 3rd edition is in preparation and will -be [freely available online](https://ggplot2-book.org/). The `ggplot2` -webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. - -`ggplot2` functions like data in the 'long' format, i.e., a column for -every dimension, and a row for every observation. Well-structured data -will save you lots of time when making figures with `ggplot2`. - -ggplot graphics are built step by step by adding new elements. Adding -layers in this fashion allows for extensive flexibility and -customization of plots. +plots from data in a data frame. Fornece uma interface +mais programática para especificar quais as variáveis a representar, como são apresentadas, +e propriedades visuais gerais. A base teórica que suporta +o `ggplot2` é a _Gramática de Gráficos_ (@Wilkinson:2005). Utilizando esta abordagem, apenas necessitamos de alterações mínimas se os dados subjacentes mudarem ou +se decidirmos mudar de um gráfico de barras para um gráfico de dispersão. Isto ajuda a +criar gráficos com qualidade de publicação com o mínimo de ajustes +e afinações. + +Existe um livro sobre `ggplot2` (@ggplot2book) que fornece uma boa visão geral, mas está desatualizado. A 3ª edição está a ser preparada e será +[disponível gratuitamente online] (https\://ggplot2-book.org/). A página `ggplot2` +([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) fornece uma ampla documentação. + +O `ggplot2` funciona como dados no formato 'long', ou seja, uma coluna para +cada dimensão, e uma linha para cada observação. 
Dados bem estruturados +poupará muito tempo ao fazer figuras com `ggplot2`. + +os gráficos ggplot são construídos passo a passo através da adição de novos elementos. A adição de +camadas desta forma permite uma grande flexibilidade e +personalização das parcelas. > The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, From c373b861ee6b58696ab5fe85ff2214d62be20d9b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 08:09:54 +0900 Subject: [PATCH 095/334] New translations 20-r-rstudio.md (Portuguese) --- locale/pt/episodes/20-r-rstudio.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/pt/episodes/20-r-rstudio.Rmd b/locale/pt/episodes/20-r-rstudio.Rmd index 43348eb71..70f6bc396 100644 --- a/locale/pt/episodes/20-r-rstudio.Rmd +++ b/locale/pt/episodes/20-r-rstudio.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: R and RStudio +title: R e RStudio teaching: 30 exercises: 0 --- @@ -8,9 +8,9 @@ exercises: 0 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +::::::::::::::::::::::::::::::::::::::: Objetivos -- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. +- Descreva a finalidade dos painéis do RStudio: Script, Console, Environment e Plots. - Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. - Use the built-in RStudio help interface to search for more information on R functions. - Demonstrate how to provide sufficient information for troubleshooting with the R user community. 
From e9580abcda754cbb02f1bfb6281cf002e575ab5e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 10:08:56 +0900 Subject: [PATCH 096/334] New translations 10-data-organisation.md (Spanish) --- locale/es/episodes/10-data-organisation.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/10-data-organisation.Rmd b/locale/es/episodes/10-data-organisation.Rmd index 77b5925cd..02c4c6326 100644 --- a/locale/es/episodes/10-data-organisation.Rmd +++ b/locale/es/episodes/10-data-organisation.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Data organisation with spreadsheets +title: Organización de datos con hojas de cálculo teaching: 30 exercises: 30 --- From 34a7badd44c2bf8ca1a774d0b3ea980b9554d389 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 10:09:00 +0900 Subject: [PATCH 097/334] New translations 10-data-organisation.md (Portuguese) --- locale/pt/episodes/10-data-organisation.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/pt/episodes/10-data-organisation.Rmd b/locale/pt/episodes/10-data-organisation.Rmd index 888061af7..9c6928fb3 100644 --- a/locale/pt/episodes/10-data-organisation.Rmd +++ b/locale/pt/episodes/10-data-organisation.Rmd @@ -10,7 +10,7 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objectives -- Learn about spreadsheets, their strengths and weaknesses. +- Aprenda sobre planilhas, seus pontos fortes e fracos. - How do we format data in spreadsheets for effective data use? - Learn about common spreadsheet errors and how to correct them. - Organise your data according to tidy data principles. 
From ce4dc9d77afbf0d003446eb7884a2a5cc6eb6f18 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 29 Dec 2023 10:09:08 +0900 Subject: [PATCH 098/334] New translations 20-r-rstudio.md (Portuguese) --- locale/pt/episodes/20-r-rstudio.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/pt/episodes/20-r-rstudio.Rmd b/locale/pt/episodes/20-r-rstudio.Rmd index 70f6bc396..b143508b9 100644 --- a/locale/pt/episodes/20-r-rstudio.Rmd +++ b/locale/pt/episodes/20-r-rstudio.Rmd @@ -8,7 +8,7 @@ exercises: 0 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: Objetivos + - Descreva a finalidade dos painéis do RStudio: Script, Console, Environment e Plots. - Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. From 0a0103a4a19c12a36a3ac893543210433e4fba28 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 11 Jan 2024 12:38:30 +0900 Subject: [PATCH 099/334] New translations 10-data-organisation.md (Japanese) --- locale/ja/episodes/10-data-organisation.Rmd | 1277 +++++++++---------- 1 file changed, 634 insertions(+), 643 deletions(-) diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd index b12c852cf..7fb6177a4 100644 --- a/locale/ja/episodes/10-data-organisation.Rmd +++ b/locale/ja/episodes/10-data-organisation.Rmd @@ -1,6 +1,6 @@ --- -source: Rmd -title: Data organisation with spreadsheets +source: RMD +title: スプレッドシートを使用したデータ整理 teaching: 30 exercises: 30 --- @@ -10,290 +10,286 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: 目的 -- Learn about spreadsheets, their strengths and weaknesses. -- How do we format data in spreadsheets for effective data use? -- Learn about common spreadsheet errors and how to correct them. -- Organise your data according to tidy data principles. 
-- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. +- スプレッドシートとその長所と短所について学びます。 +- データを効果的に使用するには、スプレッドシート内のデータをどのようにフォーマットすればよいでしょうか? +- 一般的なスプレッドシートのエラーとその修正方法について説明します。 +- きちんとしたデータの原則に従ってデータを整理します。 +- カンマ区切り (CSV) 形式やタブ区切り (TSV) 形式などのテキストベースのスプレッドシート形式について説明します。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- How to organise tabular data? +- 表形式のデータを整理するにはどうすればよいですか? :::::::::::::::::::::::::::::::::::::::::::::::::: -## Spreadsheet programs +## 表計算プログラム -**Question** +**質問** - 優れたデータ統合用にスプレッドシートを使用するための基本的な原則は何でしょうか? -**Objective** +**客観的** -- Describe best practices for organizing data so computers can make - the best use of datasets. +- コンピューターがデータセットを + に活用できるようにデータを整理するためのベスト プラクティスについて説明します。 -**Keypoint** +**キーポイント** -- Good data organization is the foundation of any research project. +- 適切なデータ構成は、あらゆる研究プロジェクトの基礎です。 -Good data organization is the foundation of your research -project. Most researchers have data or do data entry in -spreadsheets. Spreadsheet programs are very useful graphical -interfaces for designing data tables and handling very basic data -quality control functions. See also @Broman:2018. +適切なデータ構成は、研究 +の基礎です。 ほとんどの研究者はデータを持っているか、 +シートにデータ入力を行っていません。 スプレッドシート プログラムは、データ テーブルを設計し、非常に基本的 +データ品質管理機能を処理するための非常に便利な +インターフェイスです。 @Broman:2018 も参照してください。 -### Spreadsheet outline +### スプレッドシートの概要 -Spreadsheets are good for data entry. Therefore we have a lot of data -in spreadsheets. Much of your time as a researcher will be spent in -this 'data wrangling' stage. It's not the most fun, but it's -necessary. We'll teach you how to think about data organization and -some practices for more effective data wrangling. 
+スプレッドシートはデータ入力に適しています。 したがって、スプレッドシートにはデータ +がたくさんあります。 研究者としての時間の多くは、この「データの検討」段階 +費やされることになります。 とても楽しいわけではありませんが、必要性は +です。 データの編成について考える方法と、より効果的なデータ ラングリングのための +かの実践方法を説明します。 -### What this lesson will not teach you +### このレッスンで教えられないこと -- How to do _statistics_ in a spreadsheet -- How to do _plotting_ in a spreadsheet -- How to _write code_ in spreadsheet programs +- スプレッドシートで _統計_ を行う方法 +- スプレッドシートで _プロット_ を行う方法 +- スプレッドシート プログラムで _コードを記述する_方法 -If you're looking to do this, a good reference is Head First -Excel, -published by O'Reilly. +これを実行したい場合は、O 発行の Head First +Excel +参考になります。 「ライリー。 -### Why aren't we teaching data analysis in spreadsheets +### なぜスプレッドシートでのデータ分析を教えないのか -- Data analysis in spreadsheets usually requires a lot of manual - work. If you want to change a parameter or run an analysis with a - new dataset, you usually have to redo everything by hand. (We do - know that you can create macros, but see the next point.) +- スプレッドシートでのデータ分析には通常、多くの + 作業が必要です。 パラメーターを変更したり、 + データセットを使用して分析を実行したりする場合は、通常、すべてを手動でやり直す必要があります。 (マクロを作成できることはわかりませ + が、次の点を参照してください。) -- It is also difficult to track or reproduce statistical or plotting - analyses done in spreadsheet programs when you want to go back to - your work or someone asks for details of your analysis. +- また、 + の作業に戻りたいときや、誰かが分析の詳細を尋ねたときに、スプレッドシート プログラムで行われた統計分析やプロット分析を追跡したり再現したりすること + 困難です。 -Many spreadsheet programs are available. Since most participants -utilise Excel as their primary spreadsheet program, this lesson will -make use of Excel examples. A free spreadsheet program that can also -be used is LibreOffice. Commands may differ a bit between programs, -but the general idea is the same. +多くの表計算プログラムが利用可能です。 ほとんどの参加者は主なスプレッドシート プログラムとして +を使用するため、このレッスンで +Excel の例を使用します。 +で使用できる表計算プログラムは LibreOffice です。 コマンドはプログラム間 +少し異なる場合がありますが、一般的な考え方は同じです。 -Spreadsheet programs encompass a lot of the things we need to be able -to do as researchers. 
We can use them for: +スプレッドシート プログラムには、研究者としてできる +にする必要のある多くのことが含まれています。 それらは次の目的で使用できます。 -- Data entry -- Organizing data -- Subsetting and sorting data -- Statistics -- Plotting +- データ入力 +- データの整理 +- データのサブセット化と並べ替え +- 統計 +- プロット -Spreadsheet programs use tables to represent and display data. Data -formatted as tables is also the main theme of this chapter, and we -will see how to organise data into tables in a standardised way to -ensure efficient downstream analysis. +スプレッドシート プログラムはテーブルを使用してデータを表し、表示します。 テーブルとしてフォーマットされたデータ +この章の主要テーマであり、効率的なダウンストリーム分析を +にするために、標準化された方法でデータをテーブルに編成する方法 +見ていきます。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Discuss the following points with your neighbour +## 課題: 隣人と次の点について話し合ってください。 -- Have you used spreadsheets, in your research, courses, - or at home? -- What kind of operations do you do in spreadsheets? -- Which ones do you think spreadsheets are good for? -- Have you accidentally done something in a spreadsheet program that made you - frustrated or sad? +- 研究や + 、または自宅でスプレッドシートを使用したことがありますか? +- スプレッドシートではどのような操作を行っていますか? +- スプレッドシートは何に適していると思いますか? +- スプレッドシート プログラムで誤ってイライラ + たり悲しくなったりするようなことをしたことがありますか? :::::::::::::::::::::::::::::::::::::::::::::::::: -### Problems with spreadsheets +### スプレッドシートの問題 -Spreadsheets are good for data entry, but in reality we tend to -use spreadsheet programs for much more than data entry. We use them -to create data tables for publications, to generate summary -statistics, and make figures. +スプレッドシートはデータ入力には適していますが、実際にはデータ入力以外の目的で +シート プログラムを使用する傾向があります。 これら +を使用して、出版物のデータ テーブルを作成し、概要 +統計を生成し、図を作成します。 -Generating tables for publications in a spreadsheet is not -optimal - often, when formatting a data table for publication, we're -reporting key summary statistics in a way that is not really meant to -be read as data, and often involves special formatting -(merging cells, creating borders, making it pretty). 
We advise you to -do this sort of operation within your document editing software. +スプレッドシートでパブリケーション用のテーブルを生成することは +ではありません。多くの場合、パブリケーション用にデータ テーブルをフォーマットするとき、実際にはデータとして読み取ら +ことを意図していない方法で重要な概要統計をレポートすることに +、特殊なフォーマットが必要になることがよくあります。 +(セルを結合し、境界線を作成し、美しくする)。 この種の操作は文書編集ソフトウェア内で行う +をお勧めします。 -The latter two applications, generating statistics and figures, should -be used with caution: because of the graphical, drag and drop nature of -spreadsheet programs, it can be very difficult, if not impossible, to -replicate your steps (much less retrace anyone else's), particularly if your -stats or figures require you to do more complex calculations. Furthermore, -in doing calculations in a spreadsheet, it's easy to accidentally apply a -slightly different formula to multiple adjacent cells. When using a -command-line based statistics program like R or SAS, it's practically -impossible to apply a calculation to one observation in your -dataset but not another unless you're doing it on purpose. +統計と数値を生成する後者の 2 つのアプリケーションは、 +して使用する必要があります。1 スプレッド +プログラムのグラフィカルなドラッグ アンド ドロップの性質のため、 +手順を複製するのが不可能ではないにしても、非常に困難になる可能性があります (元に戻すことはさらに困難です)。特に +の統計や数値により、より複雑な計算が必要な場合はそうです。 さらに、スプレッドシートで計算を行う場合、 +わずかに異なる数式を +の隣接するセルに誤って適用してしまうことがよくあります。 R や SAS などのコマンドライン ベースの統計プログラムを使用する場合、意図的に実行しない限り、データセット内の観測値には計算を適用し、別の観測値には計算を適用しないことは事実上不可能です。 -### Using spreadsheets for data entry and cleaning +### データ入力とクリーニングにスプレッドシートを使用する -In this lesson, we will assume that you are most likely using Excel as -your primary spreadsheet program - there are others (gnumeric, Calc -from OpenOffice), and their functionality is similar, but Excel seems -to be the program most used by biologists and biomedical researchers. +このレッスンでは、 +なスプレッドシート プログラムとして Excel を使用している可能性が高いと仮定します。他にも (OpenOffice の gnumeric、Calc +) があり、機能は似ていますが、Excel が +よく使用されているプログラムであると思われます。生物学者や生物医学の研究者。 -In this lesson we're going to talk about: +このレッスンでは次のことについて話します。 -1. Formatting data tables in spreadsheets -2. Formatting problems -3. 
Exporting data +1. スプレッドシートでのデータテーブルの書式設定 +2. フォーマットの問題 +3. データのエクスポート -## Formatting data tables in spreadsheets +## スプレッドシートでのデータテーブルの書式設定 -**Questions** +**質問** -- How do we format data in spreadsheets for effective data use? +- データを効果的に使用するには、スプレッドシート内のデータをどのようにフォーマットすればよいでしょうか? -**Objectives** +**目的** -- Describe best practices for data entry and formatting in - spreadsheets. +- シートでのデータ入力と書式設定のベスト プラクティスについて説明します。 -- Apply best practices to arrange variables and observations in a - spreadsheet. +- ベスト プラクティスを適用して、変数と観測値を + シートに配置します。 -**Keypoints** +**キーポイント** -- Never modify your raw data. Always make a copy before making any - changes. +- 生データは決して変更しないでください。 + を加える前に必ずコピーを作成してください。 -- Keep track of all of the steps you take to clean your data in a - plain text file. +- データをクリーンアップするために実行したすべての手順を + テキスト ファイルに記録します。 -- Organise your data according to tidy data principles. +- きちんとしたデータの原則に従ってデータを整理します。 -The most common mistake made is treating spreadsheet programs like lab -notebooks, that is, relying on context, notes in the margin, spatial -layout of data and fields to convey information. As humans, we can -(usually) interpret these things, but computers don't view information -the same way, and unless we explain to the computer what every single -thing means (and that can be hard!), it will not be able to see how -our data fits together. +最もよくある間違いは、スプレッドシート プログラムを研究室の +ブックのように扱うことです。つまり、情報を伝えるためにコンテキスト、余白のメモ、データとフィールドの空間 +レイアウトに依存していることです。 人間として、これらのことを (通常は) 解釈できますが、コンピューターは情報 +同じようには見ません。そして、すべての +の意味をコンピューターに説明しない限り (それ +難しい場合があります!)、理解できません。 +データがどのように組み合わされるかを確認できます。 -Using the power of computers, we can manage and analyse data in much -more effective and faster ways, but to use that power, we have to set -up our data for the computer to be able to understand it (and -computers are very literal). 
+コンピューターの力を利用すると、 +効果的かつ高速な方法でデータを管理および分析できますが、その力を使用するには、コンピューターが理解できるようにデータを +する必要があります ( +コンピューターは非常に複雑です)。リテラル)。 -This is why it's extremely important to set up well-formatted tables -from the outset - before you even start entering data from your very -first preliminary experiment. Data organization is the foundation of -your research project. It can make it easier or harder to work with -your data throughout your analysis, so it's worth thinking about when -you're doing your data entry or setting up your experiment. You can -set things up in different ways in spreadsheets, but some of these -choices can limit your ability to work with the data in other programs -or have the you-of-6-months-from-now or your collaborator work with -the data. +このため、 +の予備実験からデータの入力を開始する前に、適切にフォーマットされた +をセットアップすることが非常に重要です。 データの整理は研究プロジェクトの +です。 分析全体を通じて +の操作が容易になるか困難になる可能性があるため、データ入力を行う +や実験を設定するときに考慮する価値があります。 +スプレッドシートではさまざまな方法で設定できますが、これらの +選択の一部によっては、 +のプログラムでデータを操作する能力が制限されたり、6 か月後の自分や共同作業者が共同作業したりすることが制限される可能性があります。 +データ。 -**Note:** the best layouts/formats (as well as software and -interfaces) for data entry and data analysis might be different. It is -important to take this into account, and ideally automate the -conversion from one to another. +**注:** データ入力とデータ分析に最適なレイアウト/形式 (およびソフトウェアと +) は異なる場合があります。 これを考慮し、理想的にはあるものから +のものへの変換を自動化することが +です。 -### Keeping track of your analyses +### 分析を追跡する -When you're working with spreadsheets, during data clean up or -analyses, it's very easy to end up with a spreadsheet that looks very -different from the one you started with. In order to be able to -reproduce your analyses or figure out what you did when a reviewer or -instructor asks for a different analysis, you should +スプレッドシートを使用しているとき、データのクリーンアップ +分析を行っているときに、最初のスプレッドシートとは +異なる外観のスプレッドシートが完成することがよくあります。 分析を +したり、査読者や講師が別の分析を要求したときに何をしたかを把握したりするには、次のこと +行う必要があります。 -- create a new file with your cleaned or analysed data. 
Don't modify - the original dataset, or you will never know where you started! +- クリーンアップまたは分析されたデータを含む新しいファイルを作成します。 元のデータセット + 変更しないでください。変更すると、どこから始めたのかわからなくなります。 -- keep track of the steps you took in your clean up or analysis. You - should track these steps as you would any step in an experiment. We - recommend that you do this in a plain text file stored in the same - folder as the data file. +- クリーンアップまたは分析で実行した手順を記録します。 実験の他のステップと同様に、これらのステップを追跡する必要があり + 。 データ ファイルと同じフォルダーに保存された + テキスト ファイルでこれを行うことをお勧め + ます。 -This might be an example of a spreadsheet setup: +これはスプレッドシート設定の例です。 ![](fig/spreadsheet-setup-updated.png) -Put these principles in to practice today during your exercises. +今日の演習中にこれらの原則を実践してください。 -While versioning is out of scope for this course, you can look at the -Carpentries lesson on -['Git'](https://swcarpentry.github.io/git-novice/) to learn how to -maintain **version control** over your data. See also this blog -post for a quick tutorial or -@Perez-Riverol:2016 for a more research-oriented use-case. +バージョン管理はこのコースの範囲外ですが、**バージョンを維持 +方法については、 +['Git'](https://swcarpentry.github.io/git-novice/) の +Carpentries レッスンを参照してください。データを制御**します。 簡単なチュートリアルについてはこの ブログ +投稿 を、より研究指向のユースケースについては +@Perez-Riverol:2016 も参照してください。 -### Structuring data in spreadsheets +### スプレッドシートでのデータの構造化 -The cardinal rules of using spreadsheet programs for data: +データにスプレッドシート プログラムを使用する際の基本ルールは次のとおりです。 -1. Put all your variables in columns - the thing you're measuring, - like 'weight' or 'temperature'. -2. Put each observation in its own row. -3. Don't combine multiple pieces of information in one cell. Sometimes - it just seems like one thing, but think if that's the only way - you'll want to be able to use or sort that data. -4. Leave the raw data raw - don't change it! -5. Export the cleaned data to a text-based format like CSV - (comma-separated values) format. This ensures that anyone can use - the data, and is required by most data repositories. +1. 
すべての変数を列に入力します。測定対象は「重量」や「温度」 + です。 +2. 各観測値を独自の行に配置します。 +3. 1 つのセルに複数の情報を組み合わせないでください。 場合によっては + それは単なる + つのことのように思えますが、それがそのデータを使用または並べ替えできるようにする唯一の方法であるかどうかを考えてください。 +4. 生データはそのままにしておきます。変更しないでください。 +5. クリーンアップされたデータを CSV + (カンマ区切り値) 形式などのテキストベースの形式にエクスポートします。 これにより、誰でも + を使用できるようになり、ほとんどのデータ リポジトリで必要になります。 -For instance, we have data from patients that visited several -hospitals in Brussels, Belgium. They recorded the date of the visit, -the hospital, the patients' gender, weight and blood group. +たとえば、ベルギーのブリュッセルにあるいくつか +病院を訪れた患者からのデータがあります。 彼らは、訪問日、 +、患者の性別、体重、血液型を記録しました。 -If we were to keep track of the data like this: +次のようにデータを追跡するとします。 ![](fig/multiple-info.png) -the problem is that the ABO and Rhesus groups are in the same `Blood` -type column. So, if they wanted to look at all observations of the A -group or look at weight distributions by ABO group, it would be tricky -to do this using this data setup. If instead we put the ABO and Rhesus -groups in different columns, you can see that it would be much easier. +問題は、ABO グループと Rhesus グループが同じ「Blood」 +タイプ列にあることです。 したがって、A +グループのすべての観測値を調べたり、ABO グループごとの重み分布を調べたりしたい場合、このデータ設定を使用してこれを行うのは難しいでしょう +。 代わりに、ABO グループと Rhesus +グループを別の列に配置すると、はるかに簡単になることがわかります。 ![](fig/single-info.png) -An important rule when setting up a datasheet, is that **columns are -used for variables** and **rows are used for observations**: +データシートを設定する際の重要なルールは、**列は変数に +され**、**行は観測に使用される**ということです。 -- columns are variables -- rows are observations -- cells are individual values +- 列は変数です +- 行は観測結果です +- セルは個別の値です ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: We're going to take a messy dataset and describe how we would clean it up. +## 課題: 乱雑なデータセットを取り上げ、それをクリーンアップする方法を説明します。 -1. Download a messy dataset by clicking - [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). +1. [ここ](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx) + をクリックして、乱雑なデータセットをダウンロードします。 -2. 
Open up the data in a spreadsheet program. +2. スプレッドシート プログラムでデータを開きます。 -3. You can see that there are two tabs. The data contains various - clinical variables recorded in various hospitals in Brussels during - the first and second COVID-19 waves in 2020. As you can see, the - data have been recorded differently during the March and November - waves. Now you're the person in charge of this project and you want - to be able to start analyzing the data. +3. タブが 2 つあることがわかります。 このデータには、2020 年の新型コロナウイルス感染症 (COVID-19) の第 + 波と第 2 波の間にブリュッセルのさまざまな病院で記録されたさまざま + 臨床変数が含まれています。 ご覧のとおり、 + データは 3 月と 11 月 + の波では異なる方法で記録されています。 あなたはこのプロジェクトの責任者となり、 + データの分析を開始できるようにしたいと考えています。 -4. With the person next to you, identify what is wrong with this - spreadsheet. Also discuss the steps you would need to take to clean - up first and second wave tabs, and to put them all together in one - spreadsheet. +4. 隣にいる人と一緒に、この + スプレッドシートのどこが間違っているのかを特定してください。 また、最初と 2 番目の Wave タブをクリーンアップし、それらをすべて + つのスプレッドシートにまとめるために必要な手順について + 説明します。 -**Important:** Do not forget our first piece of advice: to create a -new file (or tab) for the cleaned data, never modify your original -(raw) data. +**重要:** 最初のアドバイスを忘れないでください。クリーンアップされたデータ用に +ファイル (またはタブ) を作成する場合は、元の +(生の) データを決して変更しないでください。 :::::::::::::::::::::::::::::::::::::::::::::::::: -After you go through this exercise, we'll discuss as a group what was -wrong with this data and how you would fix it. +この演習を終えた後、このデータの +が間違っていたのか、そしてそれをどのように修正するのかをグループで話し合います。 <!-- - Take about 10 minutes to work on this exercise. --> @@ -315,45 +311,45 @@ wrong with this data and how you would fix it. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Once you have tidied up the data, answer the following questions: +## 課題: データを整理したら、次の質問に答えてください。 -- How many men and women took part in the study? -- How many A, AB, and B types have been tested? -- As above, but disregarding the contaminated samples? -- How many Rhesus + and - have been tested? 
-- How many universal donors (O-) have been tested? -- What is the average weight of AB men? -- How many samples have been tested in the different hospitals? +- 何人の男性と女性が研究に参加しましたか? +- A、AB、B タイプは何人検査されましたか? +- 上記と同様ですが、汚染されたサンプルは無視しますか? +- アカゲザル + と - は何人検査されましたか? +- 何人のユニバーサルドナー (O-) が検査されましたか? +- AB型男性の平均体重はどれくらい? +- さまざまな病院で何件のサンプルが検査されましたか? :::::::::::::::::::::::::::::::::::::::::::::::::: -An **excellent reference**, in particular with regard to R scripting -is the _Tidy Data_ paper @Wickham:2014. +特に R スクリプト +に関する **優れた参考文献**は、_Tidy Data_ 論文 @Wickham:2014 です。 -## Common spreadsheet errors +## よくあるスプレッドシートのエラー -**Questions** +**質問** -- What are some common challenges with formatting data in spreadsheets - and how can we avoid them? +- スプレッドシート + のデータの書式設定に関する一般的な課題は何ですか?また、それらを回避するにはどうすればよいですか? -**Objectives** +**目的** -- Recognise and resolve common spreadsheet formatting problems. +- 一般的なスプレッドシートの書式設定の問題を認識して解決します。 -**Keypoints** +**キーポイント** -- Avoid using multiple tables within one spreadsheet. -- Avoid spreading data across multiple tabs. -- Record zeros as zeros. -- Use an appropriate null value to record missing data. -- Don't use formatting to convey information or to make your spreadsheet look pretty. -- Place comments in a separate column. -- Record units in column headers. -- Include only one piece of information in a cell. -- Avoid spaces, numbers and special characters in column headers. -- Avoid special characters in your data. -- Record metadata in a separate plain text file. +- 1 つのスプレッドシート内で複数のテーブルを使用しないでください。 +- データが複数のタブに分散しないようにします。 +- ゼロはゼロとして記録します。 +- 欠落データを記録するには、適切な null 値を使用します。 +- 情報を伝えたり、スプレッドシートを美しく見せるために書式設定を使用しないでください。 +- コメントは別の列に配置します。 +- 列ヘッダーに単位を記録します。 +- セルには 1 つの情報のみを含めます。 +- 列ヘッダーにはスペース、数字、特殊文字を使用しないでください。 +- データ内では特殊文字を避けてください。 +- メタデータを別のプレーン テキスト ファイルに記録します。 <!-- This lesson is meant to be used as a reference for discussion as --> @@ -363,466 +359,461 @@ is the _Tidy Data_ paper @Wickham:2014. 
<!-- refer to responses to the exercise in the previous lesson. --> -There are a few potential errors to be on the lookout for in your own -data as well as data from collaborators or the Internet. If you are -aware of the errors and the possible negative effect on downstream -data analysis and result interpretation, it might motivate yourself -and your project members to try and avoid them. Making small changes -to the way you format your data in spreadsheets, can have a great -impact on efficiency and reliability when it comes to data cleaning -and analysis. - -- [Using multiple tables](#tables) -- [Using multiple tabs](#tabs) -- [Not filling in zeros](#zeros) -- [Using problematic null values](#null) -- [Using formatting to convey information](#formatting) -- [Using formatting to make the data sheet look pretty](#formatting_pretty) -- [Placing comments or units in cells](#units) -- [Entering more than one piece of information in a cell](#info) -- [Using problematic field names](#field_name) -- [Using special characters in data](#special) -- [Inclusion of metadata in data table](#metadata) - -### Using multiple tables {#tables} - -A common strategy is creating multiple data tables within one -spreadsheet. This confuses the computer, so don't do this! When you -create multiple tables within one spreadsheet, you're drawing false -associations between things for the computer, which sees each row as -an observation. You're also potentially using the same field name in -multiple places, which will make it harder to clean your data up into -a usable form. 
The example below depicts the problem: +自分自身のデータ +でなく、共同作業者やインターネットからのデータにも、注意すべき潜在的なエラーがいくつかあります。 エラーや、下流 +データ分析 +結果の解釈に悪影響が及ぶ可能性があることを認識 +ていれば、自分やプロジェクト メンバーがエラーを回避しようとする動機になるかもしれません。 スプレッドシートでデータをフォーマットする方法に +変更を加えると、データのクリーニング +と分析の効率と信頼性に大きな影響 +を与える可能性があります。 + +- [複数のテーブルの使用](#tables) +- [複数のタブの使用](#tabs) +- [ゼロを埋めない](#zeros) +- [問題のある null 値の使用](#null) +- [情報を伝えるために書式設定を使用する](#formatting) +- [書式設定を使用してデータシートを美しく見せる](#formatting_pretty) +- [セル内にコメントまたはユニットを配置する](#units) +- [セルに複数の情報を入力する](#info) +- [問題のあるフィールド名の使用](#field_name) +- [データ内での特殊文字の使用](#special) +- [データテーブルへのメタデータの組み込み](#metadata) + +### 複数のテーブルの使用 {#tables} + +一般的な戦略は、1 つ +スプレッドシート内に複数のデータ テーブルを作成することです。 これはコンピュータを混乱させるので、行わないでください。\ +1 つのスプレッドシート内に複数のテーブルを作成すると、コンピュータにとっては、各行を観測 +として認識するため、物事の間に +た関連付けが描画されることになります。 また、同じフィールド名を +の場所で使用している可能性があり、データを使用可能な形式に +アップすることが困難になります。 以下の例は問題を示しています。 ![](fig/2_datasheet_example.jpg) -In the example above, the computer will see (for example) row 4 and -assume that all columns A-AF refer to the same sample. This row -actually represents four distinct samples (sample 1 for each of four -different collection dates - May 29th, June 12th, June 19th, and June -26th), as well as some calculated summary statistics (an average (avr) -and standard error of measurement (SEM)) for two of those -samples. Other rows are similarly problematic. - -### Using multiple tabs {#tabs} - -But what about workbook tabs? That seems like an easy way to organise -data, right? Well, yes and no. When you create extra tabs, you fail to -allow the computer to see connections in the data that are there (you -have to introduce spreadsheet application-specific functions or -scripting to ensure this connection). Say, for instance, you make a -separate tab for each day you take a measurement. - -This isn't good practice for two reasons: - -1. 
you are more likely to accidentally add inconsistencies to your - data if each time you take a measurement, you start recording data - in a new tab, and - -2. even if you manage to prevent all inconsistencies from creeping in, - you will add an extra step for yourself before you analyse the data - because you will have to combine these data into a single - datatable. You will have to explicitly tell the computer how to - combine tabs - and if the tabs are inconsistently formatted, you - might even have to do it manually. - -The next time you're entering data, and you go to create another tab -or table, ask yourself if you could avoid adding this tab by adding -another column to your original spreadsheet. We used multiple tabs in -our example of a messy data file, but now you've seen how you can -reorganise your data to consolidate across tabs. - -Your data sheet might get very long over the course of the -experiment. This makes it harder to enter data if you can't see your -headers at the top of the spreadsheet. But don't repeat your header -row. These can easily get mixed into the data, leading to problems -down the road. Instead you can freeze the column -headers -so that they remain visible even when you have a spreadsheet with many -rows. - -### Not filling in zeros {#zeros} - -It might be that when you're measuring something, it's usually a zero, -say the number of times a rabbit is observed in the survey. Why bother -writing in the number zero in that column, when it's mostly zeros? - -However, there's a difference between a zero and a blank cell in a -spreadsheet. To the computer, a zero is actually data. You measured or -counted it. A blank cell means that it wasn't measured and the -computer will interpret it as an unknown value (also known as a null -or missing value). - -The spreadsheets or statistical programs will likely misinterpret -blank cells that you intend to be zeros. 
By not entering the value of -your observation, you are telling your computer to represent that data -as unknown or missing (null). This can cause problems with subsequent -calculations or analyses. For example, the average of a set of numbers -which includes a single null value is always null (because the -computer can't guess the value of the missing observations). Because -of this, it's very important to record zeros as zeros and truly -missing data as nulls. - -### Using problematic null values {#null} - -**Example**: using -999 or other numerical values (or zero) to -represent missing data. - -**Solutions**: - -There are a few reasons why null values get represented differently -within a dataset. Sometimes confusing null values are automatically -recorded from the measuring device. If that's the case, there's not -much you can do, but it can be addressed in data cleaning with a tool -like +上の例では、コンピュータは、(たとえば) 行 4 と +、すべての列 A ~ A F が同じサンプルを参照しているとみなして表示します。 この行 +、実際には 4 つの異なるサンプル (5 月 29 日、6 月 12 日、6 月 19 日、および 6 月 +26 日の +つの異なる収集日のそれぞれのサンプル 1) と、計算されたいくつかの概要統計 (平均 (avr) +およびこれら +のサンプルのうち 2 つの標準測定誤差 (SEM))。 他の行にも同様に問題があります。 + +### 複数のタブの使用 {#tabs} + +しかし、ワークブックのタブはどうでしょうか? +データを整理する簡単な方法のように思えますよね? まあ、はい、いいえです。 追加のタブを作成すると、そこにあるデータの接続をコンピュータに認識させることができなく +ます (この接続を確保するには、 +スプレッドシート アプリケーション固有の関数を導入するか、 +スクリプトを導入する必要があります)。 たとえば、測定を行う日ごとに +のタブを作成するとします。 + +これは次の 2 つの理由から良い習慣ではありません。 + +1. 測定を行うたびに新しいタブでデータ + の記録を開始すると、誤って + データに不一致が追加される可能性が高くなります。 + +2. 
たとえすべての不一致が忍び寄るのを防ぐことができたとしても、 + これらのデータを単一 + データテーブルに結合する必要があるため、 + データを分析する前に余分な手順を追加することになります。 タブ + 結合する方法をコンピュータに明示的に指示する必要があります。また、タブの形式が一貫していない場合は、 + で結合する必要がある場合もあります。 + +次回データを入力するときに、別のタブ +またはテーブルを作成するときは、元のスプレッドシートに別の列を +追加することで、このタブの追加を回避できるかどうか自問してください。 乱雑なデータ ファイルの +では複数のタブを使用しましたが、データを +編成してタブ間で統合する方法がわかりました。 + +の過程でデータシートが非常に長くなる可能性があります。 これにより、スプレッドシートの上部に +ヘッダーが表示されない場合、データの入力が困難になります。 ただし、ヘッダー +行を繰り返さないでください。 これらは簡単にデータに混入し、 +的に問題が発生する可能性があります。 代わりに、列 +ヘッダーを固定する +ことができます。これにより、多くの +行を含むスプレッドシートがある場合でも、それらの行が表示されたままになります。 + +### ゼロを埋めない {#zeros} + +何かを測定するとき、調査でウサギが観察された回数は通常 +である可能性があります。 その列にはほとんどゼロがあるのに、なぜわざわざ +という数字のゼロを書き込むのでしょうか? + +ただし、 +スプレッドシートのゼロと空白のセルには違いがあります。 コンピューターにとって、ゼロは実際にはデータです。 あなたが測ったか、 +なかった。 空白のセルは、測定されていないことを意味し、 +はそれを未知の値 (ヌル +または欠損値とも呼ばれます) として解釈します。 + +スプレッドシートや統計プログラムでは、ゼロであるつもりの +の空白セルが誤って解釈される可能性があります。 観測値 +入力しないことにより、そのデータ +を不明または欠落 (null) として表すようにコンピュータに指示することになります。 これにより、後続の +の計算または分析で問題が発生する可能性があります。 たとえば、単一の null 値を含む一連の数値 +の平均は常に null です ( +は欠落している観測値を推測できないため)。 このうち +であるため、ゼロをゼロとして記録し、真に +欠損データをヌルとして記録することが非常に重要です。 + +### 問題のある null 値 {#null}の使用 + +**例**: -999 またはその他の数値 (またはゼロ) を +に使用すると、欠損データを表します。 + +**解決策**: + +データセット内で null 値が異なる +で表現される理由はいくつかあります。 紛らわしいヌル値が測定装置から自動的に +として記録される場合があります。 その場合、できることは +ではありませんが、ツール +を使用してデータ クリーニングで対処できます 1 [OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) -before analysis. Other times different null values are used to convey -different reasons why the data isn't there. This is important -information to capture, but is in effect using one column to capture -two pieces of information. Like for using formatting to convey -information it would be good here to create a new -column like 'data_missing' and use that column to capture the -different reasons. - -Whatever the reason, it's a problem if unknown or missing data is -recorded as -999, 999, or 0. - -Many statistical programs will not recognise that these are intended -to represent missing (null) values. 
How these values are interpreted -will depend on the software you use to analyse your data. It is -essential to use a clearly defined and consistent null indicator. - -Blanks (most applications) and NA (for R) are good -choices. @White:2013 explain good choices for indicating null values -for different software applications in their article: +前分析。 また、データが存在しないさまざまな理由を伝えるために、 +な null 値が使用されることもあります。 これは取得すべき重要 +情報ですが、実際には 1 つの列を使用して +2 つの情報を取得することになります。 フォーマットを使用して +情報を伝える と同様に、ここでは「data_missing」のような新しい +列を作成し、その列を使用して +の異なる理由をキャプチャすると良いでしょう。 + +理由が何であれ、不明または欠落しているデータが +として -999、999、または 0 として記録されている場合は問題です。 + +多くの統計プログラムは、これらが欠損 (NULL) 値を表す +であることを認識しません。 これらの値 +がどのように解釈されるかは、データの分析に使用するソフトウェアによって異なります。 明確に定義された一貫性のある null インジケーターを使用することが +です。 + +空白 (ほとんどのアプリケーション) と NA (R の場合) が +選択肢として適しています。 @White:2013 は、記事の中で、さまざまなソフトウェア アプリケーションに対して null 値 +を示すための適切な選択肢について説明しています。 ![](fig/3_white_table_1.jpg) -### Using formatting to convey information {#formatting} +### フォーマットを使用して情報を伝える {#formatting} -**Example**: highlighting cells, rows or columns that should be -excluded from an analysis, leaving blank rows to indicate -separations in data. +**例**: 分析から +必要があるセル、行、または列を強調表示し、空白の行を残してデータの +分離を示します。 ![](fig/formatting.png) -**Solution**: create a new field to encode which data should be -excluded. +**解決策**: 新しいフィールドを作成して、 +データをエンコードします。 ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting_pretty} - -**Example**: merging cells. - -**Solution**: If you're not careful, formatting a worksheet to be more -aesthetically pleasing can compromise your computer's ability to see -associations in the data. Merged cells will make your data unreadable -by statistics software. Consider restructuring your data in such a way -that you will not need to merge cells to organise your data. 
- -### Placing comments or units in cells {#units} - -Most analysis software can't see Excel or LibreOffice comments, and -would be confused by comments placed within your data cells. As -described above for formatting, create another field if you need to -add notes to cells. Similarly, don't include units in cells: ideally, -all the measurements you place in one column should be in the same -unit, but if for some reason they aren't, create another field and -specify the units the cell is in. - -### Entering more than one piece of information in a cell {#info} - -**Example**: Recording ABO and Rhesus groups in one cell, such as A+, -B+, A-, ... - -**Solution**: Don't include more than one piece of information in a -cell. This will limit the ways in which you can analyse your data. If -you need both these measurements, design your data sheet to include -this information. For example, include one column for the ABO group and -one for the Rhesus group. - -### Using problematic field names {#field_name} - -Choose descriptive field names, but be careful not to include spaces, -numbers, or special characters of any kind. Spaces can be -misinterpreted by parsers that use whitespace as delimiters and some -programs don't like field names that are text strings that start with -numbers. - -Underscores (`_`) are a good alternative to spaces. Consider writing -names in camel case (like this: ExampleFileName) to improve -readability. Remember that abbreviations that make sense at the moment -may not be so obvious in 6 months, but don't overdo it with names that -are excessively long. Including the units in the field names avoids -confusion and enables others to readily interpret your fields. 
- -**Examples** - -| Good Name | Good Alternative | Avoid | -| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | -| Max_temp_C | MaxTemp | Maximum Temp (°C) | -| Precipitation_mm | Precipitation | precmm | -| Mean_year_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell_type | CellType | Cell Type | -| Observation_01 | first_observation | 1st Obs | - -### Using special characters in data {#special} - -**Example**: You treat your spreadsheet program as a word processor -when writing notes, for example copying data directly from Word or -other applications. - -**Solution**: This is a common strategy. For example, when writing -longer text in a cell, people often include line breaks, em-dashes, -etc. in their spreadsheet. Also, when copying data in from -applications such as Word, formatting and fancy non-standard -characters (such as left- and right-aligned quotation marks) are -included. When exporting this data into a coding/statistical -environment or into a relational database, dangerous things may occur, -such as lines being cut in half and encoding errors being thrown. - -General best practice is to avoid adding characters such as newlines, -tabs, and vertical tabs. In other words, treat a text cell as if it -were a simple web form that can only contain text and spaces. - -### Inclusion of metadata in data table {#metadata} - -**Example**: You add a legend at the top or bottom of your data table -explaining column meaning, units, exceptions, etc. - -**Solution**: Recording data about your data ("metadata") is -essential. 
You may be on intimate terms with your dataset while you -are collecting and analysing it, but the chances that you will still -remember that the variable "sglmemgp" means single member of group, -for example, or the exact algorithm you used to transform a variable -or create a derived one, after a few months, a year, or more are slim. - -As well, there are many reasons other people may want to examine or -use your data - to understand your findings, to verify your findings, -to review your submitted publication, to replicate your results, to -design a similar study, or even to archive your data for access and -re-use by others. While digital data by definition are -machine-readable, understanding their meaning is a job for human -beings. The importance of documenting your data during the collection -and analysis phase of your research cannot be overestimated, -especially if your research is going to be part of the scholarly -record. - -However, metadata should not be contained in the data file -itself. Unlike a table in a paper or a supplemental file, metadata (in -the form of legends) should not be included in a data file since this -information is not data, and including it can disrupt how computer -programs interpret your data file. Rather, metadata should be stored -as a separate file in the same directory as your data file, preferably -in plain text format with a name that clearly associates it with your -data file. Because metadata files are free text format, they also -allow you to encode comments, units, information about how null values -are encoded, etc. that are important to document but can disrupt the -formatting of your data file. - -Additionally, file or database level metadata describes how files that -make up the dataset relate to each other; what format they are in; and -whether they supercede or are superceded by previous files. A -folder-level readme.txt file is the classic way of accounting for all -the files and folders in a project. 
- -(Text on metadata adapted from the online course Research Data -[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, -University of Edinburgh. MANTRA is licensed under a Creative Commons -Attribution 4.0 International -License.) - -## Exporting data - -**Question** - -- How can we export data from spreadsheets in a way that is useful for - downstream applications? - -**Objectives** - -- Store spreadsheet data in universal file formats. -- Export data from a spreadsheet to a CSV file. - -**Keypoints** - -- Data stored in common spreadsheet formats will often not be read - correctly into data analysis software, introducing errors into your - data. - -- Exporting data from spreadsheets to formats like CSV or TSV puts it - in a format that can be used consistently by most programs. - -Storing the data you're going to work with for your analyses in Excel -default file format (`*.xls` or `*.xlsx` - depending on the Excel -version) isn't a good idea. Why? - -- Because it is a proprietary format, and it is possible that in the - future, technology won't exist (or will become sufficiently rare) to - make it inconvenient, if not impossible, to open the file. - -- Other spreadsheet software may not be able to open files saved in a - proprietary Excel format. - -- Different versions of Excel may handle data differently, leading to - inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) - is a well-documented example of inconsistencies in data storage. - -- Finally, more journals and grant agencies are requiring you to - deposit your data in a data repository, and most of them don't - accept Excel format. It needs to be in one of the formats discussed - below. - -- The above points also apply to other formats such as open data - formats used by LibreOffice / Open Office. These formats are not - static and do not get parsed the same way by different software - packages. 
- -Storing data in a universal, open, and static format will help deal -with this problem. Try tab-delimited (tab separated values or TSV) or -comma-delimited (comma separated values or CSV). CSV files are plain -text files where the columns are separated by commas, hence 'comma -separated values' or CSV. The advantage of a CSV file over an -Excel/SPSS/etc. file is that we can open and read a CSV file using -just about any software, including plain text editors like TextEdit or -NotePad. Data in a CSV file can also be easily imported into other -formats and environments, such as SQLite and R. We're not tied to a -certain version of a certain expensive program when we work with CSV -files, so it's a good format to work with for maximum portability and -endurance. Most spreadsheet programs can save to delimited text -formats like CSV easily, although they may give you a warning during -the file export. - -To save a file you have opened in Excel in CSV format: - -1. From the top menu select 'File' and 'Save as'. -2. In the 'Format' field, from the list, select 'Comma Separated - Values' (`*.csv`). -3. Double check the file name and the location where you want to save - it and hit 'Save'. - -An important note for backwards compatibility: you can open CSV files -in Excel! 
+### 書式設定を使用してデータシートを美しく見せる {#formatting_pretty} + +**例**: セルを結合します。 + +**解決策**: 注意しないと、ワークシートを +より美しく見えるように書式設定すると、データ内の +関連付けを認識するコンピュータの機能が損なわれる可能性があります。 セルを結合すると、統計ソフトウェアで +を読み取ることができなくなります。 データを整理する +にセルを結合する必要がないような方法でデータを再構築することを検討してください。 + +### セル {#units}にコメントまたはユニットを配置する + +ほとんどの分析ソフトウェアは Excel や LibreOffice のコメントを表示できないため、 +データ セル内に配置されたコメントによって混乱する可能性があります。 書式設定について +で説明したように、セルに +を追加する必要がある場合は、別のフィールドを作成します。 同様に、セルに単位を含めないでください。理想的には、1 つの列に配置する +の測定値が同じ +単位内にある必要がありますが、何らかの理由でそうでない場合は、別のフィールドを作成し、 +セルの単位を指定します。で。 + +### セル {#info}に複数の情報を入力する + +**例**: A+、 +B+、A- などの ABO グループとアカゲザル グループを 1 つのセルに記録する + +**解決策**: +セルに複数の情報を含めないでください。 これにより、データを分析できる方法が制限されます。\ +の場合、これらの測定値の両方が必要な場合は、 +この情報を含めるようにデータシートを設計します。 たとえば、ABO グループには 1 つの列を含め、Rhesus グループには +つの列を含めます。 + +### 問題のあるフィールド名 {#field_name} の使用 + +説明的なフィールド名を選択しますが、スペース、 +数字、またはいかなる種類の特殊文字も含めないように注意してください。 スペースは、空白を区切り文字として使用するパーサーによって +て解釈される可能性があり、一部の +プログラムは +数字で始まるテキスト文字列であるフィールド名を好みません。 + +アンダースコア (`_`) はスペースの代わりに使用できます。 +読みやすさを向上させるために、 +名前をキャメルケースで記述することを検討してください (例: ExampleFileName)。 現時点では意味のある略語 +も、6 か月後にはそれほど明確ではなくなる可能性があることに注意してください。ただし、 +が長すぎる名前を付けすぎないでください。 フィールド名に +を含めることで混乱が回避され、他の人がフィールドを簡単に解釈できるようになります。 + +**例** + +| いい名前 | 良い代替品 | 避ける | +| ------------------------------------------------- | --------------------------- | ---------------------------- | +| 最高_温度_C | 最大温度 | 最高温度 (°C) | +| 降水量_mm | 降水量 | プレcmm | +| 平均_年_成長 | 平均年成長 | 平均成長率/年 | +| セックス | セックス | 男/女 | +| 重さ | 重さ | w。 | +| セル_タイプ | セルタイプ | 細胞の種類 | +| 観察_01 | 最初の_観察 | 1回目の観測 | + +### データ {#special}での特殊文字の使用 + +**例**: たとえば、Word または +のアプリケーションからデータを直接コピーするなど、メモを書くときにスプレッドシート プログラムをワード プロセッサ +として扱います。 + +**解決策**: これは一般的な戦略です。 たとえば、セルに +の長いテキストを書き込む場合、スプレッドシートに改行、全角ダッシュ、 +などを含めることがよくあります。 また、Word など +アプリケーションからデータをコピーする場合、書式設定や派手な +標準文字 (左揃えと右揃えの引用符など) +が含まれます。 このデータをコーディング/ +環境またはリレーショナル データベースにエクスポートすると、行が半分に切断されたり、エンコード エラーが発生したりする +、危険なことが発生する可能性があります。 + +一般的なベスト プラクティスは、改行、 +タブ、垂直タブなどの文字の追加を避けることです。 言い換えれば、テキスト 
セルを、テキストとスペースのみを含めることができる単純な Web フォームで +かのように扱います。 + +### データテーブル {#metadata}へのメタデータの組み込み + +**例**: データ テーブル +の上部または下部に、列の意味、単位、例外などを説明する凡例を追加します。 + +**解決策**: データに関するデータ (「メタデータ」) を記録することは +ではありません。 データセットを +して分析している間は、データセットと親密な関係にあるかもしれませんが、変数「sglmemgp」がグループの単一のメンバー (たとえば +を意味すること、または以前に使用した正確なアルゴリズムを意味することをまだ覚えている可能性は +ありません。変数 +変換するか、派生変数を作成すると、数か月後、1 年後、またはそれ以上かかります。 + +また、他の人があなたのデータを調べたり、使用したりする理由はたくさんあります。あなたの発見を理解するため、 +を検証するため、 +提出された出版物をレビューするため、結果を再現するため、 +同様の研究を計画するため、さらには他の人がアクセスしたり +利用できるようにデータをアーカイブします。 デジタルデータは定義上、 +可読ではありませんが、その意味を理解することは +の仕事です。 研究の収集段階および分析段階でデータを文書化することの重要性は、特に研究が +記録の一部となる場合には +過大評価することはできません +。 + +ただし、データファイル +自体にはメタデータを含めないでください。 論文や補足ファイルの表とは異なり、メタデータ ( +形式) はデータ ファイルに含めるべきではありません。この情報はデータではなく、 +データを含めるとコンピューター プログラムがデータ ファイルを解釈する +が混乱する可能性があるためです。 むしろ、メタデータは、データ ファイルと同じディレクトリに別のファイルとして保存する必要があります。できればファイルと明確に関連付けられる名前を付けてプレーン テキスト形式で保存する必要があります。 メタデータ ファイルはフリー テキスト形式であるため、コメント、単位、 +値のエンコード方法に関する情報などをエンコードすることも +ます。これらの情報は文書化するには重要ですが、データ ファイルの +設定を混乱させる可能性があります。 + +さらに、ファイルまたはデータベース レベルのメタデータは、データセットを構成するファイルが相互に +ように関連するかを記述します。どのような形式であるか。 +は、以前のファイルに優先されるか、または以前のファイルによって置き換えられるか。 +フォルダー レベルの readme.txt ファイルは、プロジェクト内のすべて +ファイルとフォルダーを説明する古典的な方法です。 + +(メタデータに関するテキストは、EDINA および +大学データ ライブラリによるオンライン コース Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) から改変されました。 MANTRA は クリエイティブ コモンズ +表示 4.0 国際 +ライセンス に基づいてライセンスされています。 + +## データのエクスポート + +**質問** + +- ストリーム アプリケーションに役立つ方法でスプレッドシートからデータをエクスポートするにはどうすればよいでしょうか? + +**目的** + +- スプレッドシート データをユニバーサル ファイル形式で保存します。 +- スプレッドシートから CSV ファイルにデータをエクスポートします。 + +**キーポイント** + +- 一般的なスプレッドシート形式で保存されたデータは、データ分析ソフトウェアに + 読み込まれないことが多く、 + にエラーが生じます。 + +- スプレッドシートから CSV や TSV などの形式にデータをエクスポートすると、ほとんどのプログラムで一貫して使用できる形式でデータが + になります。 + +分析に使用するデータを Excel +既定のファイル形式 (Excel +バージョンに応じて `*.xls` または `*.xlsx`) で保存することはお勧めできません。 なぜ? 
+ +- これは独自の形式であり、将来 + 的にはファイルを開くことが不可能ではない + にしても不便になるほど、技術が存在しなくなる (または十分にまれになる) 可能性があるためです。 + +- 他の表計算ソフトウェアでは、独自 + の Excel 形式で保存されたファイルを開くことができない場合があります。 + +- Excel のバージョンが異なるとデータの処理方法が異なる場合があり、不 + 整合が発生する可能性があります。 [日付](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + は、データ ストレージにおける不整合の十分に文書化された例です。 + +- 最後に、データをデータ リポジトリに登録 + することを要求するジャーナルや補助金機関が増えています。また、そのほとんどは Excel 形式を受け入れませ + ん。 以下 + で説明する形式のいずれかである必要があります。 + +- 上記の点は、LibreOffice / Open Office で使用されるオープン データ + 形式などの他の形式にも当てはまります。 これらの形式は静的 + ではなく、異なる + ソフトウェア パッケージによって同じ方法で解析されません。 + +データを汎用的でオープンな静的形式で保存すると、この問題に対処 +するのに役立ちます。 タブ区切り (タブ区切り値または TSV) または +カンマ区切り (カンマ区切り値または CSV) を試してください。 CSV ファイルは、列がカンマで区切られたプレーン +テキスト ファイルです。したがって、「カンマ +で区切られた値」または CSV と呼ばれます。 Excel/SPSS/などのファイルと比較 +した CSV ファイルの利点は、TextEdit や NotePad +などのプレーン テキスト エディタを含む、ほぼすべてのソフトウェア +を使用して CSV ファイルを開いて読み取ることができるということです。 CSV ファイル内のデータは、SQLite や R などの他の +形式や環境にも簡単にインポートできます。CSV +ファイルを使用する場合、特定の高価なプログラムの特定 +のバージョンに縛られることがないので、最大限の移植性と耐久 +性を実現するのに適したフォーマットです。 ほとんどのスプレッドシート プログラムは CSV +などの区切りテキスト形式で簡単に保存できますが、ファイルのエクスポート中 +に警告が表示される場合があります。 + +Excel で開いたファイルを CSV 形式で保存するには: + +1. 上部のメニューから「ファイル」と「名前を付けて保存」を選択します。 +2. [形式] フィールドのリストから、[カンマ区切りの + 値] (`*.csv`) を選択します。 +3. ファイル名と保存 + する場所を再確認し、「保存」をクリックします。 + +下位互換性に関する重要な注意: CSV +ファイルは Excel で開くことができます。 ```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/excel-to-csv.png") ``` -**A note on R and `xls`**: There are R packages that can read `xls` -files (as well as Google spreadsheets). It is even possible to access -different worksheets in the `xls` documents. +**R と `xls` に関するメモ**: `xls` +ファイル (および Google スプレッドシート) を読み取ることができる R パッケージがあります。 `xls` ドキュメント内の異なる +ワークシートにアクセスすることも可能です。 -**But** +**しかし** -- some of these only work on Windows.
-- this equates to replacing a (simple but manual) export to `csv` with - additional complexity/dependencies in the data analysis R code. -- data formatting best practice still apply. -- Is there really a good reason why `csv` (or similar) is not - adequate? +- これらの中には Windows でのみ動作するものもあります。 +- これは、(単純だが手動の) `csv` へのエクスポートを、データ分析 R コードの追加の複雑さ/依存性 + で置き換えることに相当します。 +- データ形式のベスト プラクティスは引き続き適用されます。 +- `csv` (または類似のもの) + では不十分である正当な理由は本当にあるのでしょうか? -### Caveats on commas +### カンマに関する注意事項 -In some datasets, the data values themselves may include commas -(,). In that case, the software which you use (including Excel) will -most likely incorrectly display the data in columns. This is because -the commas which are a part of the data values will be interpreted as -delimiters. +一部のデータセットでは、データ値自体にカンマ +(,) が含まれる場合があります。 その場合、使用しているソフトウェア (Excel を含む) により、列内のデータが誤って表示される可能性が +あります。 これは、データ値の一部であるカンマ +が +区切り文字として解釈されるためです。 -For example, our data might look like this: +たとえば、データは次のようになります。 ``` species_id,genus,species,taxa AB,Amphispiza,bilineata,Bird AH,Ammospermophilus,harrisi,Rodent, not censused AS,Ammodramus,savannarum,Bird BA,Baiomys,taylori,Rodent ``` -In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the -value for `taxa` includes a comma (`Rodent, not censused`).
If we try -to read the above into Excel (or other spreadsheet program), we will -get something like this: +レコード `AH,Ammospermophilus,harrisi,Rodent, not censused` では、`taxa` の値 +にカンマが含まれています (`Rodent, not censused`)。 上記を Excel (または他のスプレッドシート プログラム) に読み込もう +とすると、 +次のような結果が得られます。 ```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("fig/csv-mistake.png") ``` -The value for `taxa` was split into two columns (instead of being put -in one column `D`). This can propagate to a number of further -errors. For example, the extra column will be interpreted as a column -with many missing values (and without a proper header). In addition to -that, the value in column `D` for the record in row 3 (so the one -where the value for 'taxa' contained the comma) is now incorrect. +`taxa` の値は (1 つの列 `D` に +入れられる代わりに) 2 つの列に分割されました。 これはさらに多数の +エラーに伝播する可能性があります。 たとえば、追加の列は、欠損値が多数ある (適切なヘッダーがない) 列 +として解釈されます。 +それに加えて、行 3 のレコードの列 `D` の値 (つまり、'taxa' の値にカンマが含まれていた +レコードの値) も正しくなくなりました。 -If you want to store your data in `csv` format and expect that your -data values may contain commas, you can avoid the problem discussed -above by putting the values in quotes ("").
Applying this rule, our -data might look like this: +データを `csv` 形式で保存し、 +データ値にカンマが含まれる可能性があることが予想される場合は、値を引用符 ("") で囲むことで、上記 +で説明した問題を回避できます。 このルールを適用すると、 +データは次のようになります。 ``` species_id,genus,species,taxa "AB","Amphispiza","bilineata","Bird" "AH","Ammospermophilus","harrisi","Rodent, not censused" "AS","Ammodramus","savannarum","Bird" "BA","Baiomys","taylori","Rodent" ``` -Now opening this file as a `csv` in Excel will not lead to an extra -column, because Excel will only use commas that fall outside of -quotation marks as delimiting characters. - -Alternatively, if you are working with data that contains commas, you -likely will need to use another delimiter when working in a -spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and -working with TSV files. TSV files can be exported from spreadsheet -programs in the same way as CSV files. - -[^decsep]: This is particularly relevant in European - countries where the comma is used as a decimal - separator. In such cases, the default value separator in a - csv file will be the semi-colon (;), or values will be - systematically quoted. - -If you are working with an already existing dataset in which the data -values are not included in "" but which have commas as both delimiters -and parts of data values, you are potentially facing a major problem -with data cleaning. If the dataset you're dealing with contains -hundreds or thousands of records, cleaning them up manually (by either -removing commas from the data values or putting the values into -quotes - "") is not only going to take hours and hours but may -potentially end up with you accidentally introducing many errors. - -Cleaning up datasets is one of the major problems in many scientific -disciplines. The approach almost always depends on the particular -context.
However, it is a good practice to clean the data in an -automated fashion, for example by writing and running a script. The -Python and R lessons will give you the basis for developing skills to -build relevant scripts. - -## Summary +Excel では +引用符の外側にあるカンマのみが区切り文字として使用されるため、このファイルを Excel で「csv」として開いても、余分な +列は生成されません。 + +あるいは、カンマを含むデータを操作している場合、 +シートで作業するときに別の区切り文字を使用する必要がある可能 +があります[^decsep]。 この場合、区切り文字としてタブを使用し、TSV ファイルを扱う場合は +使用することを検討してください。 TSV ファイルは、CSV ファイルと同じ方法でスプレッドシート +プログラムからエクスポートできます。 + +[^decsep]: これは、カンマが小数点の + として使用されるヨーロッパの + 諸国に特に関係します。 このような場合、 + csv ファイルのデフォルト値の区切り文字はセミコロン (;) になるか、値は体系的に引用符で囲まれた + になります。 + +データ +の値が "" に含まれていないものの、区切り文字 +とデータ値の一部としてカンマが含まれている既存のデータセットを操作している場合は、データ クリーニングに関する重大な問題 +に直面する可能性があります。 扱っているデータセットに +百または数千のレコードが含まれている場合、それらを手動でクリーンアップする ( +データ値からカンマを削除するか、値を +引用符 ("") で囲む) と、何時間もかかるだけではありません。ただし、誤って +のエラーが発生する可能性があります。 + +データセットのクリーンアップは、多くの +分野における主要な問題の 1 つです。 このアプローチは、ほとんどの場合、特定 +コンテキストに依存します。 ただし、スクリプトを作成して実行するなど、 +化された方法でデータをクリーンアップすることをお勧めします。 +Python と R のレッスンは、 +するスクリプトを構築するためのスキルを開発するための基礎を提供します。 + +## まとめ ```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} knitr::include_graphics("fig/analysis.png") ``` -A typical data analysis workflow is illustrated in the figure above, -where data is repeatedly transformed, visualised, and modelled. This -iteration is repeated multiple times until the data is understood. In -many real-life cases, however, most time is spent cleaning up and -preparing the data, rather than actually analysing and understanding -it. 
+典型的なデータ分析ワークフローは、上の図 +に示されており、データは繰り返し変換、視覚化、モデル化されます。 この +の繰り返しは、データが理解されるまで複数回繰り返されます。 ただし、 +の実際のケースでは、実際にデータを分析して理解すること +ではなく、データのクリーンアップと準備 +にほとんどの時間が費やされます。 -An agile data analysis workflow, with several fast iterations of the -transform/visualise/model cycle is only feasible if the data is -formatted in a predictable way and one can reason about the data -without having to look at it and/or fix it. +変換/視覚化/モデルのサイクルを高速で +回繰り返すアジャイルなデータ分析ワークフローは、データが予測可能な方法で +されており、データを調べたり +したりすることなく推論できる場合にのみ実現可能です。それ。 :::::::::::::::::::::::::::::::::::::::: keypoints -- Good data organization is the foundation of any research project. +- 適切なデータ構成は、あらゆる研究プロジェクトの基礎です。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: From bef4a5bb7be0f61591c48c0c981fd777bdc7e1f2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 11 Jan 2024 14:26:19 +0900 Subject: [PATCH 100/334] New translations 20-r-rstudio.md (Japanese) --- locale/ja/episodes/20-r-rstudio.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/20-r-rstudio.Rmd b/locale/ja/episodes/20-r-rstudio.Rmd index 6806f894e..1ab9ae4d1 100644 --- a/locale/ja/episodes/20-r-rstudio.Rmd +++ b/locale/ja/episodes/20-r-rstudio.Rmd @@ -23,7 +23,7 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: -## What is R? What is RStudio? +## What is R? RStudioとは何ですか? 
The term [R](https://www.r-project.org/) is used to refer to the _programming language_, the _environment for statistical computing_ From 9d0b5d7bb61f568b45a5218eadaee48a7893988b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 11 Jan 2024 14:26:23 +0900 Subject: [PATCH 101/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 1008 ++++++++++----------- 1 file changed, 503 insertions(+), 505 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 981b7aa59..79eabfdd7 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -1,6 +1,6 @@ --- -source: Rmd -title: Introduction to R +source: RMD +title: R の紹介 teaching: 60 exercises: 60 --- @@ -10,411 +10,409 @@ exercises: 60 ::::::::::::::::::::::::::::::::::::::: 目的 -- Define the following terms as they relate to R: object, assign, call, function, arguments, options. -- Assign values to objects in R. -- Learn how to _name_ objects -- Use comments to inform script. -- Solve simple arithmetic operations in R. -- Call functions and use arguments to change their default options. -- Inspect the content of vectors and manipulate their content. -- Subset and extract values from vectors. -- Analyze vectors with missing data. 
+- R に関連する次の用語を定義します: オブジェクト、代入、呼び出し、関数、引数、オプション。 +- R のオブジェクトに値を割り当てます。 +- オブジェクトに _名前を付ける_ 方法を学びます。 +- コメントを使用してスクリプトに情報を与えます。 +- R で単純な算術演算を解きます。 +- 関数を呼び出し、引数を使用してデフォルトのオプションを変更します。 +- ベクトルの内容を検査し、その内容を操作します。 +- ベクトルから値をサブセット化して抽出します。 +- データが欠落しているベクトルを解析します。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +::::::::::::::::::::::::::::::::::::::: questions -- First commands in R +- R の最初のコマンド -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -## Creating objects in R +## R でオブジェクトを作成する -You can get output from R simply by typing math in the console: +コンソールに数式を入力するだけで、R から出力を取得できます。 ```{r, purl=TRUE} 3 + 5 12 / 7 ``` -However, to do useful and interesting things, we need to assign _values_ to -_objects_. To create an object, we need to give it a name followed by the -assignment operator `<-`, and the value we want to give it: +ただし、便利で興味深いことを行うには、_値_ を +_オブジェクト_ に割り当てる必要があります。 オブジェクトを作成するには、オブジェクトに名前を付け、その後に +代入演算子 `<-` と、それに付けたい値を付ける必要があります。 ```{r, purl=TRUE} weight_kg <- 55 ``` -`<-` is the assignment operator. It assigns values on the right to -objects on the left. So, after executing `x <- 3`, the value of `x` is -`3`. The arrow can be read as 3 **goes into** `x`. For historical -reasons, you can also use `=` for assignments, but not in every -context. Because of the -[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) -in syntax, it is good practice to always use `<-` for assignments.
+`<-` は代入演算子です。 右側の値を左側の +オブジェクトに割り当てます。 したがって、`x <- 3` を実行すると、`x` の値は +`3` になります。 矢印は 3 **が `x` に入る** と読むことができます。 歴史的 +理由により、代入に `=` を使用することもできますが、すべて +のコンテキストで使用できるわけではありません。 構文に +[わずかな違い](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +があるため、代入には常に `<-` を使用することをお勧めします。 -In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> -at the same time as the <kbd>-</kbd> key) will write `<-` in a single -keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the -same in a Mac. +RStudio では、 <kbd>Alt</kbd> + <kbd>-</kbd> を入力すると ( <kbd>-</kbd> キーと同時に <kbd>Alt</kbd> +を押すと)、PC では +1 回のキーストロークで `<-` が書き込まれます。 <kbd>Option</kbd> + <kbd>-</kbd> ( <kbd>Option</kbd> を <kbd>-</kbd> キーと同時に押す) は、Mac で +同じことを行います。 -### Naming variables +### 変数に名前を付ける -Objects can be given any name such as `x`, `current_temperature`, or -`subject_id`. You want your object names to be explicit and not too -long. They cannot start with a number (`2x` is not valid, but `x2` -is). R is case sensitive (e.g., `weight_kg` is different from -`Weight_kg`). There are some names that cannot be used because they -are the names of fundamental functions in R (e.g., `if`, `else`, -`for`, see -[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) -for a complete list). In general, even if it's allowed, it's best to -not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, -`weights`). If in doubt, check the help to see if the name is already -in use. It's also best to avoid dots (`.`) within an object name as in -`my.dataset`. There are many functions in R with dots in their names -for historical reasons, but because dots have a special meaning in R -(for methods) and other programming languages, it's best to avoid -them. It is also recommended to use nouns for object names, and verbs -for function names.
It's important to be consistent in the styling of -your code (where you put spaces, how you name objects, etc.). Using a -consistent coding style makes your code clearer to read for your -future self and your collaborators. In R, some popular style guides -are [Google's](https://google.github.io/styleguide/Rguide.xml), the -[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor -style -guide. The -tidyverse's is very comprehensive and may seem overwhelming at -first. You can install the -[**`lintr`**](https://github.com/jimhester/lintr) package to -automatically check for issues in the styling of your code. +オブジェクトには、`x`、`current_temperature`、または `subject_id` などの任意の +名前を付けることができます。 オブジェクト名は明示的で、長すぎ +ないようにしましょう。 数字で始めることはできません (`2x` は無効ですが、`x2` +は有効です)。 R では大文字と小文字が区別されます (たとえば、`weight_kg` は +`Weight_kg` とは異なります)。 R の基本的な関数の名前であるため、使用できない +名前が +いくつかあります (例: `if`、`else`、 +`for`。完全なリストについては、[こちら](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +を参照)。 一般に、たとえ許可されていても、他の関数名 (例: `c`、`T`、`mean`、`data`、`df`、 +`weights`) は使用しないことが最善 +です。 疑問がある場合は、ヘルプを参照して、その名前がすでに +使用されているかどうかを確認してください。 また、 +`my.dataset` のように、オブジェクト名内にドット (`.`) を使用しないことも最善です。 R には歴史的な理由から名前にドットが含まれる関数が多数あります +が、R +(メソッド) や他のプログラミング言語ではドットが特別な意味を持っているため、ドットは避けるのが最善です +。 オブジェクト名には名詞を使用し、関数名には動詞 +を使用することもお勧めします。 コード +のスタイル (スペースを入れる場所、オブジェクトの名前の付け方など) に一貫性を持たせることが重要です。 一貫した +コーディング スタイルを使用すると、将来 +の自分や共同作業者にとって、コードがより明確に読みやすくなります。 R では、人気のあるスタイル ガイド +には、[Google の](https://google.github.io/styleguide/Rguide.xml)、 +[tidyverse の](https://style.tidyverse.org/) スタイル、および Bioconductor +スタイル +ガイドがあります。 +tidyverse のスタイル ガイドは非常に包括的であり、最初 +は圧倒されるように思えるかもしれません。 +[**`lintr`**](https://github.com/jimhester/lintr) パッケージを +インストールすると、コードのスタイルの問題を自動的にチェックできます。 -> **Objects vs. variables**: What are known as `objects` in `R` are -> known as `variables` in many other programming languages. Depending -> on the context, `object` and `variable` can have drastically -> different meanings.
However, in this lesson, the two words are used -> synonymously. For more information -> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) +> **オブジェクトと変数**: 「R」で「オブジェクト」として知られているものは、他の多くのプログラミング言語では「変数」として知られて +> います。 文脈 +> に応じて、「オブジェクト」と「変数」は大幅 +> に異なる意味を持つ可能性があります。 ただし、このレッスンでは、2 つの単語は同義 +> 的に使用されます。 詳細については、 +> [ここを参照してください。](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) -When assigning a value to an object, R does not print anything. You -can force R to print the value by using parentheses or by typing the -object name: +オブジェクトに値を割り当てるとき、R は何も出力しません。 かっこを使用するか、オブジェクト +名を入力することで、R +に値を強制的に出力させることができます。 ```{r, purl=TRUE} -weight_kg <- 55 # doesn't print anything -(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` -weight_kg # and so does typing the name of the object +weight_kg <- 55 # 何も出力しません +(weight_kg <- 55) # しかし、呼び出しを括弧で囲むと `weight_kg` の値が出力されます +weight_kg # オブジェクトの名前を入力しても同様に出力されます ``` -Now that R has `weight_kg` in memory, we can do arithmetic with it. For -instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): +R のメモリに `weight_kg` があるので、それを使って算術演算を行うことができます。 たとえば +、この重量をポンドに変換したい場合があります (ポンドでの重量は kg での重量の 2.2 倍です)。 ```{r, purl=TRUE} 2.2 * weight_kg ``` -We can also change an object's value by assigning it a new one: +オブジェクトに新しい値を割り当てることで、オブジェクトの値を変更することもできます。 ```{r, purl=TRUE} weight_kg <- 57.5 2.2 * weight_kg ``` -This means that assigning a value to one object does not change the values of -other objects For example, let's store the animal's weight in pounds in a new -object, `weight_lb`: +これは、1 +つのオブジェクトに値を割り当てても、他のオブジェクトの値は変更されないことを意味します。たとえば、動物の体重をポンド単位で新しい +オブジェクト `weight_lb` に保存してみましょう。 ```{r, purl=TRUE} weight_lb <- 2.2 * weight_kg ``` -and then change `weight_kg` to 100.
+次に `weight_kg` を 100 に変更します。 ```{r} weight_kg <- 100 ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジ: -What do you think is the current content of the object `weight_lb`? -126.5 or 220? +オブジェクト `weight_lb` の現在の内容は何だと思いますか? +126.5 それとも 220? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -## Comments +## コメント -The comment character in R is `#`, anything to the right of a `#` in a -script will be ignored by R. It is useful to leave notes, and -explanations in your scripts. +R のコメント文字は `#` です。 スクリプトの `#` の右側にあるものはすべて R によって無視されます。スクリプトにメモや +説明を残すと便利です。 -RStudio makes it easy to comment or uncomment a paragraph: after -selecting the lines you want to comment, press at the same time on -your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If -you only want to comment out one line, you can put the cursor at any -location of that line (i.e. no need to select the whole line), then -press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. +RStudio では、段落のコメントまたはコメント解除が簡単に行えます。 +コメントしたい行を選択した後、 +キーボードで <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> を同時に押します。 +1 行だけをコメントアウトしたい場合は、その行の任意の +位置にカーソルを置き (つまり、行全体を選択する必要はありません)、その後 +<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd> を押します。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -What are the values after each statement in the following? +次の各ステートメントの後の値は何ですか? ```{r, purl=TRUE} mass <- 47.5 # mass? age <- 122 # age? mass <- mass * 2.0 # mass? age <- age - 20 # age? mass_index <- mass/age # mass_index?
``` -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -## Functions and their arguments +## 関数とその引数 -Functions are "canned scripts" that automate more complicated sets of commands -including operations assignments, etc. Many functions are predefined, or can be -made available by importing R _packages_ (more on that later). A function -usually gets one or more inputs called _arguments_. Functions often (but not -always) return a _value_. A typical example would be the function `sqrt()`. The -input (the argument) must be a number, and the return value (in fact, the -output) is the square root of that number. Executing a function ('running it') -is called _calling_ the function. An example of a function call is: +関数は、操作や代入など +を含む、より複雑なコマンド セットを自動化する「定型スクリプト」です。 多くの関数は事前定義されているか、R _パッケージ_ をインポートすることで +利用可能になります (詳細は後ほど)。 関数 +は通常、_引数_ と呼ばれる 1 つ以上の入力を取得します。 関数は多くの場合 (常に +ではありませんが) _値_ を返します。 典型的な例は関数 `sqrt()` です。 +入力 (引数) は数値でなければならず、戻り値 (実際には +出力) はその数値の平方根です。 関数を実行すること +は関数の _呼び出し_ と呼ばれます。 関数呼び出しの例は次のとおりです。 ```{r, eval=FALSE, purl=FALSE} b <- sqrt(a) ``` -Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function -calculates the square root, and returns the value which is then assigned to -the object `b`. This function is very simple, because it takes just one argument. +ここでは、`a` の値が `sqrt()` 関数に与えられ、`sqrt()` 関数は +平方根を計算して値を返し、その値がオブジェクト `b` +に代入されます。 この関数は引数を 1 つだけ取るため、非常に単純です。 -The return 'value' of a function need not be numerical (like that of `sqrt()`), -and it also does not need to be a single item: it can be a set of things, or -even a dataset. We'll see that when we read data files into R. +関数の戻り「値」は数値 (`sqrt()` のような) である必要はなく、 +また、単一の項目である必要もありません。一連のものや +、さらにはデータセットでも構いません。 データ ファイルを R に読み込むと、それがわかります。 -Arguments can be anything, not only numbers or filenames, but also other -objects.
Exactly what each argument means differs per function, and must be -looked up in the documentation (see below). Some functions take arguments which -may either be specified by the user, or, if left out, take on a _default_ value: -these are called _options_. Options are typically used to alter the way the -function operates, such as whether it ignores 'bad values', or what symbol to -use in a plot. However, if you want something specific, you can specify a value -of your choice which will be used instead of the default. +引数には、数値やファイル名だけでなく、他の +オブジェクトも含めることができます。 各引数の正確な意味は関数ごとに異なるため、ドキュメントで調べる +必要があります (下記を参照)。 一部の関数が取る引数 +は、ユーザーによって指定されるか、指定されなかった場合は _デフォルト_ 値を取ります: +これらは _オプション_ と呼ばれます。 オプションは通常、「不正な値」を無視するかどうか、プロットでどのような記号を使用する +かなど、 +関数の動作方法を変更するために使用されます。 ただし、特定の値が必要な場合は、デフォルトの代わりに使用される値 +を選択して指定できます。 -Let's try a function that can take multiple arguments: `round()`. +複数の引数を取ることができる関数 `round()` を試してみましょう。 ```{r, results="show", purl=TRUE} round(3.14159) ``` -Here, we've called `round()` with just one argument, `3.14159`, and it has -returned the value `3`. That's because the default is to round to the nearest -whole number. If we want more digits we can see how to do that by getting -information about the `round` function. We can use `args(round)` or look at the -help for this function using `?round`. +ここでは、1 つの引数 `3.14159` を指定して `round()` を呼び出し、 +値 `3` が返されました。 これは、デフォルトでは最も近い +整数に丸められるためです。 さらに多くの桁が必要な場合は、`round` 関数に関する +情報を取得することでその方法がわかります。 `args(round)` を使用するか、`?round` を使用してこの +関数のヘルプを参照することができます。 ```{r, results="show", purl=TRUE} args(round) ``` ```{r, eval=FALSE, purl=TRUE} ?round ``` -We see that if we want a different number of digits, we can -type `digits=2` or however many we want.
+別の桁数が必要な場合は、`digits=2` または必要な桁数を入力 +できることがわかります。 ```{r, results="show", purl=TRUE} round(3.14159, digits = 2) ``` -If you provide the arguments in the exact same order as they are defined you -don't have to name them: +定義されているのとまったく同じ順序で引数を指定する場合は +引数に名前を付ける必要はありません。 ```{r, results="show", purl=TRUE} round(3.14159, 2) ``` -And if you do name the arguments, you can switch their order: +引数に名前を付けた場合は、その順序を入れ替えることができます。 ```{r, results="show", purl=TRUE} round(digits = 2, x = 3.14159) ``` -It's good practice to put the non-optional arguments (like the number you're -rounding) first in your function call, and to specify the names of all optional -arguments. If you don't, someone reading your code might have to look up the -definition of a function with unfamiliar arguments to understand what you're -doing. By specifying the name of the arguments you are also safeguarding -against possible future changes in the function interface, which may -potentially add new arguments in between the existing ones. +関数呼び出しの最初にオプションではない引数 ( +四捨五入する数値など) を置き、すべてのオプションの +引数の名前を指定することをお勧めします。 そうしないと、コードを読む人が、何 +をしているのかを理解するために、なじみのない引数を持つ関数の定義を調べなければならない +可能性があります。 引数の名前を指定することで、関数インターフェースの将来の変更 (既存の引数の間に新しい +引数が追加される可能性) から +コードを保護することもできます。 -## Vectors and data types +## ベクトルとデータ型 -A vector is the most common and basic data type in R, and is pretty much -the workhorse of R. A vector is composed by a series of values, such as -numbers or characters. We can assign a series of values to a vector using -the `c()` function.
For example we can create a vector of animal weights and assign -it to a new object `weight_g`: +ベクトルは R で最も一般的かつ基本的なデータ型であり、ほぼ R の主力である +です。ベクトルは、 +数字や文字などの一連の値で構成されます。 +の `c()` 関数を使用して、一連の値をベクトルに割り当てることができます。 たとえば、動物の体重のベクトルを作成し、それを新しいオブジェクト `weight_g` に +に割り当てることができます。 ```{r, purl=TRUE} -weight_g <- c(50, 60, 65, 82) -weight_g +体重g <- c(50, 60, 65, 82) +体重g ``` -A vector can also contain characters: +ベクトルには文字も含めることができます。 ```{r, purl=TRUE} -molecules <- c("dna", "rna", "protein") -molecules +分子 <- c("dna", "rna", "タンパク質") +分子 ``` -The quotes around "dna", "rna", etc. are essential here. Without the -quotes R will assume there are objects called `dna`, `rna` and -`protein`. As these objects don't exist in R's memory, there will be -an error message. +ここでは「dna」や「rna」などの引用符が重要です。 引用符 +がないと、R は `dna`、`rna`、および +`protein` と呼ばれるオブジェクトがあると想定します。 これらのオブジェクトは R のメモリに存在しないため、エラー メッセージが +されます。 -There are many functions that allow you to inspect the content of a -vector. `length()` tells you how many elements are in a particular vector: +ベクトルの内容を検査できる関数が多数あります。 `length()` は、特定のベクトルに含まれる要素の数を示します。 ```{r, purl=TRUE} -length(weight_g) -length(molecules) +長さ(重量_g) +長さ(分子) ``` -An important feature of a vector, is that all of the elements are the -same type of data. The function `class()` indicates the class (the -type of element) of an object: +ベクトルの重要な特徴は、すべての要素が +タイプのデータであることです。 関数 `class()` は、オブジェクトのクラス ( +型の要素) を示します。 ```{r, purl=TRUE} -class(weight_g) -class(molecules) +クラス(体重g) +クラス(分子) ``` -The function `str()` provides an overview of the structure of an -object and its elements. 
It is a useful function when working with -large and complex objects: +関数 `str()` は、 +オブジェクトとその要素の構造の概要を提供します。 これは、 +て複雑なオブジェクトを扱う場合に便利な関数です。 ```{r, purl=TRUE} -str(weight_g) -str(molecules) +str(体重g) +str(分子) ``` -You can use the `c()` function to add other elements to your vector: +`c()` 関数を使用して、ベクトルに他の要素を追加できます。 ```{r} -weight_g <- c(weight_g, 90) # add to the end of the vector -weight_g <- c(30, weight_g) # add to the beginning of the vector -weight_g +Weight_g <- c(weight_g, 90) # ベクトルの最後に追加 +Weight_g <- c(30,weight_g) # ベクトルの先頭に追加 +Weight_g ``` -In the first line, we take the original vector `weight_g`, add the -value `90` to the end of it, and save the result back into -`weight_g`. Then we add the value `30` to the beginning, again saving -the result back into `weight_g`. +最初の行では、元のベクトル `weight_g` を取得し、その末尾に +値 `90` を追加し、結果を +`weight_g` に保存します。 次に、値 `30` を先頭に追加し、結果を再び +として `weight_g` に保存します。 -We can do this over and over again to grow a vector, or assemble a -dataset. As we program, this may be useful to add results that we are -collecting or calculating. +これを何度も繰り返してベクトルを成長させたり、 +データセットを組み立てたりすることができます。 これは、プログラムするときに、 +または計算している結果を追加するのに役立つ場合があります。 -An **atomic vector** is the simplest R **data type** and is a linear -vector of a single type. Above, we saw 2 of the 6 main **atomic -vector** types that R uses: `"character"` and `"numeric"` (or -`"double"`). These are the basic building blocks that all R objects -are built from. 
The other 4 **atomic vector** types are: +**アトミック ベクトル**は最も単純な R **データ型**であり、単一型の線形 +ベクトルです。 上では、R が使用する 6 つの主な **アトミック +ベクトル** タイプのうち 2 つ、つまり `"character"` と `"numeric"` (または +`"double"`) を見てきました。 これらは、すべての R オブジェクト +が構築される基本的な構成要素です。 他の 4 つの **原子ベクトル** タイプは次のとおりです。 -- `"logical"` for `TRUE` and `FALSE` (the boolean data type) -- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R - that it's an integer) -- `"complex"` to represent complex numbers with real and imaginary - parts (e.g., `1 + 4i`) and that's all we're going to say about them -- `"raw"` for bitstreams that we won't discuss further +- `TRUE` および `FALSE` の場合は `"logical"` (ブール データ型) +- 整数の場合は `"integer"` (たとえば、`2L`、`L` は R + にそれが整数であることを示します) +- `"complex"` は、実数と虚数の + 部分を持つ複素数を表します (例: 1 + 4i)。これについて説明するのはこれですべてです。 +- ビットストリームの「raw」\` (これ以上は説明しません) -You can check the type of your vector using the `typeof()` function -and inputting your vector as the argument. +`typeof()` 関数 +を使用し、ベクトルを引数として入力することで、ベクトルの型をチェックできます。 -Vectors are one of the many **data structures** that R uses. Other -important ones are lists (`list`), matrices (`matrix`), data frames -(`data.frame`), factors (`factor`) and arrays (`array`). +ベクトルは、R が使用する多くの **データ構造** の 1 つです。 その他 +重要なものは、リスト (`list`)、行列 (`matrix`)、データ フレーム +(`data.frame`)、因子 (`factor`)、および配列 (`array`) です。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -We've seen that atomic vectors can be of type character, numeric (or -double), integer, and logical. But what happens if we try to mix -these types in a single vector? +アトミック ベクトルの型は、文字、数値 (または +double)、整数、および論理型であることがわかりました。 しかし、これらのタイプを +つのベクトルに混在させようとするとどうなるでしょうか? 
-::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 -R implicitly converts them to all be the same type +R はそれらをすべて同じ型に暗黙的に変換します。 -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -What will happen in each of these examples? (hint: use `class()` to -check the data type of your objects and type in their names to see what happens): +これらのそれぞれの例では何が起こるでしょうか? (ヒント: `class()` を使用してオブジェクトのデータ型を確認 +、名前を入力して何が起こるかを確認します): ```{r, eval=TRUE} num_char <- c(1, 2, 3, "a") num_logical <- c(1, 2, 3, TRUE, FALSE) -char_logical <- c("a", "b", "c", TRUE) -tricky <- c(1, 2, 3, "4") +char_logical <- c("a", "b", "c", TRUE) ) +トリッキー <- c(1, 2, 3, "4") ``` -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 ```{r, purl=TRUE} class(num_char) num_char class(num_logical) num_logical -class(char_logical) +クラス(char_logical) char_logical -class(tricky) -tricky +クラス(トリッキー) +トリッキー ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -Why do you think it happens? +なぜそれが起こると思いますか? -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 -Vectors can be of only one data type. R tries to convert (coerce) -the content of this vector to find a _common denominator_ that -doesn't lose any information. 
+ベクトルのデータ型は 1 つだけです。 R は、 +が情報を失わないという _共通分母_ を見つけるために、このベクトルの内容を +に変換 (強制) しようとします。 -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -How many values in `combined_logical` are `"TRUE"` (as a character) -in the following example: +次の例では、`combined_logical` 内の `"TRUE"` (文字として) +となる値はいくつありますか。 ```{r, eval=TRUE} num_logical <- c(1, 2, 3, TRUE) @@ -422,487 +420,487 @@ char_logical <- c("a", "b", "c", TRUE) combined_logical <- c(num_logical, char_logical) ``` -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 -Only one. There is no memory of past data types, and the coercion -happens the first time the vector is evaluated. Therefore, the `TRUE` -in `num_logical` gets converted into a `1` before it gets converted -into `"1"` in `combined_logical`. +唯一。 過去のデータ型の記憶はなく、ベクトルが初めて評価されるときに強制 +が発生します。 したがって、「num_logical」の「TRUE」 +「combined_logical」で +「1」に変換される前に、「1」に変換されます。 ```{r} -combined_logical +結合論理 ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -In R, we call converting objects from one class into another class -_coercion_. These conversions happen according to a hierarchy, -whereby some types get preferentially coerced into other types. Can -you draw a diagram that represents the hierarchy of how these data -types are coerced? +R では、オブジェクトをあるクラスから別のクラスに変換することを +強制\* と呼びます。 これらの変換は階層 +に従って行われ、一部の型が優先的に他の型に強制されます。 これらのデータ +がどのように強制されるかの階層を表す図を描いてもらえます +? 
-::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 -logical → numeric → character ← logical +論理 → 数値 → 文字 ← 論理 -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} -## We've seen that atomic vectors can be of type character, numeric, integer, and -## logical. But what happens if we try to mix these types in a single -## vector? +## アトミック ベクトルのタイプは、文字、数値、整数、および +## 論理型であることがわかりました。しかし、これらのタイプを単一の +## ベクトルに混在させようとするとどうなるでしょうか? -## What will happen in each of these examples? (hint: use `class()` to -## check the data type of your object) +## これらのそれぞれの例では何が起こるでしょうか? (ヒント: `class()` を使用して +## オブジェクトのデータ型を確認します) num_char <- c(1, 2, 3, "a") -num_logical <- c(1, 2, 3, TRUE) +num_logical <- c(1, 2, 3, TRUE) ) char_logical <- c("a", "b", "c", TRUE) -tricky <- c(1, 2, 3, "4") +トリッキー <- c(1, 2, 3, "4") -## Why do you think it happens? +## なぜそれが起こると思いますか? -## You've probably noticed that objects of different types get -## converted into a single, shared type within a vector. In R, we call -## converting objects from one class into another class -## _coercion_. These conversions happen according to a hierarchy, -## whereby some types get preferentially coerced into other types. Can -## you draw a diagram that represents the hierarchy of how these data -## types are coerced? +## おそらく、異なる型のオブジェクトがベクトル内の +## 単一の共有型に変換されることに気づいたでしょう。 R では、 +## オブジェクトをあるクラスから別のクラスに変換することを +## _強制_と呼びます。これらの変換は階層に従って行われ、 +## これにより、一部の型が優先的に他の型に強制されます。 +## これらのデータ型がどのように強制されるかの階層を表す図を描くことができますか? +## ``` -## Subsetting vectors +## ベクトルのサブセット化 -If we want to extract one or several values from a vector, we must -provide one or several indices in square brackets. 
For instance: +ベクトルから 1 つまたは複数の値を抽出したい場合は、角括弧内に 1 つまたは複数のインデックスを指定する必要が +ます。 例えば: ```{r, results="show", purl=TRUE} -molecules <- c("dna", "rna", "peptide", "protein") -molecules[2] -molecules[c(3, 2)] +分子 <- c("dna", "rna", "ペプチド", "タンパク質") +分子[2] +分子[c(3, 2)] ``` -We can also repeat the indices to create an object with more elements -than the original one: +インデックスを繰り返して、元のオブジェクトよりも要素 +が多いオブジェクトを作成することもできます。 ```{r, results="show", purl=TRUE} -more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules <- 分子[c(1, 2, 3, 2, 1, 4)] more_molecules ``` -R indices start at 1. Programming languages like Fortran, MATLAB, -Julia, and R start counting at 1, because that's what human beings -typically do. Languages in the C family (including C++, Java, Perl, -and Python) count from 0 because that's simpler for computers to do. +R インデックスは 1 から始まります。 Fortran、MATLAB、 +Julia、R などのプログラミング言語は +から数え始めます。これは人間が通常行うことだからです。 C ファミリの言語 (C++、Java、Perl、 +、Python を含む) は 0 からカウントします。これは、コンピュータにとってその方が簡単なためです。 -Finally, it is also possible to get all the elements of a vector -except some specified elements using negative indices: +最後に、負のインデックスを使用して、指定された一部の要素を除くベクトル +のすべての要素を取得することもできます。 ```{r} -molecules ## all molecules -molecules[-1] ## all but the first one -molecules[-c(1, 3)] ## all but 1st/3rd ones -molecules[c(-1, -3)] ## all but 1st/3rd ones +分子 ## すべての分子 +分子[-1] ## 最初の分子を除くすべての分子 +分子[-c(1, 3)] ## 1 番目/3 番目の分子を除くすべての分子 +分子[c(-1, -3)] # # 1番目/3番目を除くすべて ``` -## Conditional subsetting +## 条件付きサブセット化 -Another common way of subsetting is by using a logical vector. 
`TRUE` will -select the element with the same index, while `FALSE` will not: +サブセット化のもう 1 つの一般的な方法は、論理ベクトルを使用することです。 `TRUE` は同じインデックスを持つ要素を選択し +が、`FALSE` は選択しません。 ```{r, purl=TRUE} -weight_g <- c(21, 34, 39, 54, 55) -weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +重み_g <- c(21, 34, 39, 54, 55) +重み_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] ``` -Typically, these logical vectors are not typed by hand, but are the -output of other functions or logical tests. For instance, if you -wanted to select only the values above 50: +通常、これらの論理ベクトルは手動で入力されるのではなく、他の関数または論理テストの +出力です。 たとえば、50 を超える値のみを選択したい場合は、 +のようにします。 ```{r, purl=TRUE} -## will return logicals with TRUE for the indices that meet -## the condition -weight_g > 50 -## so we can use this to select only the values above 50 -weight_g[weight_g > 50] +## は、 +を満たすインデックスに対して TRUE の論理値を返します。 ## 条件 +Weight_g > 50 +## したがって、これを使用して 50 +Weight_g[weight_g > 50] を超える値のみを選択できます。 ``` -You can combine multiple tests using `&` (both conditions are true, -AND) or `|` (at least one of the conditions is true, OR): +`&` (両方の条件が true、 +AND) または `|` (少なくとも 1 つの条件が true、OR) を使用して複数のテストを結合できます。 ```{r, results="show", purl=TRUE} -weight_g[weight_g < 30 | weight_g > 50] -weight_g[weight_g >= 30 & weight_g == 21] +体重g[体重g < 30 |体重g > 50] +体重g[体重g >= 30 & 体重g == 21] ``` -Here, `<` stands for "less than", `>` for "greater than", `>=` for -"greater than or equal to", and `==` for "equal to". The double equal -sign `==` is a test for numerical equality between the left and right -hand sides, and should not be confused with the single `=` sign, which -performs variable assignment (similar to `<-`). +ここで、「<」は「より小さい」、「>」は「より大きい」、「>=」は +「以上」、「==」は「等しい」を表します。 2 つの等号 +記号「==」は、左側と +の数値が等しいかどうかをテストするものであり、(「<-」と同様に) 変数の代入を +する単一の `=` 記号と混同しないでください。 。 -A common task is to search for certain strings in a vector. One could -use the "or" operator `|` to test for equality to multiple values, but -this can quickly become tedious. 
The function `%in%` allows you to -test if any of the elements of a search vector are found: +一般的なタスクは、ベクトル内の特定の文字列を検索することです。\ +「or」演算子 `|` を使用して複数の値が等しいかどうかをテストすることもできますが、 +これはすぐに面倒になります。 関数 `%in%` を使用すると、検索ベクトルの要素が見つかったかどうかを +できます。 ```{r, purl=TRUE} -molecules <- c("dna", "rna", "protein", "peptide") -molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna -molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") -molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] +分子 <- c("dna", "rna", "タンパク質", "ペプチド") +分子[分子 == "rna" |分子 == "dna"] # rna と dna の両方を返します +分子 %in% c("rna", "dna", "代謝物", "ペプチド", "グリセロール") +分子[分子 %in% c("rna", " 「DNA」、「代謝物」、「ペプチド」、「グリセロール」)] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -Can you figure out why `"four" > "five"` returns `TRUE`? +なぜ `"four" > "five"` が `TRUE` を返すのか理解できますか? -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 ```{r} -"four" > "five" +「4」 > 「5」 ``` -When using `>` or `<` on strings, R compares their alphabetical order. -Here `"four"` comes after `"five"`, and therefore is _greater than_ -it. +文字列で `>` または `<` を使用すると、R はそれらのアルファベット順を比較します。 +ここで、`"four"` は `"five"` の後に来るので、それは \* +より大きい\* です。 -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -## Names +## 名前 -It is possible to name each element of a vector. The code chunk below -shows an initial vector without any names, how names are set, and -retrieved. 
+ベクトルの各要素に名前を付けることができます。 +より下のコード チャンクは、名前のない初期ベクトル、名前の設定方法、および +取得される様子を示しています。 ```{r} x <- c(1, 5, 3, 5, 10) -names(x) ## no names -names(x) <- c("A", "B", "C", "D", "E") -names(x) ## now we have names +names(x) ## 名前なし +names(x) <- c("A", "B", "C", "D", " E") +名前(x) ## これで名前が決まりました ``` -When a vector has names, it is possible to access elements by their -name, in addition to their index. +ベクトルに名前がある場合、 +に加えて名前によって要素にアクセスすることができます。 ```{r} x[c(1, 3)] x[c("A", "C")] ``` -## Missing data +## データが欠落しています -As R was designed to analyze datasets, it includes the concept of -missing data (which is uncommon in other programming -languages). Missing data are represented in vectors as `NA`. +R はデータセットを分析するように設計されているため、欠損データが +であるという概念が含まれています (これは +のプログラミング言語では一般的ではありません)。 欠損データはベクトルで「NA」として表されます。 -When doing operations on numbers, most functions will return `NA` if -the data you are working with include missing values. This feature -makes it harder to overlook the cases where you are dealing with -missing data. You can add the argument `na.rm = TRUE` to calculate -the result while ignoring the missing values. +数値の演算を行う場合、扱っているデータに欠損値が含まれている場合 +ほとんどの関数は「NA」を返します。 この機能により、 +データを処理し +いるケースを見逃しにくくなります。 引数 `na.rm = TRUE` を追加すると、欠損値を無視して結果を +として計算できます。 ```{r} -heights <- c(2, 4, 4, NA, 6) -mean(heights) -max(heights) -mean(heights, na.rm = TRUE) -max(heights, na.rm = TRUE) +身長 <- c(2, 4, 4, NA, 6) +平均(身長) +最大(身長) +平均(身長、na.rm = TRUE) +最大(身長、na.rm = TRUE) ``` -If your data include missing values, you may want to become familiar -with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See -below for examples. +データに欠損値が含まれている場合は、関数 `is.na()`、`na.omit()`、および `complete.cases()` に +ておくとよいでしょう。 例については、以下の +を参照してください。 ```{r} -## Extract those elements which are not missing values. +## 欠損値のない要素を抽出します。 heights[!is.na(heights)] -## Returns the object with incomplete cases removed. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). 
+## 不完全なケースを削除したオブジェクトを返します。 +## 返されるオブジェクトは、タイプ `"numeric"` の原子ベクトルです。 +## (または `"double"`)。 na.omit(heights) -## Extract those elements which are complete cases. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). -heights[complete.cases(heights)] +## 完全なケースである要素を抽出します。 +## 返されるオブジェクトは、タイプ `"numeric"` の原子ベクトルです。 +## (または `"double"`)。 +の高さ[完全なケース(高さ)] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -1. Using this vector of heights in inches, create a new vector with the NAs removed. +1. このインチ単位の高さのベクトルを使用して、NA を削除した新しいベクトルを作成します。 ```{r} -heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +高さ <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) ``` -2. Use the function `median()` to calculate the median of the `heights` vector. -3. Use R to figure out how many people in the set are taller than 67 inches. +2. 関数 `median()` を使用して、`heights` ベクトルの中央値を計算します。 +3. R を使用して、セット内の身長が 67 インチを超える人が何人いるかを計算します。 -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 ```{r, purl=TRUE} heights_no_na <- heights[!is.na(heights)] -## or +## または heights_no_na <- na.omit(heights) ``` ```{r, purl=TRUE} -median(heights, na.rm = TRUE) +中央値(身長、na.rm = TRUE) ``` ```{r, purl=TRUE} -heights_above_67 <- heights_no_na[heights_no_na > 67] -length(heights_above_67) +height_above_67 <- height_no_na[heights_no_na > 67] +長さ(heights_above_67) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -## Generating vectors {#sec:genvec} +## ベクトル {#sec:genvec}の生成 ```{r, echo=FALSE} set.seed(1) ``` -### Constructors +### コンストラクター -There exists some functions to generate vectors of different type. 
To -generate a vector of numerics, one can use the `numeric()` -constructor, providing the length of the output vector as -parameter. The values will be initialised with 0. +異なるタイプのベクトルを生成する関数がいくつか存在します。 数値のベクトルを生成 +には、`numeric()` +コンストラクターを使用し、出力ベクトルの長さを +パラメーターとして指定します。 値は 0 で初期化されます。 ```{r, purl=TRUE} -numeric(3) -numeric(10) +数値(3) +数値(10) ``` -Note that if we ask for a vector of numerics of length 0, we obtain -exactly that: +長さ 0 の数値ベクトルを要求すると、次のように +が得られることに注意してください。 ```{r, purl=TRUE} -numeric(0) +数値(0) ``` -There are similar constructors for characters and logicals, named -`character()` and `logical()` respectively. +文字と論理に対しても同様のコンストラクターがあり、 +`character()` と `logical()` という名前が付けられます。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -What are the defaults for character and logical vectors? +文字ベクトルと論理ベクトルのデフォルトは何ですか? -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 ```{r, purl=TRUE} -character(2) ## the empty character -logical(2) ## FALSE +文字(2) ## 空の文字 +論理的(2) ## FALSE ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -### Replicate elements +### 要素を複製する -The `rep` function allow to repeat a value a certain number of -times. If we want to initiate a vector of numerics of length 5 with -the value -1, for example, we could do the following: +`rep` 関数を使用すると、値を特定の回数 ( +回) 繰り返すことができます。 たとえば、長さ 5 の数値ベクトルを +から値 -1 で開始したい場合は、次のようにすることができます。 ```{r, purl=TRUE} -rep(-1, 5) +担当者(-1, 5) ``` -Similarly, to generate a vector populated with missing values, which -is often a good way to start, without setting assumptions on the data -to be collected: +同様に、収集されるデータ +に仮定を設定せずに、欠損値が入力されたベクトルを生成するには (多くの場合、 +から始めるのが良い方法です): ```{r, purl=TRUE} -rep(NA, 5) +担当者(NA, 5) ``` -`rep` can take vectors of any length as input (above, we used vectors -of length 1) and any type. 
For example, if we want to repeat the -values 1, 2 and 3 five times, we would do the following: +`rep` は、入力として任意の長さのベクトル (上記では長さ 1 のベクトル +を使用しました) および任意のタイプを受け取ることができます。 たとえば、 +値 1、2、3 を 5 回繰り返す場合は、次のようにします。 ```{r, purl=TRUE} rep(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -What if we wanted to repeat the values 1, 2 and 3 five times, but -obtain five 1s, five 2s and five 3s in that order? There are two -possibilities - see `?rep` or `?sort` for help. +値 1、2、3 を 5 回繰り返したいのに、 +1 を 5 つ、2 を 5 つ、3 を 5 つこの順序で取得した場合はどうなるでしょうか。 +可能性は 2 つあります。ヘルプについては `?rep` または `?sort` を参照してください。 -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 ```{r, purl=TRUE} rep(c(1, 2, 3), each = 5) sort(rep(c(1, 2, 3), 5)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -### Sequence generation +### シーケンスの生成 -Another very useful function is `seq`, to generate a sequence of -numbers. 
For example, to generate a sequence of integers from 1 to 20 -by steps of 2, one would use: +もう 1 つの非常に便利な関数は、 +の数値シーケンスを生成する `seq` です。 たとえば、1 から 20 +までの整数のシーケンスを 2 ずつ生成するには、次のコマンドを使用します。 ```{r, purl=TRUE} seq(from = 1, to = 20, by = 2) ``` -The default value of `by` is 1 and, given that the generation of a -sequence of one value to another with steps of 1 is frequently used, -there's a shortcut: +`by` のデフォルト値は 1 で、1 のステップで 1 つの値から別の値への +シーケンスの生成が頻繁に使用されることを考えると、 +というショートカットがあります。 ```{r, purl=TRUE} seq(1, 5, 1) -seq(1, 5) ## default by +seq(1, 5) ## デフォルトは 1:5 ``` -To generate a sequence of numbers from 1 to 20 of final length of 3, -one would use: +最終長さが +の 1 から 20 までの一連の数値を生成するには、次のコマンドを使用します。 ```{r, purl=TRUE} -seq(from = 1, to = 20, length.out = 3) +seq(from = 1、to = 20、length.out = 3) ``` -### Random samples and permutations +### ランダムなサンプルと順列 -A last group of useful functions are those that generate random -data. The first one, `sample`, generates a random permutation of -another vector. For example, to draw a random order to 10 students -oral exam, I first assign each student a number from 1 to ten (for -instance based on the alphabetic order of their name) and then: +有用な関数の最後のグループは、ランダムな +データを生成する関数です。 最初の `sample` は、 +のベクトルのランダムな置換を生成します。 たとえば、口頭試験を行わない +人の生徒にランダムな順序を付けるには +まず各生徒に 1 から 10 までの番号を割り当てます (たとえば、名前のアルファベット順に基づきます)。次に次のようにします。 ```{r, purl=TRUE} -sample(1:10) +サンプル(1:10) ``` -Without further arguments, `sample` will return a permutation of all -elements of the vector. If I want a random sample of a certain size, I -would set this value as the second argument. 
Below, I sample 5 random -letters from the alphabet contained in the pre-defined `letters` vector: +さらなる引数がなければ、`sample` はベクトルのすべての +要素の順列を返します。 特定のサイズのランダムなサンプルが必要な場合、I +はこの値を 2 番目の引数として設定します。 以下では、事前定義された `letters` ベクトルに含まれるアルファベットから 5 つのランダムな +文字をサンプリングします。 ```{r, purl=TRUE} -sample(letters, 5) +サンプル(文字、5) ``` -If I wanted an output larger than the input vector, or being able to -draw some elements multiple times, I would need to set the `replace` -argument to `TRUE`: +入力ベクトルよりも大きな出力が必要な場合、または一部の要素を複数回 +できるようにしたい場合は、引数 `replace` +を `TRUE` に設定する必要があります。 ```{r, purl=TRUE} -sample(1:5, 10, replace = TRUE) +サンプル(1:5、10、置換 = TRUE) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: チャレンジ -## Challenge: +## チャレンジ: -When trying the functions above out, you will have realised that the -samples are indeed random and that one doesn't get the same -permutation twice. To be able to reproduce these random draws, one can -set the random number generation seed manually with `set.seed()` -before drawing the random sample. +上記の関数を試してみると、 +サンプルは実際にランダムであり、同じ +順列が 2 回発生することはないことがわかるでしょう。 これらのランダムな描画を再現できるようにするには、ランダム サンプルを描画する前に +`set.seed()` を使用 +て乱数生成シードを手動で設定します。 -Test this feature with your neighbour. First draw two random -permutations of `1:10` independently and observe that you get -different results. +近所の人と一緒にこの機能をテストしてください。 まず、「1:10」のランダムな +順列を 2 つ個別に描画し、 +の異なる結果が得られることを観察します。 -Now set the seed with, for example, `set.seed(123)` and repeat the -random draw. Observe that you now get the same random draws. +次に、たとえば `set.seed(123)` でシードを設定し、 +ランダムな描画を繰り返します。 同じランダムな抽選が行われることに注目してください。 -Repeat by setting a different seed. 
+別のシードを設定して繰り返します。 -::::::::::::::: solution +::::::::::::::: 解決 -## Solution +## 解決 -Different permutations +さまざまな順列 ```{r, purl=TRUE} -sample(1:10) -sample(1:10) +サンプル(1:10) +サンプル(1:10) ``` -Same permutations with seed 123 +シード 123 と同じ順列 ```{r, purl=TRUE} set.seed(123) -sample(1:10) +サンプル(1:10) set.seed(123) -sample(1:10) +サンプル(1:10) ``` -A different seed +違う種 ```{r, purl=TRUE} set.seed(1) -sample(1:10) +サンプル(1:10) set.seed(1) -sample(1:10) +サンプル(1:10) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: -### Drawing samples from a normal distribution +### 正規分布からサンプルを抽出する -The last function we are going to see is `rnorm`, that draws a random -sample from a normal distribution. Two normal distributions of means 0 -and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and -_N(100, 5)_, are shown below. +最後に説明する関数は `rnorm` で、正規分布からランダムな +サンプルを抽出します。 平均 +および 100、標準偏差 1 および 5 の 2 つの正規分布 (_N(0, 1)_ および +_N(100, 5)_ と表記) を以下に示します。 ```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} par(mfrow = c(1, 2)) -plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") -plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +プロット(密度(rnorm(1000)), メイン = "", サブ = "N(0, 1)") +プロット(密度(rnorm(1000, 100, 5) ))、メイン = ""、サブ = "N(100, 5)") ``` -The three arguments, `n`, `mean` and `sd`, define the size of the -sample, and the parameters of the normal distribution, i.e the mean -and its standard deviation. The defaults of the latter are 0 and 1. +3 つの引数「n」、「mean」、「sd」は、サンプル +のサイズと、正規分布のパラメーター、つまり平均 +とその標準偏差を定義します。 後者のデフォルトは 0 と 1 です。 ```{r, purl=TRUE} rnorm(5) @@ -910,12 +908,12 @@ rnorm(5, 2, 2) rnorm(5, 100, 5) ``` -Now that we have learned how to write scripts, and the basics of R's -data structures, we are ready to start working with larger data, and -learn about data frames. 
+スクリプトの書き方と +のデータ構造の基本を学習したので、より大きなデータの操作を開始する準備が整い、データ フレームについて +します。 -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: キーポイント -- How to interact with R +- Rと対話する方法 -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::: From f294edb9f75cf3c2e23e963cb1e72a10b5dcf239 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:32 +0900 Subject: [PATCH 102/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 27 ++++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd index d52686828..d702329b9 100644 --- a/locale/fr/episodes/10-data-organisation.Rmd +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -24,6 +24,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Spreadsheet programs **Question** @@ -482,7 +485,7 @@ different reasons why the data isn't there. This is important information to capture, but is in effect using one column to capture two pieces of information. Like for using formatting to convey information it would be good here to create a new -column like 'data_missing' and use that column to capture the +column like 'data\_missing' and use that column to capture the different reasons. Whatever the reason, it's a problem if unknown or missing data is @@ -512,7 +515,7 @@ excluded. ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting_pretty} +### Using formatting to make the data sheet look pretty {#formatting\_pretty} **Example**: merging cells. @@ -543,7 +546,7 @@ you need both these measurements, design your data sheet to include this information. 
For example, include one column for the ABO group and one for the Rhesus group. -### Using problematic field names {#field_name} +### Using problematic field names {#field\_name} Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be @@ -560,15 +563,15 @@ confusion and enables others to readily interpret your fields. **Examples** -| Good Name | Good Alternative | Avoid | -| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | -| Max_temp_C | MaxTemp | Maximum Temp (°C) | -| Precipitation_mm | Precipitation | precmm | -| Mean_year_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell_type | CellType | Cell Type | -| Observation_01 | first_observation | 1st Obs | +| Good Name | Good Alternative | Avoid | +| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs | ### Using special characters in data {#special} From 3b0dd3024bb0f048b6b348c22f2377f6ee58682d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:34 +0900 Subject: [PATCH 103/334] New translations 10-data-organisation.md (Spanish) --- locale/es/episodes/10-data-organisation.Rmd | 27 ++++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/es/episodes/10-data-organisation.Rmd b/locale/es/episodes/10-data-organisation.Rmd index 02c4c6326..de1b53e0e 100644 --- a/locale/es/episodes/10-data-organisation.Rmd +++ b/locale/es/episodes/10-data-organisation.Rmd @@ -24,6 +24,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Spreadsheet programs **Question** @@ -482,7 +485,7 @@ different reasons why the data isn't there. This is important information to capture, but is in effect using one column to capture two pieces of information. Like for using formatting to convey information it would be good here to create a new -column like 'data_missing' and use that column to capture the +column like 'data\_missing' and use that column to capture the different reasons. Whatever the reason, it's a problem if unknown or missing data is @@ -512,7 +515,7 @@ excluded. ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting_pretty} +### Using formatting to make the data sheet look pretty {#formatting\_pretty} **Example**: merging cells. @@ -543,7 +546,7 @@ you need both these measurements, design your data sheet to include this information. For example, include one column for the ABO group and one for the Rhesus group. 
-### Using problematic field names {#field_name} +### Using problematic field names {#field\_name} Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be @@ -560,15 +563,15 @@ confusion and enables others to readily interpret your fields. **Examples** -| Good Name | Good Alternative | Avoid | -| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | -| Max_temp_C | MaxTemp | Maximum Temp (°C) | -| Precipitation_mm | Precipitation | precmm | -| Mean_year_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell_type | CellType | Cell Type | -| Observation_01 | first_observation | 1st Obs | +| Good Name | Good Alternative | Avoid | +| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs | ### Using special characters in data {#special} From f93b1ad0b47c0f073347f3760925cc9087a33ae7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:36 +0900 Subject: [PATCH 104/334] New translations 10-data-organisation.md (Japanese) --- locale/ja/episodes/10-data-organisation.Rmd | 27 ++++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd index 7fb6177a4..6e6dfacd7 100644 --- a/locale/ja/episodes/10-data-organisation.Rmd +++ b/locale/ja/episodes/10-data-organisation.Rmd @@ -24,6 +24,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## 表計算プログラム **質問** @@ -475,7 +478,7 @@ Carpentries レッスンを参照してください。データを制御**しま な null 値が使用されることもあります。 これは取得すべき重要 情報ですが、実際には 1 つの列を使用して 2 つの情報を取得することになります。 フォーマットを使用して -情報を伝える と同様に、ここでは「data_missing」のような新しい +情報を伝える と同様に、ここでは「data\_missing」のような新しい 列を作成し、その列を使用して の異なる理由をキャプチャすると良いでしょう。 @@ -506,7 +509,7 @@ Carpentries レッスンを参照してください。データを制御**しま ![](fig/good_formatting.png) -### 書式設定を使用してデータシートを美しく見せる {#formatting_pretty} +### 書式設定を使用してデータシートを美しく見せる {#formatting\_pretty} **例**: セルを結合します。 @@ -537,7 +540,7 @@ B+、A- などの ABO グループとアカゲザル グループを 1 つのセ この情報を含めるようにデータシートを設計します。 たとえば、ABO グループには 1 つの列を含め、Rhesus グループには つの列を含めます。 -### 問題のあるフィールド名 {#field_name} の使用 +### 問題のあるフィールド名 {#field\_name} の使用 説明的なフィールド名を選択しますが、スペース、 数字、またはいかなる種類の特殊文字も含めないように注意してください。 スペースは、空白を区切り文字として使用するパーサーによって @@ -554,15 +557,15 @@ B+、A- などの ABO グループとアカゲザル グループを 1 つのセ **例** -| いい名前 | 良い代替品 | 避ける | -| ------------------------------------------------- | --------------------------- | ---------------------------- | -| 最高_温度_C | 最大温度 | 最高温度 (°C) | -| 降水量_mm | 降水量 | プレcmm | -| 平均_年_成長 | 平均年成長 | 平均成長率/年 
| -| セックス | セックス | 男/女 | -| 重さ | 重さ | w。 | -| セル_タイプ | セルタイプ | 細胞の種類 | -| 観察_01 | 最初の_観察 | 1回目の観測 | +| いい名前 | 良い代替品 | 避ける | +| ----------------------------------------------------- | ----------------------------- | ---------------------------- | +| 最高\_温度\_C | 最大温度 | 最高温度 (°C) | +| 降水量\_mm | 降水量 | プレcmm | +| 平均\_年\_成長 | 平均年成長 | 平均成長率/年 | +| sex | セックス | 男/女 | +| weight | 重さ | w。 | +| セル\_タイプ | セルタイプ | 細胞の種類 | +| 観察\_01 | 最初の\_観察 | 1回目の観測 | ### データ {#special}での特殊文字の使用 From 11a160ae0d4428ae14b6e11a3a261b1f88338495 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:38 +0900 Subject: [PATCH 105/334] New translations 10-data-organisation.md (Portuguese) --- locale/pt/episodes/10-data-organisation.Rmd | 29 ++++++++++++--------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/locale/pt/episodes/10-data-organisation.Rmd b/locale/pt/episodes/10-data-organisation.Rmd index 9c6928fb3..548fd279e 100644 --- a/locale/pt/episodes/10-data-organisation.Rmd +++ b/locale/pt/episodes/10-data-organisation.Rmd @@ -24,6 +24,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Spreadsheet programs **Question** @@ -216,7 +219,7 @@ Put these principles in to practice today during your exercises. Enquanto o controle de versão está fora de escopo, você pode ver a aula do Carpentries em -['Git'](https\://swcarpentry. ithub.io/git-novice/) para aprender como +['Git'](https://swcarpentry. ithub.io/git-novice/) para aprender como manter um **controle de versão** sobre seus dados. Veja também este blog post para um tutorial rápido, ou @Perez-Riverol:2016 para um exemplo mais voltado à pesquisa. @@ -463,7 +466,7 @@ antes da análise. Outras vezes valores nulos diferentes são usados para transm diferentes razões porque os dados não estão lá. 
Essa é uma informação importante para capturar, mas está em vigor usando uma coluna para capturar dois tipos de informação diferentes. Assim como para [usando formatação para transmitir informação](#formatting) seria bom aqui criar uma nova coluna -como 'data_missing' e usar essa coluna para capturar as diferentes razões para o dado ser nulo. +como 'data\_missing' e usar essa coluna para capturar as diferentes razões para o dado ser nulo. Seja qual for a razão, é um problema se dados desconhecidos ou ausentes são registrados como -999, 999 ou 0. @@ -492,7 +495,7 @@ excluded. ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting_pretty} +### Using formatting to make the data sheet look pretty {#formatting\_pretty} **Example**: merging cells. @@ -521,7 +524,7 @@ precisar destas duas medidas, crie sua tabela para incluir estas informações. Por exemplo, inclua uma coluna para o grupo ABO e uma para o grupo Rhesus. -### Using problematic field names {#field_name} +### Using problematic field names {#field\_name} Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Os espaços podem ser @@ -537,15 +540,15 @@ confusão e permitem que outros interpretem prontamente suas colunas. **Examples** -| Good Name | Good Alternative | Avoid | -| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | -| Max_temp_C | MaxTemp | Maximum Temp (°C) | -| Precipitation_mm | Precipitation | precmm | -| Mean_year_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. 
| -| cell_type | CellType | Cell Type | -| Observation_01 | first_observation | 1st Obs | +| Good Name | Good Alternative | Avoid | +| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. | +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs | ### Using special characters in data {#special} From 26e1b506aab23b0e36b0191dba60b3ff71fd1d45 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:40 +0900 Subject: [PATCH 106/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 27 ++++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd index d52686828..d702329b9 100644 --- a/locale/zh/episodes/10-data-organisation.Rmd +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -24,6 +24,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Spreadsheet programs **Question** @@ -482,7 +485,7 @@ different reasons why the data isn't there. This is important information to capture, but is in effect using one column to capture two pieces of information. Like for using formatting to convey information it would be good here to create a new -column like 'data_missing' and use that column to capture the +column like 'data\_missing' and use that column to capture the different reasons. Whatever the reason, it's a problem if unknown or missing data is @@ -512,7 +515,7 @@ excluded. 
![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting_pretty} +### Using formatting to make the data sheet look pretty {#formatting\_pretty} **Example**: merging cells. @@ -543,7 +546,7 @@ you need both these measurements, design your data sheet to include this information. For example, include one column for the ABO group and one for the Rhesus group. -### Using problematic field names {#field_name} +### Using problematic field names {#field\_name} Choose descriptive field names, but be careful not to include spaces, numbers, or special characters of any kind. Spaces can be @@ -560,15 +563,15 @@ confusion and enables others to readily interpret your fields. **Examples** -| Good Name | Good Alternative | Avoid | -| ---------------------------------------------------------- | -------------------------------------- | ------------------------------------ | -| Max_temp_C | MaxTemp | Maximum Temp (°C) | -| Precipitation_mm | Precipitation | precmm | -| Mean_year_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell_type | CellType | Cell Type | -| Observation_01 | first_observation | 1st Obs | +| Good Name | Good Alternative | Avoid | +| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs | ### Using special characters in data {#special} From e08a6b59b121126e0970881685e22ee7323d1459 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:42 +0900 Subject: [PATCH 107/334] New translations 20-r-rstudio.md (French) --- locale/fr/episodes/20-r-rstudio.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/locale/fr/episodes/20-r-rstudio.Rmd b/locale/fr/episodes/20-r-rstudio.Rmd index ad0b73472..6b0ca4095 100644 --- a/locale/fr/episodes/20-r-rstudio.Rmd +++ b/locale/fr/episodes/20-r-rstudio.Rmd @@ -23,6 +23,9 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## What is R? What is RStudio? The term [R](https://www.r-project.org/) is used to refer to the From 4b366287ad967b6d57c01b1487e4e970c310c506 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:44 +0900 Subject: [PATCH 108/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index f5f2e0aef..9edb0bc1e 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -23,6 +23,9 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## What is R? What is RStudio? 
The term [R](https://www.r-project.org/) is used to refer to the From 62f665a3e32120da53c2aaff50a867dc1695c0ff Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:46 +0900 Subject: [PATCH 109/334] New translations 20-r-rstudio.md (Japanese) --- locale/ja/episodes/20-r-rstudio.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/locale/ja/episodes/20-r-rstudio.Rmd b/locale/ja/episodes/20-r-rstudio.Rmd index 1ab9ae4d1..6e7104273 100644 --- a/locale/ja/episodes/20-r-rstudio.Rmd +++ b/locale/ja/episodes/20-r-rstudio.Rmd @@ -23,6 +23,9 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## What is R? RStudioとは何ですか? The term [R](https://www.r-project.org/) is used to refer to the From 39240bc5a69327bff48b998b4a7c56d8204ab69b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:47 +0900 Subject: [PATCH 110/334] New translations 20-r-rstudio.md (Portuguese) --- locale/pt/episodes/20-r-rstudio.Rmd | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/locale/pt/episodes/20-r-rstudio.Rmd b/locale/pt/episodes/20-r-rstudio.Rmd index b143508b9..2dcfa53bc 100644 --- a/locale/pt/episodes/20-r-rstudio.Rmd +++ b/locale/pt/episodes/20-r-rstudio.Rmd @@ -23,6 +23,9 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## What is R? O que é RStudio? O termo [R](https://www.r-project.org/) é utilizado para designar a @@ -495,7 +498,7 @@ generalizar o que está fazendo para que mesmo as pessoas que não estão na sua possam compreender a pergunta. Por exemplo, pode em vez de utilizar um subconjunto do seu conjunto de dados real, criar um pequeno (3 colunas, 5 linhas) arquivo genérico. 
Para mais informações sobre como escrever um exemplo reprodutível em, consulte [este artigo de Hadley -Wickham] (https\://adv-r.had.co.nz/Reproducibility.html). +Wickham] (https://adv-r.had.co.nz/Reproducibility.html). Para compartilhar um objeto com outra pessoa, se for relativamente pequeno, você pode usar a função `dput()`. It will output R code that can be used @@ -584,7 +587,7 @@ sessionInfo() is very helpful to create reproducible examples when asking for help. A comunidade rOpenSci "How to ask questions so they get answered" ([Github - link](https\://github. om/ropensci/commcalls/issues/14) e [gravação de vídeo](https://vimeo.com/208749032)) inclui uma apresentação de + link](https://github. om/ropensci/commcalls/issues/14) e [gravação de vídeo](https://vimeo.com/208749032)) inclui uma apresentação de o pacote reprex e sua filosofia. ## R packages From b849b3ca7e2d739b9ac06082cf08b80732747627 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:49 +0900 Subject: [PATCH 111/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd index ad0b73472..6b0ca4095 100644 --- a/locale/zh/episodes/20-r-rstudio.Rmd +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -23,6 +23,9 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## What is R? What is RStudio? 
The term [R](https://www.r-project.org/) is used to refer to the From 8e04183f413652bae6ad158e67691a34602775bd Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:51 +0900 Subject: [PATCH 112/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index 47ac62388..410e507fd 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -28,6 +28,9 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Creating objects in R You can get output from R simply by typing math in the console: @@ -53,9 +56,9 @@ context. Because of the [slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) in syntax, it is good practice to always use `<-` for assignments. -In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> -at the same time as the <kbd>-</kbd> key) will write `<-` in a single -keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>\-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. ### Naming variables @@ -139,7 +142,7 @@ weight_kg <- 100 ## Challenge: What do you think is the current content of the object `weight_lb`? -126.5 or 220? +126\.5 or 220? 
:::::::::::::::::::::::::::::::::::::::::::::::::: From 52750de0019fd0588f240bc67b04b09fc7425303 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:53 +0900 Subject: [PATCH 113/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index 88f9cfc4d..49ba99f09 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -28,6 +28,9 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Creating objects in R You can get output from R simply by typing math in the console: @@ -53,9 +56,9 @@ context. Because of the [slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) in syntax, it is good practice to always use `<-` for assignments. -In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> -at the same time as the <kbd>-</kbd> key) will write `<-` in a single -keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>\-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. ### Naming variables @@ -139,7 +142,7 @@ weight_kg <- 100 ## Challenge: What do you think is the current content of the object `weight_lb`? -126.5 or 220? +126\.5 or 220? 
:::::::::::::::::::::::::::::::::::::::::::::::::: From 09365d43e7cf3c395042cdb48ca7415450b614ec Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:55 +0900 Subject: [PATCH 114/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 79eabfdd7..29a615048 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -28,6 +28,9 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## R でオブジェクトを作成する コンソールに math と入力するだけで、R から出力を取得できます。 @@ -50,12 +53,12 @@ _オブジェクト_ に割り当てる必要があります。 オブジェク `3` になります。 矢印は 3 **が `x` に入る** と読むことができます。 歴史的 理由により、代入に `=` を使用することもできますが、 のコンテキストで使用できるわけではありません。 構文に -わずかな違い](https\://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +わずかな違い](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) があるため、常に `< を使用することをお勧めします。 -` 割り当て用。 -RStudio では、 <kbd>オプション</kbd> を入力しながら、 <kbd>Alt</kbd> + <kbd>-</kbd> を入力すると ( <kbd>-</kbd> キーと同時に <kbd>Alt</kbd> +RStudio では、 <kbd>オプション</kbd> を入力しながら、 <kbd>Alt</kbd> + <kbd>\-</kbd> を入力すると ( <kbd>\-</kbd> キーと同時に <kbd>Alt</kbd> を押すと)、PC で -1 回のキーストロークで `<-` が書き込まれます。 + <kbd>-</kbd> ( <kbd>オプション</kbd> <kbd>-</kbd> キーと同時に押す) は、Mac でも +1 回のキーストロークで `<-` が書き込まれます。 + <kbd>\-</kbd> ( <kbd>オプション</kbd> <kbd>\-</kbd> キーと同時に押す) は、Mac でも と同じことを行います。 ### 変数に名前を付ける @@ -139,7 +142,7 @@ R のメモリに「weight_kg」があるので、それを使って算術演算 ## チャレンジ: オブジェクト「weight_lb」の現在の内容は何だと思いますか? -126.5 それとも 220? +126\.5 それとも 220? 
:::::::::::::::::::::::::::::::::::::::::::::: From 9064bdc34cb679e0f7caa2bdeecb97371f47fc9a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:57 +0900 Subject: [PATCH 115/334] New translations 23-starting-with-r.md (Portuguese) --- locale/pt/episodes/23-starting-with-r.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/pt/episodes/23-starting-with-r.Rmd b/locale/pt/episodes/23-starting-with-r.Rmd index 47ac62388..410e507fd 100644 --- a/locale/pt/episodes/23-starting-with-r.Rmd +++ b/locale/pt/episodes/23-starting-with-r.Rmd @@ -28,6 +28,9 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Creating objects in R You can get output from R simply by typing math in the console: @@ -53,9 +56,9 @@ context. Because of the [slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) in syntax, it is good practice to always use `<-` for assignments. -In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> -at the same time as the <kbd>-</kbd> key) will write `<-` in a single -keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>\-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. ### Naming variables @@ -139,7 +142,7 @@ weight_kg <- 100 ## Challenge: What do you think is the current content of the object `weight_lb`? -126.5 or 220? +126\.5 or 220? 
:::::::::::::::::::::::::::::::::::::::::::::::::: From e9afbec4b97b0ae9d0ec607d868afd22a5100c8d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:19:59 +0900 Subject: [PATCH 116/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index 47ac62388..410e507fd 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -28,6 +28,9 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Creating objects in R You can get output from R simply by typing math in the console: @@ -53,9 +56,9 @@ context. Because of the [slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) in syntax, it is good practice to always use `<-` for assignments. -In RStudio, typing <kbd>Alt</kbd> + <kbd>-</kbd> (push <kbd>Alt</kbd> -at the same time as the <kbd>-</kbd> key) will write `<-` in a single -keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>-</kbd> key) does the +In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> +at the same time as the <kbd>\-</kbd> key) will write `<-` in a single +keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. ### Naming variables @@ -139,7 +142,7 @@ weight_kg <- 100 ## Challenge: What do you think is the current content of the object `weight_lb`? -126.5 or 220? +126\.5 or 220? 
:::::::::::::::::::::::::::::::::::::::::::::::::: From f7496e1c10e246950f55b3f452528e433f408b81 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:01 +0900 Subject: [PATCH 117/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 27 +++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd index bc29da0cd..8506d99ee 100644 --- a/locale/fr/episodes/25-starting-with-data.Rmd +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -27,6 +27,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Presentation of the gene expression data We are going to use part of the data published by Blackmore , _The @@ -43,19 +46,19 @@ The dataset is stored as a comma-separated values (CSV) file. 
Each row holds information for a single RNA expression measurement, and the first eleven columns represent: -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | | infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | | tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | +| mouse | The mouse unique identifier. | We are going to use the R function `download.file()` to download the CSV file that contains the gene expression data, and we will use @@ -89,7 +92,7 @@ typing its name: rna ``` -Wow\... that was a lot of output. At least it means the data loaded +Wow... that was a lot of output. At least it means the data loaded properly. 
Let's check the top (the first 6 lines) of this data frame using the function `head()`: From 65827a868a4ca48c418b070a89861c16fece3fb2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:03 +0900 Subject: [PATCH 118/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 27 +++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index ea2a353ca..65bef1be3 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -27,6 +27,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Presentation of the gene expression data We are going to use part of the data published by Blackmore , _The @@ -43,19 +46,19 @@ The dataset is stored as a comma-separated values (CSV) file. 
Each row holds information for a single RNA expression measurement, and the first eleven columns represent: -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | | infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | | tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | +| mouse | The mouse unique identifier. | We are going to use the R function `download.file()` to download the CSV file that contains the gene expression data, and we will use @@ -89,7 +92,7 @@ typing its name: rna ``` -Wow\... that was a lot of output. At least it means the data loaded +Wow... that was a lot of output. At least it means the data loaded properly. 
Let's check the top (the first 6 lines) of this data frame using the function `head()`: From c784e2d18745a3b49cc0e1d7dd42cf3d103f622b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:05 +0900 Subject: [PATCH 119/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 27 +++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index b473a51d9..cfd7485b3 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -27,6 +27,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Presentation of the gene expression data We are going to use part of the data published by Blackmore , _The @@ -43,19 +46,19 @@ The dataset is stored as a comma-separated values (CSV) file. 
Each row holds information for a single RNA expression measurement, and the first eleven columns represent: -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | | infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | | tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | +| mouse | The mouse unique identifier. | We are going to use the R function `download.file()` to download the CSV file that contains the gene expression data, and we will use @@ -89,7 +92,7 @@ typing its name: rna ``` -Wow\... that was a lot of output. At least it means the data loaded +Wow... that was a lot of output. At least it means the data loaded properly. 
Let's check the top (the first 6 lines) of this data frame using the function `head()`: From 38aa2ebcf87f4cccad985f991ad750be1912f84d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:09 +0900 Subject: [PATCH 120/334] New translations 25-starting-with-data.md (Portuguese) --- locale/pt/episodes/25-starting-with-data.Rmd | 27 +++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/pt/episodes/25-starting-with-data.Rmd b/locale/pt/episodes/25-starting-with-data.Rmd index cd371d4de..21ab5df47 100644 --- a/locale/pt/episodes/25-starting-with-data.Rmd +++ b/locale/pt/episodes/25-starting-with-data.Rmd @@ -27,6 +27,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Presentation of the gene expression data We are going to use part of the data published by Blackmore , _The @@ -43,19 +46,19 @@ The dataset is stored as a comma-separated values (CSV) file. 
Each row holds information for a single RNA expression measurement, and the first eleven columns represent: -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | | infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | | tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | +| mouse | The mouse unique identifier. | We are going to use the R function `download.file()` to download the CSV file that contains the gene expression data, and we will use @@ -89,7 +92,7 @@ typing its name: rna ``` -Wow\... that was a lot of output. At least it means the data loaded +Wow... that was a lot of output. At least it means the data loaded properly. 
Let's check the top (the first 6 lines) of this data frame using the function `head()`: From 86e5653aa8df793df7cdf8b3758d89a39b6ed666 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:11 +0900 Subject: [PATCH 121/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 27 +++++++++++--------- 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd index bc29da0cd..8506d99ee 100644 --- a/locale/zh/episodes/25-starting-with-data.Rmd +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -27,6 +27,9 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Presentation of the gene expression data We are going to use part of the data published by Blackmore , _The @@ -43,19 +46,19 @@ The dataset is stored as a comma-separated values (CSV) file. 
Each row holds information for a single RNA expression measurement, and the first eleven columns represent: -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | +| Column | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | +| gene | The name of the gene that was measured | +| sample | The name of the sample the gene expression was measured in | +| expression | The value of the gene expression | +| organism | The organism/species - here all data stem from mice | +| age | The age of the mouse (all mice were 8 weeks here) | +| sex | The sex of the mouse | | infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | +| strain | The Influenza A strain. | +| time | The duration of the infection (in days). | | tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | +| mouse | The mouse unique identifier. | We are going to use the R function `download.file()` to download the CSV file that contains the gene expression data, and we will use @@ -89,7 +92,7 @@ typing its name: rna ``` -Wow\... that was a lot of output. At least it means the data loaded +Wow... that was a lot of output. At least it means the data loaded properly. 
Let's check the top (the first 6 lines) of this data frame using the function `head()`: From 341fa0a68484a0296c2f5615b139413121799183 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:13 +0900 Subject: [PATCH 122/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index d41f82e5f..b50395a63 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to @@ -103,7 +106,7 @@ previously. The data structure is very similar to a data frame. For our purposes the only differences are that: 1. It displays the data type of each column under its name. - Note that <`dbl`> is a data type defined to hold numeric values with + Note that \<`dbl`\> is a data type defined to hold numeric values with decimal points. 2. It only prints the first few rows of data and only as many columns as fit on @@ -167,7 +170,7 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) If we want to keep only mouse genes that have a human homolog, we can insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens_homolog_associated_gene_name _is not_ an +every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an `NA`. ```{r, purl=TRUE} @@ -305,7 +308,7 @@ criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns. The expression values should be log-transformed. 
This data frame must only contain genes located on sex chromosomes, associated with a -phenotype_description, and with a log expression higher than 5. +phenotype\_description, and with a log expression higher than 5. **Hint**: think about how the commands should be ordered to produce this data frame! @@ -529,7 +532,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). The variables that don't change with genes or with samples will have the same value in all the rows. ```{r} @@ -815,7 +818,7 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot_wider() function +before using the pivot\_wider() function ```{r} rna_time <- rna %>% @@ -839,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\`" +To select the timepoint 4, we would have to quote the column name, with backticks "\\`" ```{r} rna %>% @@ -880,7 +883,7 @@ Convert this table into a long-format table gathering the fold-changes calculate ## Solution -Starting from the rna_time tibble: +Starting from the rna\_time tibble: ```{r} rna_time @@ -893,7 +896,7 @@ rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot_longer() function: +And use the pivot\_longer() function: ```{r} rna_time %>% @@ -938,7 +941,7 @@ rna_mini ``` The second table, `annot1`, contains 2 columns, gene and -gene_description. You can either +gene\_description. 
You can either [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) by clicking on the link and then moving it to the `data/` folder, or you can use the R code below to download it directly to the folder. @@ -1031,7 +1034,7 @@ or modify it. In contrast, our script will generate the contents of the `data_ou directory, so even if the files it contains are deleted, we can always re-generate them. -Let's use `write_csv()` to save the rna_wide table that we have created previously. +Let's use `write_csv()` to save the rna\_wide table that we have created previously. ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From 2bcffe0f0013bf42a2f8602fea0cf7fb2965105d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:15 +0900 Subject: [PATCH 123/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index fd4b2b14f..b2360ec26 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to @@ -103,7 +106,7 @@ previously. The data structure is very similar to a data frame. For our purposes the only differences are that: 1. It displays the data type of each column under its name. - Note that <`dbl`> is a data type defined to hold numeric values with + Note that \<`dbl`\> is a data type defined to hold numeric values with decimal points. 2. 
It only prints the first few rows of data and only as many columns as fit on @@ -167,7 +170,7 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) If we want to keep only mouse genes that have a human homolog, we can insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens_homolog_associated_gene_name _is not_ an +every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an `NA`. ```{r, purl=TRUE} @@ -305,7 +308,7 @@ criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns. The expression values should be log-transformed. This data frame must only contain genes located on sex chromosomes, associated with a -phenotype_description, and with a log expression higher than 5. +phenotype\_description, and with a log expression higher than 5. **Hint**: think about how the commands should be ordered to produce this data frame! @@ -529,7 +532,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). The variables that don't change with genes or with samples will have the same value in all the rows. 
```{r} @@ -815,7 +818,7 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot_wider() function +before using the pivot\_wider() function ```{r} rna_time <- rna %>% @@ -839,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\`" +To select the timepoint 4, we would have to quote the column name, with backticks "\\`" ```{r} rna %>% @@ -880,7 +883,7 @@ Convert this table into a long-format table gathering the fold-changes calculate ## Solution -Starting from the rna_time tibble: +Starting from the rna\_time tibble: ```{r} rna_time @@ -893,7 +896,7 @@ rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot_longer() function: +And use the pivot\_longer() function: ```{r} rna_time %>% @@ -938,7 +941,7 @@ rna_mini ``` The second table, `annot1`, contains 2 columns, gene and -gene_description. You can either +gene\_description. You can either [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) by clicking on the link and then moving it to the `data/` folder, or you can use the R code below to download it directly to the folder. @@ -1031,7 +1034,7 @@ or modify it. In contrast, our script will generate the contents of the `data_ou directory, so even if the files it contains are deleted, we can always re-generate them. -Let's use `write_csv()` to save the rna_wide table that we have created previously. +Let's use `write_csv()` to save the rna\_wide table that we have created previously. 
```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From e5a3ea9d45e0799c5c334c1da2217d67c8372292 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:17 +0900 Subject: [PATCH 124/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index d26d30d12..0af50f431 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to @@ -103,7 +106,7 @@ previously. The data structure is very similar to a data frame. For our purposes the only differences are that: 1. It displays the data type of each column under its name. - Note that <`dbl`> is a data type defined to hold numeric values with + Note that \<`dbl`\> is a data type defined to hold numeric values with decimal points. 2. It only prints the first few rows of data and only as many columns as fit on @@ -167,7 +170,7 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) If we want to keep only mouse genes that have a human homolog, we can insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens_homolog_associated_gene_name _is not_ an +every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an `NA`. ```{r, purl=TRUE} @@ -305,7 +308,7 @@ criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns. The expression values should be log-transformed. 
This data frame must only contain genes located on sex chromosomes, associated with a -phenotype_description, and with a log expression higher than 5. +phenotype\_description, and with a log expression higher than 5. **Hint**: think about how the commands should be ordered to produce this data frame! @@ -529,7 +532,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). The variables that don't change with genes or with samples will have the same value in all the rows. ```{r} @@ -815,7 +818,7 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot_wider() function +before using the pivot\_wider() function ```{r} rna_time <- rna %>% @@ -839,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\`" +To select the timepoint 4, we would have to quote the column name, with backticks "\\`" ```{r} rna %>% @@ -880,7 +883,7 @@ Convert this table into a long-format table gathering the fold-changes calculate ## Solution -Starting from the rna_time tibble: +Starting from the rna\_time tibble: ```{r} rna_time @@ -893,7 +896,7 @@ rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot_longer() function: +And use the pivot\_longer() function: ```{r} rna_time %>% @@ -938,7 +941,7 @@ rna_mini ``` The second table, `annot1`, contains 2 columns, gene and -gene_description. You can either +gene\_description. 
You can either [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) by clicking on the link and then moving it to the `data/` folder, or you can use the R code below to download it directly to the folder. @@ -1031,7 +1034,7 @@ or modify it. In contrast, our script will generate the contents of the `data_ou directory, so even if the files it contains are deleted, we can always re-generate them. -Let's use `write_csv()` to save the rna_wide table that we have created previously. +Let's use `write_csv()` to save the rna\_wide table that we have created previously. ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From 131a1fd3690d4217d242f4bec1415f2110af1e4e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:19 +0900 Subject: [PATCH 125/334] New translations 30-dplyr.md (Portuguese) --- locale/pt/episodes/30-dplyr.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/locale/pt/episodes/30-dplyr.Rmd b/locale/pt/episodes/30-dplyr.Rmd index d41f82e5f..b50395a63 100644 --- a/locale/pt/episodes/30-dplyr.Rmd +++ b/locale/pt/episodes/30-dplyr.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to @@ -103,7 +106,7 @@ previously. The data structure is very similar to a data frame. For our purposes the only differences are that: 1. It displays the data type of each column under its name. - Note that <`dbl`> is a data type defined to hold numeric values with + Note that \<`dbl`\> is a data type defined to hold numeric values with decimal points. 2. 
It only prints the first few rows of data and only as many columns as fit on @@ -167,7 +170,7 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) If we want to keep only mouse genes that have a human homolog, we can insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens_homolog_associated_gene_name _is not_ an +every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an `NA`. ```{r, purl=TRUE} @@ -305,7 +308,7 @@ criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns. The expression values should be log-transformed. This data frame must only contain genes located on sex chromosomes, associated with a -phenotype_description, and with a log expression higher than 5. +phenotype\_description, and with a log expression higher than 5. **Hint**: think about how the commands should be ordered to produce this data frame! @@ -529,7 +532,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). The variables that don't change with genes or with samples will have the same value in all the rows. 
```{r} @@ -815,7 +818,7 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot_wider() function +before using the pivot\_wider() function ```{r} rna_time <- rna %>% @@ -839,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\`" +To select the timepoint 4, we would have to quote the column name, with backticks "\\`" ```{r} rna %>% @@ -880,7 +883,7 @@ Convert this table into a long-format table gathering the fold-changes calculate ## Solution -Starting from the rna_time tibble: +Starting from the rna\_time tibble: ```{r} rna_time @@ -893,7 +896,7 @@ rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot_longer() function: +And use the pivot\_longer() function: ```{r} rna_time %>% @@ -938,7 +941,7 @@ rna_mini ``` The second table, `annot1`, contains 2 columns, gene and -gene_description. You can either +gene\_description. You can either [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) by clicking on the link and then moving it to the `data/` folder, or you can use the R code below to download it directly to the folder. @@ -1031,7 +1034,7 @@ or modify it. In contrast, our script will generate the contents of the `data_ou directory, so even if the files it contains are deleted, we can always re-generate them. -Let's use `write_csv()` to save the rna_wide table that we have created previously. +Let's use `write_csv()` to save the rna\_wide table that we have created previously. 
```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From 2a216138fea46c9d5e83756b32481bfe0c195e1e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:21 +0900 Subject: [PATCH 126/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index d41f82e5f..b50395a63 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to @@ -103,7 +106,7 @@ previously. The data structure is very similar to a data frame. For our purposes the only differences are that: 1. It displays the data type of each column under its name. - Note that <`dbl`> is a data type defined to hold numeric values with + Note that \<`dbl`\> is a data type defined to hold numeric values with decimal points. 2. It only prints the first few rows of data and only as many columns as fit on @@ -167,7 +170,7 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) If we want to keep only mouse genes that have a human homolog, we can insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens_homolog_associated_gene_name _is not_ an +every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an `NA`. ```{r, purl=TRUE} @@ -305,7 +308,7 @@ criteria: contains only the `gene`, `chromosome_name`, `phenotype_description`, `sample`, and `expression` columns. The expression values should be log-transformed. 
This data frame must only contain genes located on sex chromosomes, associated with a -phenotype_description, and with a log expression higher than 5. +phenotype\_description, and with a log expression higher than 5. **Hint**: think about how the commands should be ordered to produce this data frame! @@ -529,7 +532,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). +the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). The variables that don't change with genes or with samples will have the same value in all the rows. ```{r} @@ -815,7 +818,7 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot_wider() function +before using the pivot\_wider() function ```{r} rna_time <- rna %>% @@ -839,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\`" +To select the timepoint 4, we would have to quote the column name, with backticks "\\`" ```{r} rna %>% @@ -880,7 +883,7 @@ Convert this table into a long-format table gathering the fold-changes calculate ## Solution -Starting from the rna_time tibble: +Starting from the rna\_time tibble: ```{r} rna_time @@ -893,7 +896,7 @@ rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot_longer() function: +And use the pivot\_longer() function: ```{r} rna_time %>% @@ -938,7 +941,7 @@ rna_mini ``` The second table, `annot1`, contains 2 columns, gene and -gene_description. You can either +gene\_description. 
You can either [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) by clicking on the link and then moving it to the `data/` folder, or you can use the R code below to download it directly to the folder. @@ -1031,7 +1034,7 @@ or modify it. In contrast, our script will generate the contents of the `data_ou directory, so even if the files it contains are deleted, we can always re-generate them. -Let's use `write_csv()` to save the rna_wide table that we have created previously. +Let's use `write_csv()` to save the rna\_wide table that we have created previously. ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From 97e20b55bb76fb86399b29b9acca9706de8f16b9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:23 +0900 Subject: [PATCH 127/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index b1ab2920c..5500e95c3 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data Visualization We start by loading the required packages. **`ggplot2`** is included in @@ -83,7 +86,7 @@ customization of plots. > The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\_comp\_ggplot2] +> and (3) geoms — i.e. 
visual marks that represent data points \[^three\\_comp\\_ggplot2] [^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). @@ -733,7 +736,7 @@ Take a look at the ggplot2, and think of ways you could improve the plot. Now, we can change names of axes to something more informative than -'time' and 'mean_exp', and add a title to the figure: +'time' and 'mean\_exp', and add a title to the figure: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -820,7 +823,7 @@ for inspiration. Here are some ideas: `scale_`) - Try using a different color palette or manually specifying the colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). ::::::::::::::: solution From 9398620ed8e85862556b1257ff1c654760dedc08 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:25 +0900 Subject: [PATCH 128/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd index 1c3b31c29..f0b7de9b3 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data Visualization We start by loading the required packages. **`ggplot2`** is included in @@ -83,7 +86,7 @@ customization of plots. 
> The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\_comp\_ggplot2] +> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] [^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). @@ -733,7 +736,7 @@ Take a look at the ggplot2, and think of ways you could improve the plot. Now, we can change names of axes to something more informative than -'time' and 'mean_exp', and add a title to the figure: +'time' and 'mean\_exp', and add a title to the figure: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -820,7 +823,7 @@ for inspiration. Here are some ideas: `scale_`) - Try using a different color palette or manually specifying the colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). 
::::::::::::::: solution From 42c596d5c416b43f1f3c47c3e9a259a084b3a570 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:27 +0900 Subject: [PATCH 129/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 425759cbc..9a926ff53 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data Visualization We start by loading the required packages. **`ggplot2`** is included in @@ -83,7 +86,7 @@ customization of plots. > The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\_comp\_ggplot2] +> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] [^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). @@ -733,7 +736,7 @@ Take a look at the ggplot2, and think of ways you could improve the plot. Now, we can change names of axes to something more informative than -'time' and 'mean_exp', and add a title to the figure: +'time' and 'mean\_exp', and add a title to the figure: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -820,7 +823,7 @@ for inspiration. 
Here are some ideas: `scale_`) - Try using a different color palette or manually specifying the colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). ::::::::::::::: solution From 67317e9711e7814f12ae8445583f3680407531b7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:29 +0900 Subject: [PATCH 130/334] New translations 40-visualization.md (Portuguese) --- locale/pt/episodes/40-visualization.Rmd | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/locale/pt/episodes/40-visualization.Rmd b/locale/pt/episodes/40-visualization.Rmd index a635340fc..be5335640 100644 --- a/locale/pt/episodes/40-visualization.Rmd +++ b/locale/pt/episodes/40-visualization.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data Visualization We start by loading the required packages. **`ggplot2`** está incluído no pacote **`tidyverse`**. @@ -67,7 +70,7 @@ criar gráficos com qualidade de publicação com o mínimo de ajustes e afinações. Existe um livro sobre `ggplot2` (@ggplot2book) que fornece uma boa visão geral, mas está desatualizado. A 3ª edição está a ser preparada e será -[disponível gratuitamente online] (https\://ggplot2-book.org/). A página `ggplot2` +[disponível gratuitamente online] (https://ggplot2-book.org/). A página `ggplot2` ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) fornece uma ampla documentação. O `ggplot2` funciona como dados no formato 'long', ou seja, uma coluna para @@ -80,7 +83,7 @@ personalização das parcelas. 
> The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\_comp\_ggplot2] +> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] [^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). @@ -730,7 +733,7 @@ Take a look at the ggplot2, and think of ways you could improve the plot. Now, we can change names of axes to something more informative than -'time' and 'mean_exp', and add a title to the figure: +'time' and 'mean\_exp', and add a title to the figure: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -817,7 +820,7 @@ for inspiration. Here are some ideas: `scale_`) - Try using a different color palette or manually specifying the colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). 
::::::::::::::: solution From 1587176f4890866c2920eb81580af57541dd1c04 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:31 +0900 Subject: [PATCH 131/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index b1ab2920c..5500e95c3 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -31,6 +31,9 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` +> This episode is based on the Data Carpentries's _Data Analysis and +> Visualisation in R for Ecologists_ lesson. + ## Data Visualization We start by loading the required packages. **`ggplot2`** is included in @@ -83,7 +86,7 @@ customization of plots. > The idea behind the Grammar of Graphics it is that you can build every > graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\_comp\_ggplot2] +> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] [^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). @@ -733,7 +736,7 @@ Take a look at the ggplot2, and think of ways you could improve the plot. Now, we can change names of axes to something more informative than -'time' and 'mean_exp', and add a title to the figure: +'time' and 'mean\_exp', and add a title to the figure: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -820,7 +823,7 @@ for inspiration. 
Here are some ideas: `scale_`) - Try using a different color palette or manually specifying the colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/)). + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). ::::::::::::::: solution From e08bc9e8563c8ed3faaadda1a7935c7cd0d39c67 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:33 +0900 Subject: [PATCH 132/334] New translations 60-next-steps.md (French) --- locale/fr/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/60-next-steps.Rmd b/locale/fr/episodes/60-next-steps.Rmd index 77da2a8ad..89511b1ab 100644 --- a/locale/fr/episodes/60-next-steps.Rmd +++ b/locale/fr/episodes/60-next-steps.Rmd @@ -374,7 +374,7 @@ se It's still a `SummarizedExperiment` object, so maintains the efficient structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`-`tibble` +the output says this, it's a `SummarizedExperiment`\-`tibble` abstraction. We can also see in the second line of the output the number of transcripts and samples. From 7bbe21a09d6a7a6d53817e29431fb110e911cdfd Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:34 +0900 Subject: [PATCH 133/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index 742cc9c2b..fce5527bf 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -374,7 +374,7 @@ se It's still a `SummarizedExperiment` object, so maintains the efficient structure, but now we can view it as a tibble. 
Note the first line of -the output says this, it's a `SummarizedExperiment`-`tibble` +the output says this, it's a `SummarizedExperiment`\-`tibble` abstraction. We can also see in the second line of the output the number of transcripts and samples. From bbdf06e4396f179252891fdf77b863cbb8719da8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:36 +0900 Subject: [PATCH 134/334] New translations 60-next-steps.md (Japanese) --- locale/ja/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/60-next-steps.Rmd b/locale/ja/episodes/60-next-steps.Rmd index aa91aaaf6..d2b2d186e 100644 --- a/locale/ja/episodes/60-next-steps.Rmd +++ b/locale/ja/episodes/60-next-steps.Rmd @@ -374,7 +374,7 @@ se It's still a `SummarizedExperiment` object, so maintains the efficient structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`-`tibble` +the output says this, it's a `SummarizedExperiment`\-`tibble` abstraction. We can also see in the second line of the output the number of transcripts and samples. From 9a2b2b07e608333fea6673a2e19c26632f3bddcc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:37 +0900 Subject: [PATCH 135/334] New translations 60-next-steps.md (Portuguese) --- locale/pt/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/pt/episodes/60-next-steps.Rmd b/locale/pt/episodes/60-next-steps.Rmd index 3ecb0f797..b261371af 100644 --- a/locale/pt/episodes/60-next-steps.Rmd +++ b/locale/pt/episodes/60-next-steps.Rmd @@ -364,7 +364,7 @@ se It's still a `SummarizedExperiment` object, so maintains the efficient structure, but now we can view it as a tibble. Repare que na primeira linha do output diz isto: -`SummarizedExperiment`-`tibble` +`SummarizedExperiment`\-`tibble` abstraction. 
Também podemos ver na segunda linha do output o número de transcrições e amostras. From 7e4243230c89e4718dbe2674116e9890a2c861b0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:39 +0900 Subject: [PATCH 136/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd index 77da2a8ad..89511b1ab 100644 --- a/locale/zh/episodes/60-next-steps.Rmd +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -374,7 +374,7 @@ se It's still a `SummarizedExperiment` object, so maintains the efficient structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`-`tibble` +the output says this, it's a `SummarizedExperiment`\-`tibble` abstraction. We can also see in the second line of the output the number of transcripts and samples. From eb8a59416a8ac8da0812394158f24ab01716489e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:20:59 +0900 Subject: [PATCH 137/334] New translations code_of_conduct.md (French) --- locale/fr/CODE_OF_CONDUCT.md | 1 - 1 file changed, 1 deletion(-) diff --git a/locale/fr/CODE_OF_CONDUCT.md b/locale/fr/CODE_OF_CONDUCT.md index 11895988e..a820b8df5 100644 --- a/locale/fr/CODE_OF_CONDUCT.md +++ b/locale/fr/CODE_OF_CONDUCT.md @@ -9,5 +9,4 @@ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by following our [reporting guidelines][coc-reporting]. 
[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html - [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From bc6eef4521275f4737de2cd1c4f16287ae369f0b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:01 +0900 Subject: [PATCH 138/334] New translations code_of_conduct.md (Spanish) --- locale/es/CODE_OF_CONDUCT.md | 1 - 1 file changed, 1 deletion(-) diff --git a/locale/es/CODE_OF_CONDUCT.md b/locale/es/CODE_OF_CONDUCT.md index 11895988e..a820b8df5 100644 --- a/locale/es/CODE_OF_CONDUCT.md +++ b/locale/es/CODE_OF_CONDUCT.md @@ -9,5 +9,4 @@ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by following our [reporting guidelines][coc-reporting]. [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html - [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 7177cdb4f24f1da5a1d2470d0e4db6132acc3fd8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:02 +0900 Subject: [PATCH 139/334] New translations code_of_conduct.md (Japanese) --- locale/ja/CODE_OF_CONDUCT.md | 1 - 1 file changed, 1 deletion(-) diff --git a/locale/ja/CODE_OF_CONDUCT.md b/locale/ja/CODE_OF_CONDUCT.md index 11895988e..a820b8df5 100644 --- a/locale/ja/CODE_OF_CONDUCT.md +++ b/locale/ja/CODE_OF_CONDUCT.md @@ -9,5 +9,4 @@ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by following our [reporting guidelines][coc-reporting]. 
[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html - [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 7fe3a849c4280f3744a38ff0dc009b2acc353a5a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:03 +0900 Subject: [PATCH 140/334] New translations code_of_conduct.md (Portuguese) --- locale/pt/CODE_OF_CONDUCT.md | 1 - 1 file changed, 1 deletion(-) diff --git a/locale/pt/CODE_OF_CONDUCT.md b/locale/pt/CODE_OF_CONDUCT.md index 11895988e..a820b8df5 100644 --- a/locale/pt/CODE_OF_CONDUCT.md +++ b/locale/pt/CODE_OF_CONDUCT.md @@ -9,5 +9,4 @@ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by following our [reporting guidelines][coc-reporting]. [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html - [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 7fbfc75de78b5c408081e9e9d0da54f62d7cbb9e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:04 +0900 Subject: [PATCH 141/334] New translations code_of_conduct.md (Chinese Simplified) --- locale/zh/CODE_OF_CONDUCT.md | 1 - 1 file changed, 1 deletion(-) diff --git a/locale/zh/CODE_OF_CONDUCT.md b/locale/zh/CODE_OF_CONDUCT.md index 11895988e..a820b8df5 100644 --- a/locale/zh/CODE_OF_CONDUCT.md +++ b/locale/zh/CODE_OF_CONDUCT.md @@ -9,5 +9,4 @@ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by following our [reporting guidelines][coc-reporting]. 
[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html - [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 2e270d482d2d9cb79dda574e78109ebd8d213293 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:09 +0900 Subject: [PATCH 142/334] New translations contributing.md (French) --- locale/fr/CONTRIBUTING.md | 27 +++++++-------------------- 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/locale/fr/CONTRIBUTING.md b/locale/fr/CONTRIBUTING.md index e80f40421..e5957a520 100644 --- a/locale/fr/CONTRIBUTING.md +++ b/locale/fr/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https\://github.com/swcarpentry/shell-novice, - which can be viewed at https\://swcarpentry.github.io/shell-novice. + please work in https://github.com/swcarpentry/shell-novice, + which can be viewed at https://swcarpentry.github.io/shell-novice. 2. If you wish to change the example lesson, - please work in https\://github.com/carpentries/lesson-example, + please work in https://github.com/carpentries/lesson-example, which documents the format of our lessons - and can be viewed at https\://carpentries.github.io/lesson-example. + and can be viewed at https://carpentries.github.io/lesson-example. 3. If you wish to change the template used for workshop websites, - please work in https\://github.com/carpentries/workshop-template. + please work in https://github.com/carpentries/workshop-template. The home page of that repository explains how to set up workshop websites, - while the extra pages in https\://carpentries.github.io/workshop-template + while the extra pages in https://carpentries.github.io/workshop-template provide more background on our design choices. 4. 
If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https\://github.com/carpentries/styles. + please work in https://github.com/carpentries/styles. ## What to Contribute @@ -136,29 +136,16 @@ which everyone is welcome to join. You can also [reach us by email][contact]. [contact]: mailto:admin@software-carpentry.org - [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry - [dc-lessons]: http://datacarpentry.org/lessons/ - [dc-site]: http://datacarpentry.org/ - [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss - [github]: http://github.com - [github-flow]: https://guides.github.com/introduction/flow/ - [github-join]: https://github.com/join - [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github - [issues]: https://github.com/swcarpentry/shell-novice/issues/ - [repo]: https://github.com/swcarpentry/shell-novice/ - [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry - [swc-lessons]: http://software-carpentry.org/lessons/ - [swc-site]: http://software-carpentry.org/ From 5acccd8c833b293cd976bbdf3345dbdd5b1e7127 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:10 +0900 Subject: [PATCH 143/334] New translations contributing.md (Spanish) --- locale/es/CONTRIBUTING.md | 27 +++++++-------------------- 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/locale/es/CONTRIBUTING.md b/locale/es/CONTRIBUTING.md index e80f40421..e5957a520 100644 --- a/locale/es/CONTRIBUTING.md +++ b/locale/es/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https\://github.com/swcarpentry/shell-novice, - which can be viewed at https\://swcarpentry.github.io/shell-novice. 
+ please work in https://github.com/swcarpentry/shell-novice, + which can be viewed at https://swcarpentry.github.io/shell-novice. 2. If you wish to change the example lesson, - please work in https\://github.com/carpentries/lesson-example, + please work in https://github.com/carpentries/lesson-example, which documents the format of our lessons - and can be viewed at https\://carpentries.github.io/lesson-example. + and can be viewed at https://carpentries.github.io/lesson-example. 3. If you wish to change the template used for workshop websites, - please work in https\://github.com/carpentries/workshop-template. + please work in https://github.com/carpentries/workshop-template. The home page of that repository explains how to set up workshop websites, - while the extra pages in https\://carpentries.github.io/workshop-template + while the extra pages in https://carpentries.github.io/workshop-template provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https\://github.com/carpentries/styles. + please work in https://github.com/carpentries/styles. ## What to Contribute @@ -136,29 +136,16 @@ which everyone is welcome to join. You can also [reach us by email][contact]. 
[contact]: mailto:admin@software-carpentry.org - [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry - [dc-lessons]: http://datacarpentry.org/lessons/ - [dc-site]: http://datacarpentry.org/ - [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss - [github]: http://github.com - [github-flow]: https://guides.github.com/introduction/flow/ - [github-join]: https://github.com/join - [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github - [issues]: https://github.com/swcarpentry/shell-novice/issues/ - [repo]: https://github.com/swcarpentry/shell-novice/ - [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry - [swc-lessons]: http://software-carpentry.org/lessons/ - [swc-site]: http://software-carpentry.org/ From 732728cc6cc5c74eb6861dd0e5ef9cda4160d8ec Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:11 +0900 Subject: [PATCH 144/334] New translations contributing.md (Japanese) --- locale/ja/CONTRIBUTING.md | 27 +++++++-------------------- 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/locale/ja/CONTRIBUTING.md b/locale/ja/CONTRIBUTING.md index e80f40421..e5957a520 100644 --- a/locale/ja/CONTRIBUTING.md +++ b/locale/ja/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https\://github.com/swcarpentry/shell-novice, - which can be viewed at https\://swcarpentry.github.io/shell-novice. + please work in https://github.com/swcarpentry/shell-novice, + which can be viewed at https://swcarpentry.github.io/shell-novice. 2. If you wish to change the example lesson, - please work in https\://github.com/carpentries/lesson-example, + please work in https://github.com/carpentries/lesson-example, which documents the format of our lessons - and can be viewed at https\://carpentries.github.io/lesson-example. 
+ and can be viewed at https://carpentries.github.io/lesson-example. 3. If you wish to change the template used for workshop websites, - please work in https\://github.com/carpentries/workshop-template. + please work in https://github.com/carpentries/workshop-template. The home page of that repository explains how to set up workshop websites, - while the extra pages in https\://carpentries.github.io/workshop-template + while the extra pages in https://carpentries.github.io/workshop-template provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https\://github.com/carpentries/styles. + please work in https://github.com/carpentries/styles. ## What to Contribute @@ -136,29 +136,16 @@ which everyone is welcome to join. You can also [reach us by email][contact]. [contact]: mailto:admin@software-carpentry.org - [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry - [dc-lessons]: http://datacarpentry.org/lessons/ - [dc-site]: http://datacarpentry.org/ - [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss - [github]: http://github.com - [github-flow]: https://guides.github.com/introduction/flow/ - [github-join]: https://github.com/join - [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github - [issues]: https://github.com/swcarpentry/shell-novice/issues/ - [repo]: https://github.com/swcarpentry/shell-novice/ - [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry - [swc-lessons]: http://software-carpentry.org/lessons/ - [swc-site]: http://software-carpentry.org/ From 809bb41faf87eb538dc4f0e84473db5a98cecc74 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:13 +0900 Subject: [PATCH 145/334] New translations contributing.md (Portuguese) --- locale/pt/CONTRIBUTING.md | 27 +++++++-------------------- 1 file changed, 
7 insertions(+), 20 deletions(-) diff --git a/locale/pt/CONTRIBUTING.md b/locale/pt/CONTRIBUTING.md index e80f40421..e5957a520 100644 --- a/locale/pt/CONTRIBUTING.md +++ b/locale/pt/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https\://github.com/swcarpentry/shell-novice, - which can be viewed at https\://swcarpentry.github.io/shell-novice. + please work in https://github.com/swcarpentry/shell-novice, + which can be viewed at https://swcarpentry.github.io/shell-novice. 2. If you wish to change the example lesson, - please work in https\://github.com/carpentries/lesson-example, + please work in https://github.com/carpentries/lesson-example, which documents the format of our lessons - and can be viewed at https\://carpentries.github.io/lesson-example. + and can be viewed at https://carpentries.github.io/lesson-example. 3. If you wish to change the template used for workshop websites, - please work in https\://github.com/carpentries/workshop-template. + please work in https://github.com/carpentries/workshop-template. The home page of that repository explains how to set up workshop websites, - while the extra pages in https\://carpentries.github.io/workshop-template + while the extra pages in https://carpentries.github.io/workshop-template provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https\://github.com/carpentries/styles. + please work in https://github.com/carpentries/styles. ## What to Contribute @@ -136,29 +136,16 @@ which everyone is welcome to join. You can also [reach us by email][contact]. 
[contact]: mailto:admin@software-carpentry.org - [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry - [dc-lessons]: http://datacarpentry.org/lessons/ - [dc-site]: http://datacarpentry.org/ - [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss - [github]: http://github.com - [github-flow]: https://guides.github.com/introduction/flow/ - [github-join]: https://github.com/join - [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github - [issues]: https://github.com/swcarpentry/shell-novice/issues/ - [repo]: https://github.com/swcarpentry/shell-novice/ - [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry - [swc-lessons]: http://software-carpentry.org/lessons/ - [swc-site]: http://software-carpentry.org/ From 9e9231cb6235a792bfa6c19df0ee1eb2575f4099 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:14 +0900 Subject: [PATCH 146/334] New translations contributing.md (Chinese Simplified) --- locale/zh/CONTRIBUTING.md | 27 +++++++-------------------- 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/locale/zh/CONTRIBUTING.md b/locale/zh/CONTRIBUTING.md index e80f40421..e5957a520 100644 --- a/locale/zh/CONTRIBUTING.md +++ b/locale/zh/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https\://github.com/swcarpentry/shell-novice, - which can be viewed at https\://swcarpentry.github.io/shell-novice. + please work in https://github.com/swcarpentry/shell-novice, + which can be viewed at https://swcarpentry.github.io/shell-novice. 2. If you wish to change the example lesson, - please work in https\://github.com/carpentries/lesson-example, + please work in https://github.com/carpentries/lesson-example, which documents the format of our lessons - and can be viewed at https\://carpentries.github.io/lesson-example. 
+ and can be viewed at https://carpentries.github.io/lesson-example. 3. If you wish to change the template used for workshop websites, - please work in https\://github.com/carpentries/workshop-template. + please work in https://github.com/carpentries/workshop-template. The home page of that repository explains how to set up workshop websites, - while the extra pages in https\://carpentries.github.io/workshop-template + while the extra pages in https://carpentries.github.io/workshop-template provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https\://github.com/carpentries/styles. + please work in https://github.com/carpentries/styles. ## What to Contribute @@ -136,29 +136,16 @@ which everyone is welcome to join. You can also [reach us by email][contact]. [contact]: mailto:admin@software-carpentry.org - [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry - [dc-lessons]: http://datacarpentry.org/lessons/ - [dc-site]: http://datacarpentry.org/ - [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss - [github]: http://github.com - [github-flow]: https://guides.github.com/introduction/flow/ - [github-join]: https://github.com/join - [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github - [issues]: https://github.com/swcarpentry/shell-novice/issues/ - [repo]: https://github.com/swcarpentry/shell-novice/ - [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry - [swc-lessons]: http://software-carpentry.org/lessons/ - [swc-site]: http://software-carpentry.org/ From f3c1d8c8f2bc6f18b7bc26a85c92c5b266905001 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:15 +0900 Subject: [PATCH 147/334] New translations license.md (French) --- locale/fr/LICENSE.md | 4 ---- 1 file changed, 4 deletions(-) diff --git 
a/locale/fr/LICENSE.md b/locale/fr/LICENSE.md index bc98317a1..696cc3ae1 100644 --- a/locale/fr/LICENSE.md +++ b/locale/fr/LICENSE.md @@ -76,11 +76,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. are registered trademarks of [Community Initiatives][ci]. [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ - [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode - [mit-license]: https://opensource.org/licenses/mit-license.html - [ci]: http://communityin.org/ - [osi]: https://opensource.org From 81133bf9e0d32783d0393d46091930aa463e1bd2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:16 +0900 Subject: [PATCH 148/334] New translations license.md (Spanish) --- locale/es/LICENSE.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/locale/es/LICENSE.md b/locale/es/LICENSE.md index bc98317a1..696cc3ae1 100644 --- a/locale/es/LICENSE.md +++ b/locale/es/LICENSE.md @@ -76,11 +76,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. are registered trademarks of [Community Initiatives][ci]. [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ - [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode - [mit-license]: https://opensource.org/licenses/mit-license.html - [ci]: http://communityin.org/ - [osi]: https://opensource.org From d8f95259595fe30fa165f02cdeb18841294c00f1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:18 +0900 Subject: [PATCH 149/334] New translations license.md (Japanese) --- locale/ja/LICENSE.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/locale/ja/LICENSE.md b/locale/ja/LICENSE.md index bc98317a1..696cc3ae1 100644 --- a/locale/ja/LICENSE.md +++ b/locale/ja/LICENSE.md @@ -76,11 +76,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. are registered trademarks of [Community Initiatives][ci]. 
 [cc-by-human]: https://creativecommons.org/licenses/by/4.0/
-
 [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
-
 [mit-license]: https://opensource.org/licenses/mit-license.html
-
 [ci]: http://communityin.org/
-
 [osi]: https://opensource.org

From 0cb74451a5cfe288eb72112f41d9186094de8069 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Sun, 12 May 2024 07:21:19 +0900
Subject: [PATCH 150/334] New translations license.md (Portuguese)

---
 locale/pt/LICENSE.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/locale/pt/LICENSE.md b/locale/pt/LICENSE.md
index bc98317a1..696cc3ae1 100644
--- a/locale/pt/LICENSE.md
+++ b/locale/pt/LICENSE.md
@@ -76,11 +76,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 are registered trademarks of [Community Initiatives][ci].

 [cc-by-human]: https://creativecommons.org/licenses/by/4.0/
-
 [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
-
 [mit-license]: https://opensource.org/licenses/mit-license.html
-
 [ci]: http://communityin.org/
-
 [osi]: https://opensource.org

From 1ab5ef6afd64c0ec22280bb6ed8f81eb57c910a5 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Sun, 12 May 2024 07:21:20 +0900
Subject: [PATCH 151/334] New translations license.md (Chinese Simplified)

---
 locale/zh/LICENSE.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/locale/zh/LICENSE.md b/locale/zh/LICENSE.md
index bc98317a1..696cc3ae1 100644
--- a/locale/zh/LICENSE.md
+++ b/locale/zh/LICENSE.md
@@ -76,11 +76,7 @@ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 are registered trademarks of [Community Initiatives][ci].
[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ - [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode - [mit-license]: https://opensource.org/licenses/mit-license.html - [ci]: http://communityin.org/ - [osi]: https://opensource.org From 2d4dd4dca45705c3e56eddefb50fc3e9be31a41e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:22 +0900 Subject: [PATCH 152/334] New translations readme.md (French) --- locale/fr/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/fr/README.md b/locale/fr/README.md index 8ab3d42f4..0ec628ff2 100644 --- a/locale/fr/README.md +++ b/locale/fr/README.md @@ -21,7 +21,7 @@ Project in Pro Git by Scott Chacon. Look for the tag -![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This indicates that the maintainers will welcome a pull request fixing this issue. @@ -60,7 +60,6 @@ A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) To cite this lesson, please consult with [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example - [cdh]: https://cdh.carpentries.org ## Testing locally @@ -72,4 +71,4 @@ sandpaper::serve() ``` For more details, see the [workbench installation -instructions](https\://carpentries.github.io/workbench/#installation]. +instructions](https://carpentries.github.io/workbench/#installation]. 
From 2213451c2bec84f2145e48fded82875e54ae653a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:23 +0900 Subject: [PATCH 153/334] New translations readme.md (Spanish) --- locale/es/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/es/README.md b/locale/es/README.md index 8ab3d42f4..0ec628ff2 100644 --- a/locale/es/README.md +++ b/locale/es/README.md @@ -21,7 +21,7 @@ Project in Pro Git by Scott Chacon. Look for the tag -![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This indicates that the maintainers will welcome a pull request fixing this issue. @@ -60,7 +60,6 @@ A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) To cite this lesson, please consult with [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example - [cdh]: https://cdh.carpentries.org ## Testing locally @@ -72,4 +71,4 @@ sandpaper::serve() ``` For more details, see the [workbench installation -instructions](https\://carpentries.github.io/workbench/#installation]. +instructions](https://carpentries.github.io/workbench/#installation]. From 1703f312e30f09069c6d23b75a38f38c398fb376 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:24 +0900 Subject: [PATCH 154/334] New translations readme.md (Japanese) --- locale/ja/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/ja/README.md b/locale/ja/README.md index 8ab3d42f4..0ec628ff2 100644 --- a/locale/ja/README.md +++ b/locale/ja/README.md @@ -21,7 +21,7 @@ Project in Pro Git by Scott Chacon. Look for the tag -![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). 
This indicates that the maintainers will welcome a pull request fixing this issue. @@ -60,7 +60,6 @@ A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) To cite this lesson, please consult with [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example - [cdh]: https://cdh.carpentries.org ## Testing locally @@ -72,4 +71,4 @@ sandpaper::serve() ``` For more details, see the [workbench installation -instructions](https\://carpentries.github.io/workbench/#installation]. +instructions](https://carpentries.github.io/workbench/#installation]. From fb95ec6a9c9ae7ee27647cd69ed2ca5bf419c6d2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:25 +0900 Subject: [PATCH 155/334] New translations readme.md (Portuguese) --- locale/pt/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/pt/README.md b/locale/pt/README.md index 8ab3d42f4..0ec628ff2 100644 --- a/locale/pt/README.md +++ b/locale/pt/README.md @@ -21,7 +21,7 @@ Project in Pro Git by Scott Chacon. Look for the tag -![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This indicates that the maintainers will welcome a pull request fixing this issue. @@ -60,7 +60,6 @@ A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) To cite this lesson, please consult with [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example - [cdh]: https://cdh.carpentries.org ## Testing locally @@ -72,4 +71,4 @@ sandpaper::serve() ``` For more details, see the [workbench installation -instructions](https\://carpentries.github.io/workbench/#installation]. +instructions](https://carpentries.github.io/workbench/#installation]. 
From 79847e764bde2f00ee2974b8c8e734955a95e774 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 07:21:27 +0900 Subject: [PATCH 156/334] New translations readme.md (Chinese Simplified) --- locale/zh/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/zh/README.md b/locale/zh/README.md index 8ab3d42f4..0ec628ff2 100644 --- a/locale/zh/README.md +++ b/locale/zh/README.md @@ -21,7 +21,7 @@ Project in Pro Git by Scott Chacon. Look for the tag -![good\_first\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This indicates that the maintainers will welcome a pull request fixing this issue. @@ -60,7 +60,6 @@ A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) To cite this lesson, please consult with [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example - [cdh]: https://cdh.carpentries.org ## Testing locally @@ -72,4 +71,4 @@ sandpaper::serve() ``` For more details, see the [workbench installation -instructions](https\://carpentries.github.io/workbench/#installation]. +instructions](https://carpentries.github.io/workbench/#installation]. From 41beac615d076809678d25bb805a59ead098ea82 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 12 May 2024 16:56:57 +0900 Subject: [PATCH 157/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index cfd7485b3..d0a0def8a 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -11,8 +11,8 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: 目的 - Describe what a `data.frame` is. 
-- Load external data from a .csv file into a data frame.
-- Summarize the contents of a data frame.
+- .csv ファイルからデータ フレームに外部データを読み込みましょう。
+- データフレームの内容を要約してみましょう。
 - Describe what a factor is.
 - Convert between strings and factors.
 - Reorder and rename factors.

From 95eee317784e63de469064c5d1aa995d760b628b Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Mon, 13 May 2024 03:18:03 +0900
Subject: [PATCH 158/334] New translations 25-starting-with-data.md (Japanese)

---
 locale/ja/episodes/25-starting-with-data.Rmd | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd
index d0a0def8a..2387ed393 100644
--- a/locale/ja/episodes/25-starting-with-data.Rmd
+++ b/locale/ja/episodes/25-starting-with-data.Rmd
@@ -10,14 +10,14 @@ exercises: 30

 ::::::::::::::::::::::::::::::::::::::: 目的

-- Describe what a `data.frame` is.
+- `data.frame` が何なのか説明してみましょう。
 - .csv ファイルからデータ フレームに外部データを読み込みましょう。
 - データフレームの内容を要約してみましょう。
 - Describe what a factor is.
-- Convert between strings and factors.
-- Reorder and rename factors.
-- Format dates.
-- Export and save data.
+- string と factor を変換してみましょう。
+- factor の並び替えとリネームを行ってみましょう。
+- 日付をフォーマットしてみましょう。
+- データをエクスポートして保存してみましょう。

 ::::::::::::::::::::::::::::::::::::::::::::::::::

From e6a4c6470e2f53210914e04f77646292dfea8ffc Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Mon, 13 May 2024 04:18:09 +0900
Subject: [PATCH 159/334] New translations 10-data-organisation.md (Japanese)

---
 locale/ja/episodes/10-data-organisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd
index 6e6dfacd7..c41dcb618 100644
--- a/locale/ja/episodes/10-data-organisation.Rmd
+++ b/locale/ja/episodes/10-data-organisation.Rmd
@@ -8,7 +8,7 @@ exercises: 30
 ```{r, include=FALSE}
 ```

-::::::::::::::::::::::::::::::::::::::: 目的
+::::::::::::::::::::::::::::::::::::::: objectives

 - スプレッドシートとその長所と短所について学びます。
 - データを効果的に使用するには、スプレッドシート内のデータをどのようにフォーマットすればよいでしょうか?

From 4bb319da43cb4b6302ac5834f1d3bb0e1af3daef Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Mon, 13 May 2024 04:18:51 +0900
Subject: [PATCH 160/334] New translations 60-next-steps.md (Japanese)

---
 locale/ja/episodes/60-next-steps.Rmd | 344 +++++++++++++--------------
 1 file changed, 169 insertions(+), 175 deletions(-)

diff --git a/locale/ja/episodes/60-next-steps.Rmd b/locale/ja/episodes/60-next-steps.Rmd
index d2b2d186e..a1fe83aee 100644
--- a/locale/ja/episodes/60-next-steps.Rmd
+++ b/locale/ja/episodes/60-next-steps.Rmd
@@ -1,6 +1,6 @@
 ---
 source: Rmd
-title: Next steps
+title: 次のステップ
 teaching: 45
 exercises: 45
 ---
@@ -8,112 +8,109 @@ exercises: 45
 ```{r, include=FALSE}
 ```

-::::::::::::::::::::::::::::::::::::::: 目的
+::::::::::::::::::::::::::::::::::::::: objectives

-- Introduce the Bioconductor project.
-- Introduce the notion of data containers.
-- Give an overview of the `SummarizedExperiment`, extensively used in
- omics analyses.
+- Bioconductorプロジェクトを紹介してみましょう。 +- データコンテナの概念を紹介してみましょう。 +- オミックス解析で多用される `SummarizedExperiment` の概要を説明する。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a `SummarizedExperiment`? -- What is Bioconductor? +- `SummarizedExperiment` とは何でしょうか? +- Bioconductor と何でしょうか? :::::::::::::::::::::::::::::::::::::::::::::::::: -## Next steps +## 次のステップ ```{r, echo=FALSE, message=FALSE} library("tidyverse") ``` -Data in bioinformatics is often complex. To deal with this, -developers define specialised data containers (termed classes) that -match the properties of the data they need to handle. +バイオインフォマティクスのデータはしばしば複雑です。 これに対処するため、 +開発者は、扱う必要のあるデータのプロパティに +マッチする、特別なデータコンテナ(クラスと呼ばれる)を定義する。 -This aspect is central to the **Bioconductor**[^Bioconductor] project -which uses the same **core data infrastructure** across packages. This -certainly contributed to Bioconductor's success. Bioconductor package -developers are advised to make use of existing infrastructure to -provide coherence, interoperability, and stability to the project as a -whole. +この側面は、パッケージ間で同じ**コア・データ・インフラ**を使用する**バイオコンダクター**\[^バイオコンダクター]プロジェクト +。 この +、Bioconductorの成功に貢献したことは間違いない。 Bioconductor パッケージ +開発者は、 +プロジェクト全体に一貫性、相互運用性、安定性を提供するために、既存のインフラストラクチャを利用することをお勧めします +。 -[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was - initiated by Robert Gentleman, one of the two creators of the R - language. Bioconductor provides tools dedicated to omics data - analysis. Bioconductor uses the R statistical programming language - and is open source and open development. +[^Bioconductor]: Bioconductor](https://www.bioconductor.org)は、 + 、 + R言語の生みの親の一人であるロバート・ジェントルマンによって始められた。 Bioconductorは、オミックスデータ + 分析に特化したツールを提供している。 Bioconductorは統計プログラミング言語R( + )を使用しており、オープンソース、オープン開発である。 -To illustrate such an omics data container, we'll present the -`SummarizedExperiment` class. 
+このようなオミックス・データ・コンテナを説明するために、 +`SummarizedExperiment`クラスを紹介する。 -## SummarizedExperiment +## 実験概要 -The figure below represents the anatomy of the SummarizedExperiment class. +下図は、SummarizedExperimentクラスの構造を表しています。 ```{r SE, echo=FALSE, out.width="80%"} knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") ``` -Objects of the class SummarizedExperiment contain : +SummarizedExperimentクラスのオブジェクトには、: -- **One (or more) assay(s)** containing the quantitative omics data - (expression data), stored as a matrix-like object. Features (genes, - transcripts, proteins, ...) are defined along the rows, and samples - along the columns. +- \*\*定量的オミックスデータ + (発現データ)を含む1つ(または複数)のアッセイ \*\*、マトリックス状のオブジェクトとして格納されている。 特徴(遺伝子、 + 転写物、タンパク質、...) は行に沿って定義され、 + は列に沿って定義される。 -- A **sample metadata** slot containing sample co-variates, stored as a - data frame. Rows from this table represent samples (rows match exactly the - columns of the expression data). +- データフレームとして格納された、サンプルの共変量を含む **sample metadata** スロット。 この表の行はサンプルを表す(行は発現データの + 列と正確に一致する)。 -- A **feature metadata** slot containing feature co-variates, stored as - a data frame. The rows of this data frame match exactly the rows of the - expression data. +- データフレームとして格納される、特徴共変量を含む **特徴メタデータ** スロット。 このデータフレームの行は、 + 式データの行と完全に一致する。 -The coordinated nature of the `SummarizedExperiment` guarantees that -during data manipulation, the dimensions of the different slots will -always match (i.e the columns in the expression data and then rows in -the sample metadata, as well as the rows in the expression data and -feature metadata) during data manipulation. For example, if we had to -exclude one sample from the assay, it would be automatically removed -from the sample metadata in the same operation. 
+SummarizedExperiment\`の調整された性質は、データ操作中に +、異なるスロットの次元が +、常に一致することを保証する(すなわち、発現データの列と +サンプルメタデータの行、および発現データと +特徴メタデータの行)。 例えば、 +、アッセイから1つのサンプルを除外しなければならない場合、同じ操作でサンプルメタデータから +、自動的に除外される。 -The metadata slots can grow additional co-variates -(columns) without affecting the other structures. +メタデータ・スロットは、他の構造に影響を与えることなく、 +(カラム)の共変数を追加で増やすことができる。 -### Creating a SummarizedExperiment +### SummarizedExperimentの作成 -In order to create a `SummarizedExperiment`, we will create the -individual components, i.e the count matrix, the sample and gene -metadata from csv files. These are typically how RNA-Seq data are -provided (after raw data have been processed). +SummarizedExperiment\`を作成するために、 +の各コンポーネント、すなわちカウントマトリックス、サンプル、遺伝子 +のメタデータをcsvファイルから作成する。 これらは通常、RNA-Seqデータが +(生データが処理された後)提供される方法である。 ```{r, echo=FALSE, message=FALSE} rna <- read_csv("data/rnaseq.csv") ## count matrix -counts <- rna %>% +counts<- rna %>% select(gene, sample, expression) %>% pivot_wider(names_from = sample, values_from = expression) -## convert to matrix and set row names -count_matrix <- counts %>% select(-gene) %>% as.matrix() +## matrix に変換して行名を設定 +count_matrix<- counts %>% select(-gene) %>% as.matrix() rownames(count_matrix) <- counts$gene ## sample annotation -sample_metadata <- rna %>% +sample_metadata<- rna %>% select(sample, organism, age, sex, infection, strain, time, tissue, mouse) ## remove redundancy sample_metadata <- unique(sample_metadata) ## gene annotation -gene_metadata <- rna %>% - select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, +gene_metadata<- rna %>% + select(gene、ENTREZID, product, ensembl_gene_id, external_synonym, chromosome_name, gene_biotype, phenotype_description, hsapiens_homolog_associated_gene_name) @@ -126,10 +123,10 @@ write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) ``` -- **An expression matrix**: we load the count matrix, specifying that - 
the first columns contains row/gene names, and convert the - `data.frame` to a `matrix`. You can download it - [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). +- **An expression matrix**: カウント行列をロードし、 + 最初の列が行/遺伝子名を含むことを指定し、 + `data.frame` を `matrix` に変換する。 ダウンロードは + [こちら](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv)。 ```{r} count_matrix <- read.csv("data/count_matrix.csv", @@ -140,8 +137,8 @@ count_matrix[1:5, ] dim(count_matrix) ``` -- **A table describing the samples**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). +- **サンプルを説明する表**、 + [こちら](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv)。 ```{r} sample_metadata <- read.csv("data/sample_metadata.csv") @@ -149,8 +146,8 @@ sample_metadata dim(sample_metadata) ``` -- **A table describing the genes**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). +- **遺伝子を説明する表**、 + [こちら](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv)。 ```{r} gene_metadata <- read.csv("data/gene_metadata.csv") @@ -158,27 +155,27 @@ gene_metadata[1:10, 1:4] dim(gene_metadata) ``` -We will create a `SummarizedExperiment` from these tables: +これらのテーブルから `SummarizedExperiment` を作成する: -- The count matrix that will be used as the **`assay`** +- として使用されるカウント行列。 -- The table describing the samples will be used as the **sample - metadata** slot +- サンプルを記述したテーブルは、**サンプル + メタデータ**スロットとして使用される。 -- The table describing the genes will be used as the **features - metadata** slot +- 遺伝子を記述したテーブルは、**features + メタデータ**スロットとして使用される。 -To do this we can put the different parts together using the -`SummarizedExperiment` constructor: +これを行うには、 +`SummarizedExperiment` コンストラクタを使って、異なるパーツをまとめることができる: ```{r, message=FALSE, warning=FALSE} ## BiocManager::install("SummarizedExperiment") library("SummarizedExperiment") ``` -First, we make sure that the samples are 
in the same order in the -count matrix and the sample annotation, and the same for the genes in -the count matrix and the gene annotation. +まず、 +カウントマトリックスとサンプルアノテーションにおいて、サンプルの順番が同じであることを確認する。また、 +カウントマトリックスと遺伝子アノテーションにおいて、遺伝子の順番が同じであることを確認する。 ```{r} stopifnot(rownames(count_matrix) == gene_metadata$gene) @@ -192,20 +189,20 @@ se <- SummarizedExperiment(assays = list(counts = count_matrix), se ``` -### Saving data +### データの保存 -Exporting data to a spreadsheet, as we did in a previous episode, has -several limitations, such as those described in the first chapter -(possible inconsistencies with `,` and `.` for decimal separators and -lack of variable type definitions). Furthermore, exporting data to a -spreadsheet is only relevant for rectangular data such as dataframes -and matrices. +以前のエピソードで行ったように、データをスプレッドシートにエクスポートするには、 +、第1章 +(小数点以下の区切り文字に`,`と`.`を使った場合の不整合の可能性、 +変数型の定義の欠如)で説明したようないくつかの制限がある。 さらに、 +スプレッドシートへのデータエクスポートは、データフレーム +や行列のような長方形のデータにのみ関係する。 -A more general way to save data, that is specific to R and is -guaranteed to work on any operating system, is to use the `saveRDS` -function. Saving objects like this will generate a binary -representation on disk (using the `rds` file extension here), which -can be loaded back into R using the `readRDS` function. +データを保存するより一般的な方法は、Rに特有であり、 +どのオペレーティングシステムでも動作することが保証されている `saveRDS` +関数を使用することである。 このようにオブジェクトを保存すると、ディスク上にバイナリ +表現が生成されます(ここでは `rds` ファイル拡張子を使用します)。 +`readRDS` 関数を使用して R にロードし直すことができます。 ```{r, eval=FALSE} saveRDS(se, file = "data_output/se.rds") @@ -214,41 +211,41 @@ se <- readRDS("data_output/se.rds") head(se) ``` -To conclude, when it comes to saving data from R that will be loaded -again in R, saving and loading with `saveRDS` and `readRDS` is the -preferred approach. If tabular data need to be shared with somebody -that is not using R, then exporting to a text-based spreadsheet is a -good alternative. 
+結論として、Rからデータを保存し、 +Rで再度ロードする場合、`saveRDS`と`readRDS`で保存とロードを行うのが +。 表形式のデータを、Rを使用していない誰か( +)と共有する必要がある場合は、テキストベースのスプレッドシートにエクスポートするのが、 +良い選択肢である。 -Using this data structure, we can access the expression matrix with -the `assay` function: +このデータ構造を使って、 +`assay`関数で発現行列にアクセスすることができる: ```{r} head(assay(se)) dim(assay(se)) ``` -We can access the sample metadata using the `colData` function: +colData\`関数を使ってサンプルのメタデータにアクセスすることができる: ```{r} colData(se) dim(colData(se)) ``` -We can also access the feature metadata using the `rowData` function: +また、`rowData`関数を使ってフィーチャーのメタデータにアクセスすることもできる: ```{r} head(rowData(se)) dim(rowData(se)) ``` -### Subsetting a SummarizedExperiment +### SummarizedExperimentをサブセットする -SummarizedExperiment can be subset just like with data frames, with -numerics or with characters of logicals. +SummarizedExperiment は、データフレームと同じように、 +数値または論理の文字でサブセットできる。 -Below, we create a new instance of class SummarizedExperiment that -contains only the 5 first features for the 3 first samples. +以下では、 +、3つの最初のサンプルの5つの最初の特徴のみを含む、SummarizedExperimentクラスの新しいインスタンスを作成します。 ```{r} se1 <- se[1:5, 1:3] @@ -260,10 +257,10 @@ colData(se1) rowData(se1) ``` -We can also use the `colData()` function to subset on something from -the sample metadata or the `rowData()` to subset on something from the -feature metadata. For example, here we keep only miRNAs and the non -infected samples: +また、`colData()` 関数を使用して、 +サンプルメタデータから何かをサブセットしたり、`rowData()` 関数を使用して、 +フィーチャーメタデータから何かをサブセットすることもできます。 例えば、ここではmiRNAと +に感染していないサンプルだけを残している: ```{r} se1 <- se[rowData(se)$gene_biotype == "miRNA", @@ -290,20 +287,20 @@ function.--> ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Extract the gene expression levels of the 3 first genes in samples -at time 0 and at time 8. 
+時刻0と時刻8のサンプル +、最初の3遺伝子の遺伝子発現レベルを抽出する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, purl=FALSE} assay(se)[1:3, colData(se)$time != 4] -# Equivalent to -assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] +# +と等価 assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8]. ``` ::::::::::::::::::::::::: @@ -312,17 +309,17 @@ assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Verify that you get the same values using the long `rna` table. +長い`rna`テーブルを使用して同じ値が得られることを確認する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, purl=FALSE} rna |> - filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> + filter(gene %in% c("Asl", "Apod", "Cyd2d22"))|> filter(time != 4) |> select(expression) ``` @@ -330,40 +327,39 @@ rna |> :::::::::::::::::::::::::::::::::::::::::::::::::: -The long table and the `SummarizedExperiment` contain the same -information, but are simply structured differently. Each approach has its -own advantages: the former is a good fit for the `tidyverse` packages, -while the latter is the preferred structure for many bioinformatics and -statistical processing steps. For example, a typical RNA-Seq analyses using -the `DESeq2` package. +長いテーブルと`SummarizedExperiment`は同じ +情報を含むが、単に構造が異なるだけである。 各アプローチにはそれぞれ +独自の利点がある。前者は `tidyverse` パッケージに適しており、 +一方、後者は多くのバイオインフォマティクスと +統計処理ステップに適した構造である。 例えば、 +`DESeq2`パッケージを使用した典型的なRNA-Seq分析である。 -#### Adding variables to metadata +#### メタデータに変数を追加する -We can also add information to the metadata. -Suppose that you want to add the center where the samples were collected... +メタデータに情報を追加することもできる。 +サンプルが採取されたセンターを追加したいとします... ```{r} colData(se)$center <- rep("University of Illinois", nrow(colData(se))) colData(se) ``` -This illustrates that the metadata slots can grow indefinitely without -affecting the other structures! +これは、メタデータ・スロットが、 +、他の構造に影響を与えることなく、無限に成長できることを示している! 
### tidySummarizedExperiment -You may be wondering, can we use tidyverse commands to interact with -`SummarizedExperiment` objects? The answer is yes, we can with the -`tidySummarizedExperiment` package. +`SummarizedExperiment` オブジェクトを操作するために tidyverse コマンドを使うことはできるのだろうか? +`tidySummarizedExperiment` パッケージを使えば可能です。 -Remember what our SummarizedExperiment object looks like: +SummarizedExperimentオブジェクトがどのようなものか思い出してください: ```{r, message=FALSE} -se +シー ``` -Load `tidySummarizedExperiment` and then take a look at the se object -again. +tidySummarizedExperiment\`をロードし、seオブジェクト +。 ```{r, message=FALSE} #BiocManager::install("tidySummarizedExperiment") @@ -372,52 +368,51 @@ library("tidySummarizedExperiment") se ``` -It's still a `SummarizedExperiment` object, so maintains the efficient -structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`\-`tibble` -abstraction. We can also see in the second line of the output the -number of transcripts and samples. +これはまだ`SummarizedExperiment`オブジェクトなので、効率的な +構造を維持しているが、これでティブルとして見ることができる。 +の最初の行に注目してほしい。出力にはこう書いてあるが、これは `SummarizedExperiment`-`tibble` +の抽象化である。 また、出力の2行目には、 +のトランスクリプトとサンプルの数を見ることができる。 -If we want to revert to the standard `SummarizedExperiment` view, we -can do that. +標準の`SummarizedExperiment`ビューに戻したい場合は、 +。 ```{r} options("restore_SummarizedExperiment_show" = TRUE) se ``` -But here we use the tibble view. +しかし、ここではティブル・ビューを使う。 ```{r} options("restore_SummarizedExperiment_show" = FALSE) se ``` -We can now use tidyverse commands to interact with the -`SummarizedExperiment` object. +`SummarizedExperiment` オブジェクトと対話するために、tidyverse コマンドを使用できるようになりました。 -We can use `filter` to filter for rows using a condition e.g. to view -all rows for one sample. +filter\`を使用すると、条件を使って行をフィルタリングすることができる。例えば、 +、あるサンプルのすべての行を表示することができる。 ```{r} se %>% filter(.sample == "GSM2545336") ``` -We can use `select` to specify columns we want to view. 
+select\`を使って表示したいカラムを指定することができる。 ```{r} se %>% select(.sample) ``` -We can use `mutate` to add metadata info. +mutate\`を使ってメタデータ情報を追加することができる。 ```{r} -se %>% mutate(center = "Heidelberg University") +se %>% mutate(center = "ハイデルベルク大学") ``` -We can also combine commands with the tidyverse pipe `%>%`. For -example, we could combine `group_by` and `summarise` to get the total -counts for each sample. +tidyverseパイプ `%>%` を使ってコマンドを組み合わせることもできます。 +の例では、`group_by` と `summarise` を組み合わせて、各サンプルの +カウントの合計を得ることができる。 ```{r} se %>% @@ -425,40 +420,39 @@ se %>% summarise(total_counts=sum(counts)) ``` -We can treat the tidy SummarizedExperiment object as a normal tibble -for plotting. +整頓されたSummarizedExperimentオブジェクトを、プロット用の通常のtibble +として扱うことができる。 -Here we plot the distribution of counts per sample. +ここでは、サンプルごとのカウント数分布をプロットしている。 ```{r tidySE-plot} se %>% - ggplot(aes(counts + 1, group=.sample, color=infection)) + + ggplot(aes(counts + 1, group=.sample, color=infection))+ geom_density() + scale_x_log10() + theme_bw() ``` -For more information on tidySummarizedExperiment, see the package -website -[here](https://stemangiola.github.io/tidySummarizedExperiment/). +tidySummarizedExperimentの詳細については、パッケージ +ウェブサイト +[こちら](https://stemangiola.github.io/tidySummarizedExperiment/)を参照してください。 -**Take-home message** +\*\*テイクホーム・メッセージ -- `SummarizedExperiment` represents an efficient way to store and - handle omics data. +- SummarizedExperiment\`は、オミックスデータを効率的に保存し、 + 。 -- They are used in many Bioconductor packages. +- これらは多くのBioconductorパッケージで使用されている。 -If you follow the next training focused on RNA sequencing analysis, -you will learn to use the Bioconductor `DESeq2` package to do some -differential expression analyses. The whole analysis of the `DESeq2` -package is handled in a `SummarizedExperiment`. 
+RNAシーケンス解析に焦点を当てた次のトレーニング、 +、Bioconductor `DESeq2`パッケージを使って、 +差分発現解析を行う方法を学ぶ。 DESeq2`パッケージの全解析は`SummarizedExperiment\` で処理される。 -:::::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント -- Bioconductor is a project provide support and packages for the - comprehension of high high-throughput biology data. -- A `SummarizedExperiment` is a type of object useful to store and - manage high-throughput omics data. +- Bioconductorは、ハイスループットな生物学データの理解( + )のためのサポートとパッケージを提供するプロジェクトである。 +- SummarizedExperiment\`は、ハイスループットのオミックスデータを保存し、 + 管理するのに便利なオブジェクトの一種である。 :::::::::::::::::::::::::::::::::::::::::::::::::: From 4e05b87ef1c21cd5c850d4931d6804bd12a537bc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 05:14:17 +0900 Subject: [PATCH 161/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 907 ++++++++++++------------ 1 file changed, 447 insertions(+), 460 deletions(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 9a926ff53..39ed69b48 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Data visualization +title: データの可視化 teaching: 60 exercises: 60 --- @@ -13,17 +13,17 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai ::::::::::::::::::::::::::::::::::::::: 目的 -- Produce scatter plots, boxplots, line plots, etc. using ggplot. -- Set universal plot settings. -- Describe what faceting is and apply faceting in ggplot. -- Modify the aesthetics of an existing ggplot plot (including axis labels and color). -- Build complex and customized plots from data in a data frame. 
+- ggplotを使って散布図、箱ひげ図、折れ線グラフなどを作成する。 +- ユニバーサルプロット設定を行う。 +- ファセットとは何かを説明し、ggplotでファセットを適用する。 +- 既存のggplotプロットの美学(軸ラベルや色を含む)を修正する。 +- データフレーム内のデータから複雑でカスタマイズされたプロットを作成。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- Visualization in R +- Rによる可視化 :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -31,147 +31,144 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai rna <- read.csv("data/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> このエピソードは、Data Carpentriesの_Data Analysis and +> Visualisation in R for Ecologists_レッスンに基づいています。 -## Data Visualization +## データの可視化 -We start by loading the required packages. **`ggplot2`** is included in -the **`tidyverse`** package. +必要なパッケージをロードすることから始める。 \*\*ggplot2`**は、 **tidyverse`\*\*パッケージに含まれています。 ```{r load-package, message=FALSE, purl=TRUE} library("tidyverse") ``` -If not still in the workspace, load the data we saved in the previous -lesson. +ワークスペースにまだない場合は、前回の +レッスンで保存したデータをロードします。 ```{r load-data, eval=FALSE, purl=TRUE} rna <- read.csv("data/rnaseq.csv") ``` -The Data Visualization Cheat -Sheet -will cover the basics and more advanced features of `ggplot2` and will -help, in addition to serve as a reminder, getting an overview of the -many data representations available in the package. The following video -tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and -[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen -are also very instructive. 
+Data Visualization Cheat +Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf) +は `ggplot2` の基本からより高度な機能までをカバーし、 +パッケージで利用可能な多くのデータ表現 +の概要を理解するためのリマインダとしてだけでなく、手助けになるでしょう。 Thomas Lin Pedersen氏による以下のビデオ +チュートリアル([パート1](https://www.youtube.com/watch?v=h29g21z0a68)と +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) +も非常に参考になる。 -## Plotting with `ggplot2` +## ggplot2\`によるプロット -`ggplot2` is a plotting package that makes it simple to create complex -plots from data in a data frame. It provides a more programmatic -interface for specifying what variables to plot, how they are displayed, -and general visual properties. The theoretical foundation that supports -the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this -approach, we only need minimal changes if the underlying data change or -if we decide to change from a bar plot to a scatterplot. This helps in -creating publication quality plots with minimal amounts of adjustments -and tweaking. +ggplot2`は、データフレーム内のデータから複雑な +プロットを簡単に作成できるプロットパッケージである。 どの変数をプロットするか、どのように表示するか、 +、一般的なビジュアル・プロパティを指定するための、よりプログラム的な +。 +`ggplot2\`を支える理論的基盤は、_Grammar of Graphics_ (@Wilkinson:2005)である。 この +アプローチを使用すると、基礎となるデータが変更された場合、または棒グラフから散布図に変更することを決めた場合に +、最小限の変更で済む。 これは、 +、最小限の調整で出版品質のプロットを作成するのに役立ちます +、微調整。 -There is a book about `ggplot2` (@ggplot2book) that provides a good -overview, but it is outdated. The 3rd edition is in preparation and will -be [freely available online](https://ggplot2-book.org/). The `ggplot2` -webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. +ggplot2`に関する本(@ggplot2book)があり、 +。 第3版は現在準備中で、 [オンラインで自由に利用できる](https://ggplot2-book.org/)。 +の `ggplot2\` ウェブページ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org))に十分なドキュメントがあります。 -`ggplot2` functions like data in the 'long' format, i.e., a column for -every dimension, and a row for every observation. 
Well-structured data -will save you lots of time when making figures with `ggplot2`. +ggplot2`の関数は、'long'形式のデータ、つまり、 +すべての次元を表す列と、すべてのオブザベーションを表す行を持つ。 よく構造化されたデータ +、 `ggplot2\`で図を作成する時間を大幅に節約できる。 -ggplot graphics are built step by step by adding new elements. Adding -layers in this fashion allows for extensive flexibility and -customization of plots. +ggplotグラフィックスは、新しい要素を追加することによって段階的に構築される。 この方法で +レイヤーを追加すると、プロットの広範な柔軟性と +カスタマイズが可能になる。 -> The idea behind the Grammar of Graphics it is that you can build every -> graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] +> Grammar of Graphicsの背後にある考え方は、同じ3つのコンポーネントからすべての +> グラフを構築できるということである:(1)データセット、(2)座標系、 +> 、(3)ジオム、つまりデータ点を表す視覚的マーク [^three_comp_ggplot2] である。 -[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). +[^three_comp_ggplot2]: 出典[データ可視化チートシート](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). -To build a ggplot, we will use the following basic template that can be -used for different types of plots: +ggplotを構築するために、 +、さまざまなタイプのプロットに使用できる以下の基本テンプレートを使用する: ``` -ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +ggplot(data =<DATA>, mapping = aes(<MAPPINGS>)) +<GEOM_FUNCTION>() ``` -- use the `ggplot()` function and bind the plot to a specific **data - frame** using the `data` argument +- ggplot()`関数を使用し、`data\`引数を使用して特定の**data + frame**にプロットをバインドする。 ```{r, eval=FALSE} ggplot(data = rna) ``` -- define a **mapping** (using the aesthetic (`aes`) function), by - selecting the variables to be plotted and specifying how to present - them in the graph, e.g. as x/y positions or characteristics such as - size, shape, color, etc. 
+- (`aes`)関数を使用して)**マッピング**を定義する。 + 、プロットする変数を選択し、 + 、x/yの位置や、 + 、サイズ、形、色などの特徴として、グラフでどのように表示するかを指定する。 ```{r, eval=FALSE} ggplot(data = rna, mapping = aes(x = expression)) ``` -- add '**geoms**' - geometries, or graphical representations of the - data in the plot (points, lines, bars). `ggplot2` offers many - different geoms; we will use some common ones today, including: +- 追加 '**geoms**' - ジオメトリ、つまりプロット内の + データのグラフ表現(点、線、棒)。 ggplot2\`は多くの + 、様々なジオムを提供している: ``` - * `geom_point()` for scatter plots, dot plots, etc. - * `geom_histogram()` for histograms - * `geom_boxplot()` for, well, boxplots! - * `geom_line()` for trend lines, time series, etc. + * 散布図やドットプロットなどには `geom_point()` を使用する。 + * + * `geom_boxplot()` ボックスプロット! + * トレンドライン、時系列など。 ``` -To add a geom(etry) to the plot use the `+` operator. Let's use -`geom_histogram()` first: +プロットにgeom(etry)を追加するには `+` 演算子を使います。 +`geom_histogram()` をまず使ってみよう: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, mapping = aes(x = expression)) + +ggplot(data = rna, mapping = aes(x = expression))+ geom_histogram() ``` -The `+` in the `ggplot2` package is particularly useful because it -allows you to modify existing `ggplot` objects. 
This means you can -easily set up plot templates and conveniently explore different types of -plots, so the above plot can also be generated with code like this: +ggplot2`パッケージの`+`は特に便利で、 +、既存の`ggplot\`オブジェクトを修正することができる。 つまり、 +、プロット・テンプレートを簡単に設定し、 +、さまざまなタイプのプロットを便利に調べることができる: ```{r, eval=FALSE, purl=TRUE} -# Assign plot to a variable +# プロットを変数に代入 rna_plot <- ggplot(data = rna, mapping = aes(x = expression)) -# Draw the plot +# プロットを描く rna_plot + geom_histogram() ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -You have probably noticed an automatic message that appears when -drawing the histogram: +ヒストグラムを描画するときに表示される自動メッセージにお気づきでしょう: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna, aes(x = expression)) + +ggplot(rna, aes(x = expression))+ geom_histogram() ``` -Change the arguments `bins` or `binwidth` of `geom_histogram()` to -change the number or width of the bins. +geom_histogram()`の引数`bins`または`binwidth\` を変更して、 +ビンの数または幅を変更する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, purl=TRUE} -# change bins -ggplot(rna, aes(x = expression)) + +# ビンを変更する +ggplot(rna, aes(x = expression))+ geom_histogram(bins = 15) -# change binwidth -ggplot(rna, aes(x = expression)) + +# binwidth を変更 +ggplot(rna, aes(x = expression))+ geom_histogram(binwidth = 2000) ``` @@ -179,43 +176,42 @@ ggplot(rna, aes(x = expression)) + :::::::::::::::::::::::::::::::::::::::::::::::::: -We can observe here that the data are skewed to the right. We can apply -log2 transformation to have a more symmetric distribution. Note that we -add here a small constant value (`+1`) to avoid having `-Inf` values -returned for expression values equal to 0. 
+データが右に偏っていることがわかる。 より対称的な分布にするために、 +log2変換を適用することができる。 ここでは、 +、0に等しい式の値に対して返される`-Inf`値 +を避けるために、小さな定数値(`+1`)を追加していることに注意してください。 ```{r log-transfo, cache=FALSE, purl=TRUE} -rna <- rna %>% +rna<- rna %>% mutate(expression_log = log2(expression + 1)) ``` -If we now draw the histogram of the log2-transformed expressions, the -distribution is indeed closer to a normal distribution. +ここで対数変換した式のヒストグラムを描いてみると、 +の分布は確かに正規分布に近くなっている。 ```{r second-ggplot, cache=FALSE, purl=TRUE} -ggplot(rna, aes(x = expression_log)) + geom_histogram() +ggplot(rna, aes(x = expression_log))+ geom_histogram() ``` -From now on we will work on the log-transformed expression values. +これからは対数変換した発現値を扱うことにする。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Another way to visualize this transformation is to consider the scale -of the observations. For example, it may be worth changing the scale -of the axis to better distribute the observations in the space of the -plot. Changing the scale of the axes is done similarly to -adding/modifying other components (i.e., by incrementally adding -commands). Try making this modification: +この変換を視覚化するもう1つの方法は、オブザベーションのスケール +を考えることである。 たとえば、 +プロットの空間でオブザベーションをよりよく分布させるために、軸のスケール +を変更する価値があるかもしれません。 軸のスケールの変更は、 +他のコンポーネントの追加/変更と同様に(すなわち、 +コマンドをインクリメンタルに追加することによって)行われます。 このように変更してみてほしい: -- Represent the un-transformed expression on the log10 scale; see - `scale_x_log10()`. Compare it with the previous graph. Why do you - now have warning messages appearing? +- `scale_x_log10()` を参照。 前のグラフと比較してみよう。 + 、警告メッセージが表示されるようになったのはなぜですか? 
-::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=TRUE, purl=TRUE, echo=TRUE} ggplot(data = rna,mapping = aes(x = expression))+ @@ -227,45 +223,45 @@ ggplot(data = rna,mapping = aes(x = expression))+ :::::::::::::::::::::::::::::::::::::::::::::::::: -**Notes** +**注**\*。 -- Anything you put in the `ggplot()` function can be seen by any geom - layers that you add (i.e., these are global plot settings). This - includes the x- and y-axis mapping you set up in `aes()`. -- You can also specify mappings for a given geom independently of the - mappings defined globally in the `ggplot()` function. -- The `+` sign used to add new layers must be placed at the end of the - line containing the _previous_ layer. If, instead, the `+` sign is - added at the beginning of the line containing the new layer, - `ggplot2` will not add the new layer and will return an error - message. +- ggplot()`関数で設定したものはすべて、あなたが追加したgeom + レイヤーで見ることができます(つまり、これらはグローバルなプロット設定です)。 この + には、`aes()\`で設定したx軸とy軸のマッピングが含まれる。 +- また、`ggplot()`関数でグローバルに定義された + マッピングとは別に、与えられたジオムに対するマッピングを指定することもできます。 +- 新しいレイヤーを追加するために使われる`+`記号は、_前の_レイヤーを含む + 行の最後に置かなければなりません。 その代わりに、`+`記号が + 新しいレイヤーを含む行の先頭に追加された場合、 + `ggplot2` は新しいレイヤーを追加せず、エラー + メッセージを返す。 ```{r, eval=FALSE} -# This is the correct syntax for adding layers +# これはレイヤーを追加するための正しい構文です rna_plot + geom_histogram() -# This will not add the new layer and will return an error message +# これは新しいレイヤーを追加せず、エラーメッセージを返します rna_plot + geom_histogram() ``` -## Building your plots iteratively +## 反復的にプロットを構築する -We will now draw a scatter plot with two continuous variables and the -`geom_point()` function. This graph will represent the log2 fold changes -of expression comparing time 8 versus time 0, and time 4 versus time 0. 
-To this end, we first need to compute the means of the log-transformed -expression values by gene and time, then the log fold changes by -subtracting the mean log expressions between time 8 and time 0 and -between time 4 and time 0. Note that we also include here the gene -biotype that we will use later on to represent the genes. We will save -the fold changes in a new data frame called `rna_fc.` +ここでは、2つの連続変数と +`geom_point()`関数を使って散布図を描きます。 このグラフは、時間8と時間0、時間4と時間0を比較した発現のlog2倍変化 +。 +この目的のために、まず対数変換した +発現値の平均値を遺伝子ごと、時間ごとに計算する必要がある。次に、 +、時間8と時間0の間の平均対数発現と、 +、時間4と時間0の間の平均対数発現を差し引くことにより、対数倍変化を計算する。 ここでは、後で遺伝子を表現するために使用する遺伝子 +バイオタイプも含めていることに注意されたい。 +フォールドの変化を `rna_fc.` という新しいデータフレームに保存する。 ```{r rna_fc, cache=FALSE, purl=TRUE} -rna_fc <- rna %>% select(gene, time, +rna_fc<- rna %>% select(gene, time, gene_biotype, expression_log) %>% - group_by(gene, time, gene_biotype) %>% + group_by(gene, time, gene_biotype) %>% summarize(mean_exp = mean(expression_log)) %>% pivot_wider(names_from = time, values_from = mean_exp) %>% @@ -273,35 +269,35 @@ rna_fc <- rna %>% select(gene, time, ``` -We can then build a ggplot with the newly created dataset `rna_fc`. -Building plots with `ggplot2` is typically an iterative process. We -start by defining the dataset we'll use, lay out the axes, and choose a -geom: +新しく作成されたデータセット `rna_fc` を使って ggplot を作成することができる。 +`ggplot2`でプロットを作成するのは、通常、反復的なプロセスである。 +まず、使用するデータセットを定義し、軸を配置し、 +ジオムを選択する: ```{r create-ggplot-object, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point() ``` -Then, we start modifying this plot to extract more information from it. 
-For instance, we can add transparency (`alpha`) to avoid overplotting: +次に、このプロットからより多くの情報を抽出するために、プロットを修正し始める。 +例えば、オーバープロットを避けるために透明度(`alpha`)を加えることができる: ```{r adding-transparency, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3) ``` -We can also add colors for all the points: +また、すべてのポイントに色を付けることもできる: ```{r adding-colors, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3, color = "blue") ``` -Or to color each gene in the plot differently, you could use a vector as -an input to the argument **color**. `ggplot2` will provide a different -color corresponding to different values in the vector. Here is an -example where we color with `gene_biotype`: +あるいは、プロット中の各遺伝子を異なる色にするために、 +、引数**color**の入力としてベクトルを使うこともできる。 `ggplot2`は、ベクトルの異なる値に対応する異なる +。 以下は、 +`gene_biotype`を使った例である: ```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + @@ -309,32 +305,32 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ``` -We can also specify the colors directly inside the mapping provided in -the `ggplot()` function. This will be seen by any geom layers and the -mapping will be determined by the x- and y-axis set up in `aes()`. 
+また、 +`ggplot()`関数で提供されるマッピングの中で直接色を指定することもできる。 これはどのジオムレイヤーでも見ることができ、 +のマッピングは `aes()` で設定したx軸とy軸によって決定される。 ```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype)) + + color = gene_biotype))+ geom_point(alpha = 0.3) ``` -Finally, we could also add a diagonal line with the `geom_abline()` -function: +最後に、`geom_abline()` +: ```{r adding-diag, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype)) + + color = gene_biotype))+ geom_point(alpha = 0.3) + geom_abline(intercept = 0) ``` -Notice that we can change the geom layer from `geom_point` to -`geom_jitter` and colors will still be determined by `gene_biotype`. +ジオムレイヤーを `geom_point` から +`geom_jitter` に変更しても、色は `gene_biotype` によって決定されることに注意してください。 ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype)) + + color = gene_biotype))+ geom_jitter(alpha = 0.3) + geom_abline(intercept = 0) ``` @@ -345,28 +341,28 @@ library("hexbin") ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Scatter plots can be useful exploratory tools for small datasets. For -data sets with large numbers of observations, such as the `rna_fc` -data set, overplotting of points can be a limitation of scatter plots. -One strategy for handling such settings is to use hexagonal binning of -observations. The plot space is tessellated into hexagons. Each -hexagon is assigned a color based on the number of observations that -fall within its boundaries. +散布図は、小規模なデータセットの探索に役立つツールである。 rna_fc\` +データセットのような多数のオブザベーションを持つ +データセットの場合、点のオーバープロットは散布図の制限となりうる。 +このような設定を扱うための1つの戦略は、 +の観測値を六角形にビニングすることである。 プロット空間は六角形にテッセレーションされている。 それぞれの +六角形は、 +その境界内に入るオブザベーションの数に基づいて色が割り当てられる。 -- To use hexagonal binning in `ggplot2`, first install the R package - `hexbin` from CRAN and load it. 
+- ggplot2`で六角ビニングを使用するには、まずRパッケージ + `hexbin\`をCRANからインストールしてロードする。 -- Then use the `geom_hex()` function to produce the hexbin figure. +- そして、`geom_hex()`関数を使ってhexbin図を作成する。 -- What are the relative strengths and weaknesses of a hexagonal bin - plot compared to a scatter plot? Examine the above scatter plot - and compare it with the hexagonal bin plot that you created. +- 散布図と比較して、六角形のビン + プロットの相対的な長所と短所は何か? 上記の散布図( + )を調べ、作成した六角形のビンプロットと比較する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=FALSE, purl=TRUE} install.packages("hexbin") @@ -387,18 +383,18 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Use what you just learned to create a scatter plot of `expression_log` -over `sample` from the `rna` dataset with the time showing in -different colors. Is this a good way to show this type of data? +今学んだことを使って、`rna`データセットから`sample`に対する`expression_log` +の散布図を作成し、 +異なる色で時間を表示する。 このようなデータを表示するのは良い方法ですか? 
-::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=TRUE, purl=TRUE} -ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + +ggplot(data = rna, mapping = aes(y = expression_log, x = sample))+ geom_point(aes(color = time)) ``` @@ -406,43 +402,43 @@ ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + :::::::::::::::::::::::::::::::::::::::::::::::::: -## Boxplot +## ボックスプロット -We can use boxplots to visualize the distribution of gene expressions -within each sample: +ボックスプロットを使って、各サンプル内の遺伝子発現の分布( +)を可視化することができる: ```{r boxplot, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample))+ geom_boxplot() ``` -By adding points to boxplot, we can have a better idea of the number of -measurements and of their distribution: +boxplotにポイントを追加することで、 +の測定数とその分布をよりよく知ることができる: ```{r boxplot-with-points, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample))+ geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Note how the boxplot layer is in front of the jitter layer? What do -you need to change in the code to put the boxplot below the points? +ボックスプロットレイヤーがジッターレイヤーの前にあることに注目してほしい。 +、ボックスプロットをポイントの下に配置するために、コードのどこを変更する必要がありますか? 
-::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション -We should switch the order of these two geoms: +この2つのジオムの順番を入れ替えるべきだ: ```{r boxplot-with-points2, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample))+ geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.2, color = "tomato") ``` @@ -451,51 +447,51 @@ ggplot(data = rna, :::::::::::::::::::::::::::::::::::::::::::::::::: -You may notice that the values on the x-axis are still not properly -readable. Let's change the orientation of the labels and adjust them -vertically and horizontally so they don't overlap. You can use a -90-degree angle, or experiment to find the appropriate angle for -diagonally oriented labels: +X軸の値がまだ正しく +読めないことにお気づきかもしれない。 ラベルの向きを変え、 +縦と横に重ならないように調整しよう。 +90度の角度を使ってもいいし、 +斜め向きのラベルに適切な角度を見つけるために実験してもいい: ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample))+ geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Add color to the data points on your boxplot according to the duration -of the infection (`time`). +感染期間( +)に応じて、箱ひげ図上のデータ点に色を付ける(`time`)。 -_Hint:_ Check the class for `time`. Consider changing the class of -`time` from integer to factor directly in the ggplot mapping. Why does -this change how R makes the graph? +_ヒント:_ `time`のクラスをチェックする。 +`time` のクラスをggplotマッピングで整数から因数に直接変更することを検討する。 +、Rのグラフの作り方が変わるのはなぜですか? 
-::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r boxplot-color-time, cache=FALSE, purl=TRUE} # time as integer ggplot(data = rna, mapping = aes(y = expression_log, - x = sample)) + - geom_jitter(alpha = 0.2, aes(color = time)) + + x = sample))+ + geom_jitter(alpha = 0.2, aes(color = time))+ geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) # time as factor ggplot(data = rna, mapping = aes(y = expression_log, - x = sample)) + - geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + x = sample))+ + geom_jitter(alpha = 0.2, aes(color = as.factor(time)))+ geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -504,25 +500,25 @@ ggplot(data = rna, ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Boxplots are useful summaries, but hide the _shape_ of the -distribution. For example, if the distribution is bimodal, we would -not see it in a boxplot. An alternative to the boxplot is the violin -plot, where the shape (of the density of points) is drawn. +箱ひげ図は便利な要約だが、 +分布の_形_を隠してしまう。 例えば、分布が二峰性であれば、 +、ボックスプロットではそれを見ることはできない。 箱ひげ図に代わるものとして、(点の密度の)形状を描くバイオリン +プロットがある。 -- Replace the box plot with a violin plot; see `geom_violin()`. Fill - in the violins according to the time with the argument `fill`. 
+- geom_violin()`を参照してください。 + 引数 `fill\` の時間に従ってヴァイオリンにフィルを入れる。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_violin(aes(fill = as.factor(time))) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + mapping = aes(y = expression_log, x = sample))+ + geom_violin(aes(fill = as.factor(time)))+ + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -531,129 +527,125 @@ ggplot(data = rna, ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -- Modify the violin plot to fill in the violins by `sex`. +- ヴァイオリンのプロットを修正し、ヴァイオリンを `sex` で埋める。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_violin(aes(fill = sex)) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + mapping = aes(y = expression_log, x = sample))+ + geom_violin(aes(fill = sex))+ + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -## Line plots +## 線プロット -Let's calculate the mean expression per duration of the infection for -the 10 genes having the highest log fold changes comparing time 8 versus -time 0. 
First, we need to select the genes and create a subset of `rna` -called `sub_rna` containing the 10 selected genes, then we need to group -the data and calculate the mean gene expression within each group: +時刻8と +時刻0を比較し、log fold変化が最も大きかった10の遺伝子について、感染期間ごとの平均発現量を計算してみよう。 まず、遺伝子を選択し、選択した10遺伝子を含む`sub_rna`と呼ばれる`rna` +のサブセットを作成する必要がある。次に、 +データをグループ化し、各グループ内の平均遺伝子発現を計算する必要がある: ```{r, purl=TRUE} -rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) +rna_fc<- rna_fc %>% arrange(desc(time_8_vs_0)) genes_selected <- rna_fc$gene[1:10] -sub_rna <- rna %>% +sub_rna<- rna %>% filter(gene %in% genes_selected) -mean_exp_by_time <- sub_rna %>% +mean_exp_by_time<- sub_rna %>% group_by(gene,time) %>% summarize(mean_exp = mean(expression_log)) mean_exp_by_time ``` -We can build the line plot with duration of the infection on the x-axis -and the mean expression on the y-axis: +X軸に感染期間( +)、Y軸に平均発現をとって折れ線グラフを作成することができる: ```{r first-time-series, purl=TRUE} -ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp))+ geom_line() ``` -Unfortunately, this does not work because we plotted data for all the -genes together. 
We need to tell ggplot to draw a line for each gene by -modifying the aesthetic function to include `group = gene`: +残念なことに、これはうまくいかない。というのも、 +の全遺伝子のデータをまとめてプロットしたからである。 各遺伝子に対して線を引くようにggplotに指示する必要がある。 +、`group = gene`を含むようにesthetic関数を修正する: ```{r time-series-by-gene, purl=TRUE} ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, group = gene)) + + mapping = aes(x = time, y = mean_exp, group = gene))+ geom_line() ``` -We will be able to distinguish genes in the plot if we add colors (using -`color` also automatically groups the data): +色をつければ、プロットの中で遺伝子を区別できるようになる( +`color`を使えば、自動的にデータをグループ化することもできる): ```{r time-series-with-colors, purl=TRUE} ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, color = gene)) + + mapping = aes(x = time, y = mean_exp, color = gene))+ geom_line() ``` -## Faceting +## ファセット -`ggplot2` has a special technique called _faceting_ that allows the user -to split one plot into multiple (sub) plots based on a factor included -in the dataset. These different subplots inherit the same properties -(axes limits, ticks, ...) to facilitate their direct comparison. We will -use it to make a line plot across time for each gene: +ggplot2\`には_faceting_と呼ばれる特別なテクニックがあり、 +、データセットに含まれる +、1つのプロットを複数の(サブ)プロットに分割することができる。 こ れ ら の異な る サブプ ロ ッ ト は、 同 じ プ ロ パテ ィ +を継承 し ます (軸の限界、目盛り、 ...)。 直接比較しやすくするためだ。 +、これを用いて各遺伝子について時間軸に沿った折れ線グラフを作成する: ```{r first-facet, purl=TRUE} ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + geom_line() + + mapping = aes(x = time, y = mean_exp))+ geom_line() + facet_wrap(~ gene) ``` -Here both x- and y-axis have the same scale for all the subplots. 
You -can change this default behavior by modifying `scales` in order to allow -a free scale for the y-axis: +ここでは、X軸とY軸はすべてのサブプロットで同じスケールになっている。、Y軸のスケールを自由に設定できるように`scales`を修正することで、このデフォルトの動作を変更することができる: ```{r first-facet-scales, purl=TRUE} ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + + mapping = aes(x = time, y = mean_exp))+ geom_line() + facet_wrap(~ gene, scales = "free_y") ``` -Now we would like to split the line in each plot by the sex of the mice. -To do that we need to calculate the mean expression in the data frame -grouped by `gene`, `time`, and `sex`: +ここで、各プロットの線をマウスの性別で分けたい。 +そのためには、 +、遺伝子、時間、性別でグループ化したデータフレームの平均発現を計算する必要がある: ```{r data-facet-by-gene-and-sex, purl=TRUE} -mean_exp_by_time_sex <- sub_rna %>% +mean_exp_by_time_sex<- sub_rna %>% group_by(gene, time, sex) %>% summarize(mean_exp = mean(expression_log)) mean_exp_by_time_sex ``` -We can now make the faceted plot by splitting further by sex using -`color` (within a single plot): +`color` を使ってさらに分割することで、ファセット化されたプロットを作ることができる(1つのプロット内で): ```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") ``` -Usually plots with white background look more readable when printed. We -can set the background to white using the function `theme_bw()`. 
-Additionally, we can remove the grid: +通常、背景が白いプロットは印刷したときに読みやすくなる。 +、関数 `theme_bw()` を使って背景を白に設定することができる。 +さらに、グリッドを削除することもできる: ```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + @@ -662,23 +654,22 @@ ggplot(data = mean_exp_by_time_sex, ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Use what you just learned to create a plot that depicts how the -average expression of each chromosome changes through the duration of -infection. +、各染色体の +平均発現量が感染期間を通じてどのように変化するかをプロットする。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r mean-exp-chromosome-time-series, purl=TRUE} -mean_exp_by_chromosome <- rna %>% +mean_exp_by_chromosome<- rna %>% group_by(chromosome_name, time) %>% - summarize(mean_exp = mean(expression_log)) + summarize(mean_exp = mean(expression_log)) ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, - y = mean_exp)) + + y = mean_exp))+ geom_line() + facet_wrap(~ chromosome_name, scales = "free_y") ``` @@ -687,197 +678,198 @@ ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, :::::::::::::::::::::::::::::::::::::::::::::::::: -The `facet_wrap` geometry extracts plots into an arbitrary number of -dimensions to allow them to cleanly fit on one page. On the other hand, -the `facet_grid` geometry allows you to explicitly specify how you want -your plots to be arranged via formula notation (`rows ~ columns`; a `.` -can be used as a placeholder that indicates only one row or column). 
+facet_wrap`ジオメトリは、1ページにきれいに収まるように、プロットを任意の数の +次元に抽出する。 一方、 +`facet_grid` ジオメトリでは、 +数式表記(`rows ~ columns`; `.\` +は、1つの行または列のみを示すプレースホルダとして使用できます)によって、プロットの配置方法を明示的に指定することができます。 -Let's modify the previous plot to compare how the mean gene expression -of males and females has changed through time: +先ほどのプロットを修正し、男性と女性の平均遺伝子発現 +が経時的にどのように変化したかを比較してみよう: ```{r mean-exp-time-facet-sex-rows, purl=TRUE} -# One column, facet by rows +# 1列、行ごとのファセット ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + + mapping = aes(x = time, y = mean_exp, color = gene))+ geom_line() + facet_grid(sex ~ .) ``` ```{r mean-exp-time-facet-sex-columns, purl=TRUE} -# One row, facet by column +# 1行、列ごとのファセット ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + + mapping = aes(x = time, y = mean_exp, color = gene))+ geom_line() + facet_grid(. ~ sex) ``` -## `ggplot2` themes +## ggplot2\`テーマ -In addition to `theme_bw()`, which changes the plot background to white, -`ggplot2` comes with several other themes which can be useful to quickly -change the look of your visualization. The complete list of themes is -available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). -`theme_minimal()` and `theme_light()` are popular, and `theme_void()` -can be useful as a starting point to create a new hand-crafted theme. +`ggplot2` には、プロットの背景を白に変更する `theme_bw()` に加えて、 +手早く視覚化の見た目を変更するのに便利なテーマがいくつか用意されている。 +テーマの全リストは[https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html)で入手できる。 +theme_minimal()`や`theme_light()`は人気があり、`theme_void()\` +は新しい手作りのテーマを作る出発点として役に立つ。 -The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) -package provides a wide variety of options (including an Excel 2003 -theme). The ggplot2 provides a list of -packages that extend the capabilities of `ggplot2`, including additional -themes. 
+ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +パッケージは様々なオプション(Excel 2003 +テーマを含む)を提供します。 `ggplot2` extensions +website](https://exts.ggplot2.tidyverse.org/) は、追加の +テーマを含む `ggplot2` の機能を拡張する +パッケージのリストを提供しています。 -## Customisation +## カスタマイズ -Let's come back to the faceted plot of mean expression by time and gene, -colored by sex. +時間別、遺伝子別の平均発現のファセット・プロットに戻ろう。 +性別に色分けされている。 -Take a look at the ggplot2, -and think of ways you could improve the plot. +ggplot2\`のカンニング +シート](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf), +を見て、プロットを改善する方法を考えてみてください。 -Now, we can change names of axes to something more informative than -'time' and 'mean\_exp', and add a title to the figure: +ここで、軸の名前を +'time' や 'mean_exp' よりも情報量の多いものに変更し、図にタイトルを追加する: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + labs(title = "感染期間別の平均遺伝子発現", + x = "感染期間(日単位)", + y = "平均遺伝子発現") ``` -The axes have more informative names, but their readability can be -improved by increasing the font size: +軸にはより情報量の多い名前が付けられているが、フォントサイズを大きくすることで読みやすさは +: ```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + + labs(title = "感染期間別の平均遺伝子発現", + x = 
"感染期間(日単位)", + y = "平均遺伝子発現") + theme(text = element_text(size = 16)) ``` -Note that it is also possible to change the fonts of your plots. If you -are on Windows, you may have to install the . +プロットのフォントを変更することも可能です。 +Windowsをお使いの場合は、 をインストールする必要があるかもしれません。 -We can further customize the color of x- and y-axis text, the color of -the grid, etc. We can also for example move the legend to the top by -setting `legend.position` to `"top"`. +さらに、X軸とY軸のテキストの色、 +グリッドの色などをカスタマイズできる。 例えば、 +`legend.position`を`"top"`に設定することで、凡例を一番上に移動させることもできる。 ```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + + labs(title = "感染期間別の平均遺伝子発現", + x = "感染期間(日単位)", + y = "平均遺伝子発現") + theme(text = element_text(size = 16), - axis.text.x = element_text(colour = "royalblue4", size = 12), - axis.text.y = element_text(colour = "royalblue4", size = 12), - panel.grid = element_line(colour="lightsteelblue1"), + axis.text.x = element_text(color = "royalblue4", size = 12), + axis.text.y = element_text(color = "royalblue4", size = 12), + panel.grid = element_line(color="lightsteelblue1"), legend.position = "top") ``` -If you like the changes you created better than the default theme, you -can save them as an object to be able to easily apply them to other -plots you may create. Here is an example with the histogram we have -previously created. 
+作成した変更がデフォルトのテーマよりも気に入った場合は、 +オブジェクトとして保存して、作成した他の +プロットに簡単に適用することができます。 以下は、 +以前に作成したヒストグラムを使った例です。 ```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} -blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", +blue_theme <- theme(axis.text.x = element_text(color = "royalblue4", size = 12), - axis.text.y = element_text(colour = "royalblue4", + axis.text.y = element_text(color = "royalblue4", size = 12), text = element_text(size = 16), - panel.grid = element_line(colour="lightsteelblue1")) + panel.grid = element_line(color="lightsteelblue1")) -ggplot(rna, aes(x = expression_log)) + +ggplot(rna, aes(x = expression_log))+ geom_histogram(bins = 20) + blue_theme ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -With all of this information in hand, please take another five minutes -to either improve one of the plots generated in this exercise or -create a beautiful graph of your own. Use the RStudio ggplot2 -for inspiration. Here are some ideas: +これらの情報を手に入れたら、 +この練習で作成したプロットのいずれかを改良するか、 +あなた自身の美しいグラフを作成してください。 RStudio ggplot2 +を参考にしてください。 いくつかアイデアを挙げてみよう: -- See if you can change the thickness of the lines. -- Can you find a way to change the name of the legend? What about - its labels? (hint: look for a ggplot function starting with - `scale_`) -- Try using a different color palette or manually specifying the - colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). +- 線の太さを変えてみてください。 +- 凡例の名前を変える方法はありますか? + そのラベルについてはどうだろう? 
(ヒント: + `scale_` で始まる ggplot 関数を探す) +- 別のカラーパレットを使用するか、線の + カラーを手動で指定してみてください( + [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/) 参照)。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション -For example, based on this plot: +例えば、このプロットに基づくと ```{r, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) ``` -We can customize it the following ways: +以下のようなカスタマイズが可能です: ```{r, purl=TRUE} -# change the thickness of the lines +# 線の太さを変更する ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line(size=1.5) + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) -# change the name of the legend and the labels +# 凡例とラベルの名前を変更する ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_discrete(name = "Gender", labels = c("F", "M")) + scale_color_discrete(name = "Gender", labels = c("F", "M")) # using a different color palette ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") 
-# manually specifying the colors +# 手動で色を指定 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_manual(name = "Gender", labels = c("F", "M"), - values = c("royalblue", "deeppink")) + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) ``` @@ -885,28 +877,27 @@ ggplot(data = mean_exp_by_time_sex, :::::::::::::::::::::::::::::::::::::::::::::::::: -## Composing plots +## プロットの構成 -Faceting is a great tool for splitting one plot into multiple subplots, -but sometimes you may want to produce a single figure that contains -multiple independent plots, i.e. plots that are based on different -variables or even different data frames. +ファセッティングは、1つのプロットを複数のサブプロットに分割するのに最適なツールです。 +しかし、 +複数の独立したプロット、すなわち異なる +変数、あるいは異なるデータフレームに基づくプロットを含む1つの図を作成したい場合があります。 -Let's start by creating the two plots that we want to arrange next to -each other: +まず、隣り合わせに配置したい2つのプロットを作成する: -The first graph counts the number of unique genes per chromosome. We -first need to reorder the levels of `chromosome_name` and filter the -unique genes per chromosome. We also change the scale of the y-axis to a -log10 scale for better readability. 
+最初のグラフは、染色体ごとにユニークな遺伝子の数を数えている。 +まず`chromosome_name`のレベルを並べ替え、 +染色体ごとにユニークな遺伝子をフィルタリングする必要がある。 また、読みやすくするために、Y軸のスケールを +log10スケールに変更した。 ```{r sub1, purl=TRUE} rna$chromosome_name <- factor(rna$chromosome_name, - levels = c(1:19,"X","Y")) + levels = c(1:19, "X", "Y")) -count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% +count_gene_chromosome<- rna %>% select(chromosome_name, gene) %>% distinct() %>% ggplot() + - geom_bar(aes(x = chromosome_name), fill = "seagreen", + geom_bar(aes(x = chromosome_name), fill = "seagreen", position = "dodge", stat = "count") + labs(y = "log10(n genes)", x = "chromosome") + scale_y_log10() @@ -914,12 +905,12 @@ count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% count_gene_chromosome ``` -Below, we also remove the legend altogether by setting the -`legend.position` to `"none"`. +以下では、 +`legend.position` を `"none"` に設定することで、凡例を完全に削除することもできる。 ```{r sub2, purl=TRUE} exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), - color=sex)) + + color=sex))+ geom_boxplot(alpha = 0) + labs(y = "Mean gene exp", x = "time") + theme(legend.position = "none") @@ -927,11 +918,11 @@ exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), exp_boxplot_sex ``` -The [**patchwork**](https://github.com/thomasp85/patchwork) package -provides an elegant approach to combining figures using the `+` to -arrange figures (typically side by side). More specifically the `|` -explicitly arranges them side by side and `/` stacks them on top of each -other. 
+[**patchwork**](https://github.com/thomasp85/patchwork)パッケージ +は、`+` を使って図を組み合わせるエレガントなアプローチを提供し、 +図を(通常は横に並べて)配置します。 より具体的には、`|` +は明示的に横に並べ、`/`は +上下に重ねる。 ```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} install.packages("patchwork") @@ -947,8 +938,7 @@ count_gene_chromosome + exp_boxplot_sex count_gene_chromosome / exp_boxplot_sex ``` -We can combine further control the layout of the final composition with -`plot_layout` to create more complex layouts: +`plot_layout` と組み合わせることで、最終的なコンポジションのレイアウトをさらにコントロールし、より複雑なレイアウトを作成することができる: ```{r patchwork3, purl=TRUE} count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) @@ -961,7 +951,7 @@ count_gene_chromosome + plot_layout(ncol = 1) ``` -The last plot can also be created using the `|` and `/` composers: +最後のプロットは `|` と `/` のコンポーザーを使っても作成できる: ```{r patchwork5, purl=TRUE} count_gene_chromosome / @@ -969,12 +959,12 @@ count_gene_chromosome / exp_boxplot_sex ``` -Learn more about `patchwork` on its -[webpage](https://patchwork.data-imaginist.com/) or in this -[video](https://www.youtube.com/watch?v=0m4yywqNPVY). +`patchwork` の詳細については、 +[ウェブページ](https://patchwork.data-imaginist.com/)、またはこちらの +[ビデオ](https://www.youtube.com/watch?v=0m4yywqNPVY)をご覧ください。 -Another option is the **`gridExtra`** package that allows to combine -separate ggplots into a single figure using `grid.arrange()`: +もう一つのオプションは **`gridExtra`** パッケージで、 +別々の ggplot を `grid.arrange()` を使って一つの図にまとめることができます: ```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} install.packages("gridExtra") @@ -985,122 +975,119 @@ library("gridExtra") grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) ``` -In addition to the `ncol` and `nrow` arguments, used to make simple -arrangements, there are tools for constructing more complex -layouts. +引数 `ncol` と `nrow` は単純な +配置を作るのに使われるが、それに加えて、より複雑な +レイアウトを作るためのツールもある。 -## Exporting plots +## プロットのエクスポート -After creating your plot, you can save it to a file in your favorite -format. 
The Export tab in the **Plot** pane in RStudio will save your -plots at low resolution, which will not be accepted by many journals and -will not scale well for posters. +プロットを作成したら、お好きな +形式でファイルに保存できます。 RStudioの**Plot**ペインのExportタブでは、 +プロットが低解像度で保存されます。これは多くのジャーナルでは受け入れられませんし、 +ポスター用にうまく拡大縮小することもできません。 -Instead, use the `ggsave()` function, which allows you easily change the -dimension and resolution of your plot by adjusting the appropriate -arguments (`width`, `height` and `dpi`). +代わりに `ggsave()` 関数を使います。この関数を使うと、 +適切な引数(`width`, `height`, `dpi`)を調整することで、プロットの +寸法と解像度を簡単に変更することができます。 -Make sure you have the `fig_output/` folder in your working directory. +作業ディレクトリに `fig_output/` フォルダがあることを確認してください。 ```{r ggsave-example, eval=FALSE, purl=TRUE} my_plot <- ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex))+ geom_line() + facet_wrap(~ gene, scales = "free_y") + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + - guides(color=guide_legend(title="Gender")) + + labs(title = "感染期間別の平均遺伝子発現", + x = "感染期間(日)", + y = "平均遺伝子発現") + + guides(color=guide_legend(title="Gender"))+ theme_bw() + - theme(axis.text.x = element_text(colour = "royalblue4", size = 12), - axis.text.y = element_text(colour = "royalblue4", size = 12), + theme(axis.text.x = element_text(color = "royalblue4", size = 12), + axis.text.y = element_text(color = "royalblue4", size = 12), text = element_text(size = 16), - panel.grid = element_line(colour="lightsteelblue1"), + panel.grid = element_line(color="lightsteelblue1"), legend.position = "top") ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, height = 10) -# This also works for grid.arrange() plots +# これは grid.arrange() プロットでも動作します combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2, widths = c(4, 6)) 
ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, width = 10, dpi = 300) ``` -Note: The parameters `width` and `height` also determine the font size -in the saved plot. +注意: パラメータ `width` と `height` は、保存されたプロットのフォントサイズ +も決定します。 ```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} -### Final plotting challenge: -## With all of this information in hand, please take another five -## minutes to either improve one of the plots generated in this -## exercise or create a beautiful graph of your own. Use the RStudio -## ggplot2 cheat sheet for inspiration: -## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +### プロットの最終課題: + +## +## の練習で生成されたプロットを改良するか、あなた自身の美しいグラフを作成し てください。RStudio +## ggplot2 チートシートを参考にしてください: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf ``` -## Other packages for visualisation +## その他のビジュアライゼーション用パッケージ -`ggplot2` is a very powerful package that fits very nicely in our _tidy -data_ and _tidy tools_ pipeline. There are other visualization packages -in R that shouldn't be ignored. +ggplot2\`は非常に強力なパッケージで、我々の_tidy +data_と_tidy tools_パイプラインにうまくフィットする。 Rには他にも無視できない可視化パッケージ +がある。 -### Base graphics +### ベースグラフィック -The default graphics system that comes with R, often called _base R -graphics_ is simple and fast. It is based on the _painter's or canvas -model_, where different output are directly overlaid on top of each -other (see figure @ref(fig:paintermodel)). This is a fundamental -difference with `ggplot2` (and with `lattice`, described below), that -returns dedicated objects, that are rendered on screen or in a file, and -that can even be updated. 
+Rに付属するデフォルトのグラフィックス・システムは、しばしば_ベースR +グラフィックス_と呼ばれ、シンプルで高速である。 これは_画家またはキャンバスの +モデル_に基づいており、異なる出力がそれぞれ +互いに直接重ね合わされる(図@ref(fig:paintermodel)を参照)。 これは`ggplot2`(および後述する`lattice`)との基本的な +の違いである。 +は専用のオブジェクトを返し、それは画面上またはファイル上にレンダリングされ、 +は更新することさえできる。 ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "First layer, produced with plot(1:20)") +plot(1:20, main = "最初のレイヤー、plot(1:20)で作成") -plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +plot(1:20, main = "赤の水平線、abline(h = 10)で追加") abline(h = 10, col = "red") -plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") abline(h = 10, col = "red") rect(5, 5, 15, 15, lwd = 3) ``` -Another main difference is that base graphics' plotting function try to -do _the right_ thing based on their input type, i.e. they will adapt -their behaviour based on the class of their input. This is again very -different from what we have in `ggplot2`, that only accepts dataframes -as input, and that requires plots to be constructed bit by bit. +もうひとつの主な違いは、ベース・グラフィックスのプロット関数は、入力タイプに基づいて、 +、_正しい_ことをしようとする。 これは、 +データフレームしか入力として受け付けない`ggplot2`で、プロットをビットごとに構築する必要があるのとは、 +異なる。 ```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right)."} par(mfrow = c(2, 2)) boxplot(rnorm(100), - main = "Boxplot of rnorm(100)") + main = "rnorm(100) の Boxplot") boxplot(matrix(rnorm(100), ncol = 10), - main = "Boxplot of matrix(rnorm(100), ncol = 10)") + main = "matrix(rnorm(100), ncol = 10) の Boxplot") hist(rnorm(100)) hist(matrix(rnorm(100), ncol = 10)) ``` -The out-of-the-box approach in base graphics can be very efficient for -simple, standard figures, that can be produced very quickly with a -single line of code and a single function such as `plot`, or `hist`, or -`boxplot`, ... 
The defaults are however not always the most appealing -and tuning of figures, especially when they become more complex (for -example to produce facets), can become lengthy and cumbersome. +、`plot`、`hist`、 `boxplot`、...のような1行のコードと1つの関数で非常に素早く作成できる。 +しかし、デフォルトは必ずしも最も魅力的なものではありません。 +、図形のチューニングは、特に複雑になると(例えばファセットを生成するために +)、時間がかかり面倒になります。 -### The lattice package +### 格子パッケージ -The **`lattice`** package is similar to `ggplot2` in that is uses -dataframes as input, returns graphical objects and supports faceting. -`lattice` however isn't based on the grammar of graphics and has a more -convoluted interface. +lattice`** パッケージは `ggplot2` と似ているが、 +データフレームを入力として使い、グラフィカルオブジェクトを返し、ファセットをサポートする。 +しかし、`lattice\`はグラフィックの文法に基づいておらず、 +より複雑なインターフェイスを持っている。 -A good reference for the `lattice` package is @latticebook. +lattice\`パッケージの良いリファレンスは@latticebookだ。 -:::::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント -- Visualization in R +- Rによる可視化 :::::::::::::::::::::::::::::::::::::::::::::::::: From 0a1c65221471dea609afaeef03ab07e6db3b3620 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 05:14:22 +0900 Subject: [PATCH 162/334] New translations 60-next-steps.md (Japanese) --- locale/ja/episodes/60-next-steps.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/ja/episodes/60-next-steps.Rmd b/locale/ja/episodes/60-next-steps.Rmd index a1fe83aee..eb68df4d6 100644 --- a/locale/ja/episodes/60-next-steps.Rmd +++ b/locale/ja/episodes/60-next-steps.Rmd @@ -292,7 +292,7 @@ function.--> 時刻0と時刻8のサンプル 、最初の3遺伝子の遺伝子発現レベルを抽出する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -313,7 +313,7 @@ assay(se)[1:3, colData(se)$time != 4] 長い`rna`テーブルを使用して同じ値が得られることを確認する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -448,7 +448,7 @@ RNAシーケンス解析に焦点を当てた次のトレーニング、 、Bioconductor `DESeq2`パッケージを使って、 差分発現解析を行う方法を学ぶ。 
DESeq2`パッケージの全解析は`SummarizedExperiment\` で処理される。 -::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント +:::::::::::::::::::::::::::::::::::::::: keypoints - Bioconductorは、ハイスループットな生物学データの理解( )のためのサポートとパッケージを提供するプロジェクトである。 From 9a26ea2dbfa5cb1822122099fb2aa46320d2f653 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 06:17:36 +0900 Subject: [PATCH 163/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 39ed69b48..6104c8959 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -11,7 +11,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` -::::::::::::::::::::::::::::::::::::::: 目的 +::::::::::::::::::::::::::::::::::::::: objectives - ggplotを使って散布図、箱ひげ図、折れ線グラフなどを作成する。 - ユニバーサルプロット設定を行う。 From 5d1a9457c4b604b183ab4da9c581723449ebdd36 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 07:16:27 +0900 Subject: [PATCH 164/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 6104c8959..4065fb4a6 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -158,7 +158,7 @@ ggplot(rna, aes(x = expression))+ geom_histogram()`の引数`bins`または`binwidth\` を変更して、 ビンの数または幅を変更する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション From c72529012d60a09f66bec37839a571bf19e32738 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 08:37:33 +0900 Subject: [PATCH 165/334] New translations 
20-r-rstudio.md (Japanese) --- locale/ja/episodes/20-r-rstudio.Rmd | 928 ++++++++++++++-------------- 1 file changed, 457 insertions(+), 471 deletions(-) diff --git a/locale/ja/episodes/20-r-rstudio.Rmd b/locale/ja/episodes/20-r-rstudio.Rmd index 6e7104273..747be52b4 100644 --- a/locale/ja/episodes/20-r-rstudio.Rmd +++ b/locale/ja/episodes/20-r-rstudio.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: R and RStudio +title: RとRStudio teaching: 30 exercises: 0 --- @@ -10,329 +10,320 @@ exercises: 0 ::::::::::::::::::::::::::::::::::::::: 目的 -- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. -- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. -- Use the built-in RStudio help interface to search for more information on R functions. -- Demonstrate how to provide sufficient information for troubleshooting with the R user community. +- RStudio スクリプト、コンソール、環境、およびプロットペインの目的について説明します。 +- Rプロジェクトとして一連の分析のためのファイルとディレクトリを整理し、作業ディレクトリの目的を理解する。 +- RStudio 組み込みのヘルプインターフェイスを使用して、R 関数の詳細情報を検索します。 +- Rのユーザーコミュニティとトラブルシューティングのために十分な情報を提供する方法を示す。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What are R and RStudio? +- RとRStudioとは何ですか? :::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> このエピソードは、Data Carpentriesの_Data Analysis and +> Visualisation in R for Ecologists_レッスンに基づいています。 -## What is R? RStudioとは何ですか? +## Rとは? RStudioとは何ですか? -The term [R](https://www.r-project.org/) is used to refer to the -_programming language_, the _environment for statistical computing_ -and _the software_ that interprets the scripts written using it. 
+R](https://www.r-project.org/)という用語は、 +_プログラミング言語_、統計計算_のための_環境 +、それを使って書かれたスクリプトを解釈する_ソフトウェア_を指すのに使われる。 -[RStudio](https://rstudio.com) is currently a very popular way to not -only write your R scripts but also to interact with the R -software[^plainr]. To function correctly, RStudio needs R and -therefore both need to be installed on your computer. +[RStudio](https://rstudio.com)は現在、Rスクリプトを書くだけでなく、R +ソフトウェア[^plainr]と対話するための非常に人気のある方法です。 RStudio を正しく機能させるには、R と +が必要です。 -[^plainr]: As opposed to using R directly from the command line - console. There exist other software that interface and integrate - with R, but RStudio is particularly well suited for beginners - while providing numerous very advanced features. +[^plainr]: コマンドライン + コンソールから直接Rを使うのとは対照的だ。 + 、Rとインターフェイスし統合するソフトウェアは他にもあるが、RStudioは非常に高度な機能を数多く備えながら、特に初心者向け + 。 -The RStudio IDE Cheat -Sheet -provides much more information than will be covered here, but can be -useful to learn keyboard shortcuts and discover new features. +RStudio IDE Cheat +Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf) +、ここで説明するよりもはるかに多くの情報を提供していますが、キーボードショートカットを学んだり、新しい機能を発見したりするのに便利です。 -## Why learn R? +## なぜRを学ぶのか? -### R does not involve lots of pointing and clicking, and that's a good thing +### Rはポインティングやクリックを多用しない。 -The learning curve might be steeper than with other software, but with -R, the results of your analysis do not rely on remembering a -succession of pointing and clicking, but instead on a series of -written commands, and that's a good thing! So, if you want to redo -your analysis because you collected more data, you don't have to -remember which button you clicked in which order to obtain your -results; you just have to run your script again. +学習曲線は他のソフトウェアよりも急かもしれないが、 +Rを使えば、分析結果は +のポインティングとクリックの連続を覚えることに依存するのではなく、代わりに +の一連のコマンドを書くことに依存する! 
+そのため、より多くのデータを収集したため、 +分析をやり直したい場合、 +結果を得るためにどのボタンをどの順番でクリックしたかを覚えておく必要はない。スクリプトを再度実行するだけでよい。 -Working with scripts makes the steps you used in your analysis clear, -and the code you write can be inspected by someone else who can give -you feedback and spot mistakes. +スクリプトを使用することで、分析で使用したステップが明確になり、 +、書いたコードを他の誰かが検査することができ、 +フィードバックを与え、間違いを発見することができる。 -Working with scripts forces you to have a deeper understanding of what -you are doing, and facilitates your learning and comprehension of the -methods you use. +スクリプトを使って仕事をすることで、自分がやっている +の内容をより深く理解することになり、自分が使っている +メソッドの学習と理解が容易になる。 -### R code is great for reproducibility +### Rコードは再現性に優れている -Reproducibility means that someone else (including your future self) can -obtain the same results from the same dataset when using the same -analysis code. +再現性とは、同じデータセットから同じ解析コード( +)を使ったときに、他の誰か(未来の自分を含む)が +、同じ結果を得られることを意味する。 -R integrates with other tools to generate manuscripts or reports from your -code. If you collect more data, or fix a mistake in your dataset, the -figures and the statistical tests in your manuscript or report are updated -automatically. +Rは他のツールと統合し、 +のコードから原稿やレポートを作成することができる。 さらにデータを集めたり、データセットの誤りを修正したりすると、原稿や報告書の +図や統計検定が自動的に更新されます。 +。 -An increasing number of journals and funding agencies expect analyses -to be reproducible, so knowing R will give you an edge with these -requirements. +ジャーナルや研究助成機関では、 +、再現性のある分析を求めるところが増えている。Rを知っていれば、このような +。 -### R is interdisciplinary and extensible +### Rは学際的で拡張性がある -With 10000+ packages[^whatarepkgs] that can be installed to extend its -capabilities, R provides a framework that allows you to combine -statistical approaches from many scientific disciplines to best suit -the analytical framework you need to analyse your data. For instance, -R has packages for image analysis, GIS, time series, population -genetics, and a lot more. 
+機能を拡張するためにインストールできる10000以上のパッケージ[^whatarepkgs]により、Rは、 +多くの科学分野からの統計的アプローチを組み合わせることができるフレームワークを提供し、 +データの分析に必要な分析フレームワークに最適です。 例えば、 +Rには画像分析、GIS、時系列、集団 +遺伝学、その他多くのパッケージがある。 -[^whatarepkgs]: i.e. add-ons that confer R with new functionality, - such as bioinformatics data analysis. +[^whatarepkgs]: すなわち、バイオインフォマティクスのデータ解析など、Rに新しい機能を付与するアドオンである。 + 。 ```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/cran.png") ``` -### R works on data of all shapes and sizes +### Rはあらゆる形や大きさのデータを扱う -The skills you learn with R scale easily with the size of your -dataset. Whether your dataset has hundreds or millions of lines, it -won't make much difference to you. +Rで学ぶスキルは、 +データセットの大きさに合わせて簡単にスケールアップできる。 データセットの行数が数百行であろうと数百万行であろうと、 +、大差はないだろう。 -R is designed for data analysis. It comes with special data structures -and data types that make handling of missing data and statistical -factors convenient. +Rはデータ分析用に設計されている。 欠損データや統計的 +因子の取り扱いを便利にする特別なデータ構造 +とデータ型が付属している。 -R can connect to spreadsheets, databases, and many other data formats, -on your computer or on the web. +Rは、スプレッドシート、データベース、その他多くのデータ形式、 +、コンピュータ上またはウェブ上に接続することができます。 -### R produces high-quality graphics +### Rは高品質のグラフィックを作成する -The plotting functionalities in R are extensive, and allow you to adjust -any aspect of your graph to convey most effectively the message from -your data. +Rのプロット機能は充実しており、 +データからのメッセージを最も効果的に伝えるために、グラフのあらゆる面を調整することができる。 -### R has a large and welcoming community +### Rは大きく歓迎されるコミュニティ -Thousands of people use R daily. Many of them are willing to help you -through mailing lists and websites such as Stack -Overflow, or on the RStudio -community. These broad user communities -extend to specialised areas such as bioinformatics. 
One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). +何千人もの人々が毎日Rを利用している。 彼らの多くは、メーリングリストやStack +Overflowのようなウェブサイト、またはRStudio +communityを通じて、 +。 こうした広範なユーザー・コミュニティは、 +、バイオインフォマティクスのような専門分野にも広がっている。 Rコミュニティのそのようなサブセットの1つが、[Bioconductor](https://bioconductor.org/)である。"現在および将来の生物学的アッセイからのデータの "分析と理解のための科学的プロジェクトである。 このワークショップは、Bioconductor コミュニティのメンバーによって開発されました。Bioconductor についての詳細は、関連ワークショップ ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/) をご覧ください。 -### Not only is R free, but it is also open-source and cross-platform +### Rは無料であるだけでなく、オープンソースでクロスプラットフォームである。 -Anyone can inspect the source code to see how R works. Because of this -transparency, there is less chance for mistakes, and if you (or -someone else) find some, you can report and fix bugs. +Rがどのように動作するかは、誰でもソースコードを調べることができる。 この +の透明性により、ミスが発生する可能性は低くなり、もしあなた(または +他の誰か)がミスを発見したら、バグを報告して修正することができる。 -## Knowing your way around RStudio +## RStudioを使いこなす -Let's start by learning about [RStudio](https://www.rstudio.com/), -which is an Integrated Development Environment (IDE) for working with -R. +まずは[RStudio](https://www.rstudio.com/)について学んでみよう。 +は +R を扱うための統合開発環境(IDE)だ。 -The RStudio IDE open-source product is free under the Affero General -Public License (AGPL) v3. -The RStudio IDE is also available with a commercial license and -priority email support from Posit, Inc. 
+RStudio IDE オープンソース製品は、Affero General +Public License (AGPL) v3の下でフリーです。 +RStudio IDE は、Posit, Inc.の商用ライセンスおよび +優先メールサポートでもご利用いただけます。 -We will use the RStudio IDE to write code, navigate the files on our -computer, inspect the variables we are going to create, and visualise -the plots we will generate. RStudio can also be used for other things -(e.g., version control, developing packages, writing Shiny apps) that -we will not cover during the workshop. +RStudio IDE を使ってコードを書き、 +コンピュータ上のファイルを操作し、これから作成する変数を検査し、 +生成するプロットを視覚化する。 RStudioは他にも +(例:バージョン管理、パッケージの開発、Shynyアプリの作成)にも使えます。 +ワークショップでは取り上げません。 ```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/rstudio-screenshot.png") ``` -The RStudio window is divided into 4 "Panes": - -- the **Source** for your scripts and documents (top-left, in the - default layout) -- your **Environment/History** (top-right), -- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and -- the R **Console** (bottom-left). - -The placement of these panes and their content can be customised (see -menu, `Tools -> Global Options -> Pane Layout`). - -One of the advantages of using RStudio is that all the information you -need to write code is available in a single window. Additionally, with -many shortcuts, **autocompletion**, and **highlighting** for the major -file types you use while developing in R, RStudio will make typing -easier and less error-prone. - -## Getting set up - -It is good practice to keep a set of related data, analyses, and text -self-contained in a single folder, called the **working -directory**. All of the scripts within this folder can then use -**relative paths** to files that indicate where inside the project a -file is located (as opposed to absolute paths, which point to where a -file is on a specific computer). 
Working this way makes it a lot -easier to move your project around on your computer and share it with -others without worrying about whether or not the underlying scripts -will still work. - -RStudio provides a helpful set of tools to do this through its "Projects" -interface, which not only creates a working directory for you, but also remembers -its location (allowing you to quickly navigate to it) and optionally preserves -custom settings and open files to make it easier to resume work after a -break. Go through the steps for creating an "R Project" for this -tutorial below. - -1. Start RStudio. -2. Under the `File` menu, click on `New project`. Choose `New directory`, then - `New project`. -3. Enter a name for this new folder (or "directory"), and choose a - convenient location for it. This will be your **working directory** - for this session (or whole course) (e.g., `bioc-intro`). -4. Click on `Create project`. -5. (Optional) Set Preferences to 'Never' save workspace in RStudio. - -RStudio's default preferences generally work well, but saving a workspace to -.RData can be cumbersome, especially if you are working with larger datasets. -To turn that off, go to Tools --> 'Global Options' and select the 'Never' option -for 'Save workspace to .RData' on exit. 
+RStudio ウィンドウは 4 つの「ペイン」に分かれています: + +- スクリプトとドキュメント用の **Source**( + デフォルトレイアウトでは左上)、 +- **Environment/History**(右上)、 +- **Files/Plots/Packages/Help/Viewer**(右下)、そして +- R の **Console**(左下)。 + +これらのペインの配置とその内容はカスタマイズすることができます( +メニューの `Tools -> Global Options -> Pane Layout` を参照してください)。 + +RStudioを使う利点の1つは、コードを書くために必要なすべての情報 +が1つのウィンドウで利用できることです。 さらに、 +多くのショートカット、**オートコンプリート**、およびRでの開発中に使用する主な +ファイルタイプの**ハイライト**により、RStudioは +入力を容易にし、エラーを少なくします。 + +## セットアップ + +関連するデータ、分析、テキスト +のセットは、**working +directory**(作業ディレクトリ)と呼ばれる1つのフォルダに自己完結させておくのがよい習慣である。 このフォルダー内のすべてのスクリプトは、 +**相対パス** を使用して、 +ファイルがプロジェクト内のどこにあるかを示すことができます( +ファイルが特定のコンピューター上のどこにあるかを示す絶対パスとは異なります)。 この方法で作業することで、 +自分のコンピュータ上でプロジェクトを移動したり、 +他の人と共有したりすることが、基盤となるスクリプト +がまだ動くかどうかを心配することなく、とても簡単になる。 + +RStudioは、"Projects" +インターフェイスを通じて、このような作業を行うための便利なツールセットを提供しています。このツールは、作業ディレクトリを作成するだけでなく、 +その場所を記憶し(すぐに移動できるようになります)、 +カスタム設定や開いているファイルを保存して、 +休憩後に作業を再開しやすくすることもできます。 この +チュートリアルのための "R Project" の作成手順を以下に示す。 + +1. RStudioを起動します。 +2. `File`メニューの下にある`New project`をクリックする。 `New directory`を選択し、 + `New project`を選択する。 +3. この新しいフォルダ(または「ディレクトリ」)の名前を入力し、 + 便利な場所を選択します。 これはこのセッション (またはコース全体) の **作業ディレクトリ** + になります (例 `bioc-intro`)。 +4. `Create project`をクリックする。 +5. (オプション)RStudio でワークスペースを保存しない('Never')設定にします。 + +RStudioのデフォルトの環境設定は一般的にうまく機能しますが、ワークスペースを +.RDataに保存するのは、特に大きなデータセットを扱う場合は面倒です。 +これをオフにするには、Tools → 'Global Options' で、'Save workspace to .RData' on exit に対して +'Never' オプションを選択します。 ```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudio-preferences.png") ``` -To avoid character encoding issues between Windows and other operating -systems, we are -going to set UTF-8 by default: +[ウィンドウズと他のオペレーティング・システム間の文字エンコーディングの問題](https://yihui.name/en/2018/11/biggest-regret-knitr/)を避けるため、 +デフォルトでUTF-8を設定します: ```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future.
(Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/utf8.png") ``` -### Organizing your working directory - -Using a consistent folder structure across your projects will help keep things -organised, and will also make it easy to find/file things in the future. This -can be especially helpful when you have multiple projects. In general, you may -create directories (folders) for **scripts**, **data**, and **documents**. - -- **`data/`** Use this folder to store your raw data and intermediate - datasets you may create for the need of a particular analysis. For - the sake of transparency and - [provenance](https://en.wikipedia.org/wiki/Provenance), you should - _always_ keep a copy of your raw data accessible and do as much of - your data cleanup and preprocessing programmatically (i.e., with - scripts, rather than manually) as possible. Separating raw data - from processed data is also a good idea. For example, you could - have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept - separate from a `data/processed/tree.survey.csv` file generated by - the `scripts/01.preprocess.tree_survey.R` script. -- **`documents/`** This would be a place to keep outlines, drafts, - and other text. -- **`scripts/`** (or `src`) This would be the location to keep your R - scripts for different analyses or plotting, and potentially a - separate folder for your functions (more on that later). - -You may want additional directories or subdirectories depending on -your project needs, but these should form the backbone of your working -directory. 
+### 作業ディレクトリの整理 + +プロジェクト全体で一貫性のあるフォルダ構造を使うことで、 +整理整頓がしやすくなり、将来的にファイルを探したり保管したりするのも簡単になります。 これは +複数のプロジェクトを抱えているときには特に役立つ。 一般的に、 +**スクリプト**、**データ**、**ドキュメント**用のディレクトリ(フォルダ)を作成することができます。 + +- **`data/`** このフォルダは、生データと、特定の分析のために作成する中間データセット + を保存するために使用します。 + 透明性と + [出所](https://en.wikipedia.org/wiki/Provenance)のために、 + _常に_ 生データのコピーにアクセスできるようにしておき、 + データのクリーンアップと前処理をできるだけプログラム的に(つまり、手作業ではなく + スクリプトで)行うべきである。 生データを + 加工データから切り離すのも良いアイデアだ。 例えば、 + `data/raw/tree_survey.plot1.txt`と`...plot2.txt`のファイルを、 + `scripts/01.preprocess.tree_survey.R`スクリプトによって生成された`data/processed/tree.survey.csv`ファイルとは別に + 保管できる。 +- **`documents/`** ここは、アウトライン、下書き、 + その他のテキストを保管する場所になります。 +- **`scripts/`** (または `src`) この場所には、さまざまな分析やプロット用の R + スクリプトを保存し、 + 関数用の別フォルダを作成することもできます(詳しくは後述します)。 + +あなたのプロジェクトの必要性に応じて、追加のディレクトリやサブディレクトリが必要になるかもしれないが、これらはあなたの作業用 +ディレクトリのバックボーンを形成するはずである。 ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/working-directory-structure.png") ``` -For this course, we will need a `data/` folder to store our raw data, -and we will use `data_output/` for when we learn how to export data as -CSV files, and `fig_output/` folder for the figures that we will save. +このコースでは、生データを保存するために `data/` フォルダが必要です。 +また、データを +CSV ファイルとしてエクスポートする方法を学ぶときに `data_output/` フォルダを使用し、保存する図のために `fig_output/` フォルダを使用します。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: create your project directory structure +## 課題:プロジェクトのディレクトリ構造を作る -Under the `Files` tab on the right of the screen, click on `New Folder` and -create a folder named `data` within your newly created working directory -(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at -your R console.) Repeat these operations to create a `data_output/` and a -`fig_output` folders.
+画面右側の`Files`タブで、`New Folder`をクリックし、 +、新しく作成した作業ディレクトリー +(例:`~/bioc-intro/data`)の中に`data`という名前のフォルダーを作成します。 (あるいは、Rのコンソール +で `dir.create("data")` とタイプする)。 これらの操作を繰り返して、`data_output/` と +`fig_output` フォルダーを作成する。 :::::::::::::::::::::::::::::::::::::::::::::::::: -We are going to keep the script in the root of our working directory -because we are only going to use one file and it will make things -easier. +スクリプトは作業ディレクトリのルート +に置くことにする。使用するのは1つのファイルだけだし、 +事が簡単になるからだ。 -Your working directory should now look like this: +作業ディレクトリはこのようになっているはずだ: ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` -**Project management** is also applicable to bioinformatics projects, -of course[^bioindatascience]. William Noble (@Noble:2009) proposes the -following directory structure: - -[^bioindatascience]: In this course, we consider bioinformatics as - data science applied to biological or bio-medical data. - -> Directory names are in large typeface, and filenames are in smaller -> typeface. Only a subset of the files are shown here. Note that the -> dates are formatted `<year>-<month>-<day>` so that they can be -> sorted in chronological order. The source code `src/ms-analysis.c` -> is compiled to create `bin/ms-analysis` and is documented in -> `doc/ms-analysis.html`. The `README` files in the data directories -> specify who downloaded the data files from what URL on what -> date. The driver script `results/2009-01-15/runall` automatically -> generates the three subdirectories split1, split2, and split3, -> corresponding to three cross-validation splits. The -> `bin/parse-sqt.py` script is called by both of the `runall` driver -> scripts. 
+**プロジェクト管理**は、もちろんバイオインフォマティクス・プロジェクトにも適用できる +[^bioindatascience]。 William Noble (@Noble:2009)は、 +以下のディレクトリ構造を提案している: + +[^bioindatascience]: このコースでは、バイオインフォマティクスを、 + 生物学的または生物医学的データに適用されるデータサイエンスと考える。 + +> ディレクトリ名は大きな書体で、ファイル名は小さな +> 書体で示されている。 ここに掲載したのは、ファイルの一部である。 なお、 +> 日付は、`<year>-<month>-<day>`というフォーマットになっているので、 +> 時系列順に並べ替えることができる。 ソースコード`src/ms-analysis.c`がコンパイルされて`bin/ms-analysis`が作成され、`doc/ms-analysis.html`に文書化されている。 データ・ディレクトリ +> にある`README`ファイルには、誰がどの URL から +> いつデータ・ファイルをダウンロードしたかが明記されている。 ドライバスクリプト`results/2009-01-15/runall`は自動的に +> 3つのサブディレクトリ split1、split2、split3 を生成する。 +> これらは3つのクロスバリデーション分割に対応する。 `bin/parse-sqt.py`スクリプトは`runall` ドライバ +> スクリプトの両方から呼び出される。 ```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} knitr::include_graphics("fig/noble-bioinfo-project.png") ``` -The most important aspect of a well defined and well documented -project directory is to enable someone unfamiliar with the -project[^futureself] to +よく定義され、よく文書化された +プロジェクトのディレクトリの最も重要な側面は、 +プロジェクト[^futureself]をよく知らない人が、次のことをできるようにすることである。 -1. understand what the project is about, what data are available, what - analyses were run, and what results were produced and, most - importantly to +1. どのようなプロジェクトなのか、どのようなデータが入手可能なのか、どのような + 分析が行われ、どのような結果が出たのかを理解すること、 + そして最も重要なこととして、 -2. repeat the analysis over again - with new data, or changing some - analysis parameters. +2. 新しいデータで、あるいは + 分析パラメーターの一部を変更して、分析を繰り返すこと。 -[^futureself]: That someone could be, and very likely will be your - future self, a couple of months or years after the analyses were - run. +[^futureself]: その誰かとは、 + 分析が実行された数カ月後、あるいは数年後の + 未来のあなた自身である可能性が高い。 -### The working directory +### 作業ディレクトリ -The working directory is an important concept to understand. It is the -place from where R will be looking for and saving the files.
When you -write code for your project, it should refer to files in relation to -the root of your working directory and only need files within this -structure. +作業ディレクトリは理解すべき重要な概念である。 Rがファイルを探したり保存したりする +場所である。 +プロジェクトのコードを書くときは、作業ディレクトリのルート +を基準にファイルを参照し、この +構造内のファイルだけを必要とするようにします。 -Using RStudio projects makes this easy and ensures that your working -directory is set properly. If you need to check it, you can use -`getwd()`. If for some reason your working directory is not what it -should be, you can change it in the RStudio interface by navigating in -the file browser where your working directory should be, and clicking -on the blue gear icon `More`, and select `Set As Working Directory`. -Alternatively you can use `setwd("/path/to/working/directory")` to -reset your working directory. However, your scripts should not include -this line because it will fail on someone else's computer. +RStudioプロジェクトを使用すると、この作業が簡単になり、作業 +ディレクトリが適切に設定されます。 もし確認する必要があれば、 +`getwd()` を使うことができる。 何らかの理由で作業ディレクトリが +あるべき場所になっていない場合は、RStudio のインターフェイスで +ファイルブラウザーで作業ディレクトリがあるべき場所に移動し、 +青い歯車のアイコン `More` をクリックし、`Set As Working Directory` を選択して変更することができます。 +あるいは、`setwd("/path/to/working/directory")`を使って、 +作業ディレクトリをリセットすることもできる。 しかし、あなたのスクリプトには、 +この行を含めるべきではありません。なぜなら、他の誰かのコンピューターで失敗してしまうからです。 -**Example** +**例** -The schema below represents the working directory `bioc-intro` with the -`data` and `fig_output` sub-directories, and 2 files in the latter: +以下のスキーマは作業ディレクトリ `bioc-intro` と +`data` と `fig_output` のサブディレクトリ、そして後者にある2つのファイルを表しています: ``` bioc-intro/data/ @@ -340,329 +331,324 @@ bioc-intro/data/ /fig_output/fig2.png ``` -If we were in the working directory, we could refer to the `fig1.pdf` -file using the relative path `bioc-intro/fig_output/fig1.pdf` or the -absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. - -If we were in the `data` directory, we would use the relative path -`../fig_output/fig1.pdf` or the same absolute path -`/home/user/bioc-intro/fig_output/fig1.pdf`.
- -## Interacting with R - -The basis of programming is that we write down instructions for the -computer to follow, and then we tell the computer to follow those -instructions. We write, or _code_, instructions in R because it is a -common language that both the computer and we can understand. We call -the instructions _commands_ and we tell the computer to follow the -instructions by _executing_ (also called _running_) those commands. - -There are two main ways of interacting with R: by using the -**console** or by using **scripts** (plain text files that contain -your code). The console pane (in RStudio, the bottom left panel) is -the place where commands written in the R language can be typed and -executed immediately by the computer. It is also where the results -will be shown for commands that have been executed. You can type -commands directly into the console and press `Enter` to execute those -commands, but they will be forgotten when you close the session. - -Because we want our code and workflow to be reproducible, it is better -to type the commands we want in the script editor, and save the -script. This way, there is a complete record of what we did, and -anyone (including our future selves!) can easily replicate the -results on their computer. Note, however, that merely typing the commands -in the script does not automatically _run_ them - they still need to -be sent to the console for execution. - -RStudio allows you to execute commands directly from the script editor -by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will -work, too). The command on the current line in the script (indicated -by the cursor) or all of the commands in the currently selected text -will be sent to the console and executed when you press `Ctrl` + -`Enter`. You can find other keyboard shortcuts in this RStudio -cheatsheet about the RStudio -IDE. 
- -At some point in your analysis you may want to check the content of a -variable or the structure of an object, without necessarily keeping a -record of it in your script. You can type these commands and execute -them directly in the console. RStudio provides the `Ctrl` + `1` and -`Ctrl` + `2` shortcuts allow you to jump between the script and the -console panes. - -If R is ready to accept commands, the R console shows a `>` prompt. If -it receives a command (by typing, copy-pasting or sending from the script -editor using `Ctrl` + `Enter`), R will try to execute it, and when -ready, will show the results and come back with a new `>` prompt to -wait for new commands. - -If R is still waiting for you to enter more data because it isn't -complete yet, the console will show a `+` prompt. It means that you -haven't finished entering a complete command. This is because you have -not 'closed' a parenthesis or quotation, i.e. you don't have the same -number of left-parentheses as right-parentheses, or the same number of -opening and closing quotation marks. When this happens, and you -thought you finished typing your command, click inside the console -window and press `Esc`; this will cancel the incomplete command and -return you to the `>` prompt. - -## How to learn more during and after the course? - -The material we cover during this course will give you an initial -taste of how you can use R to analyse data for your own -research. However, you will need to learn more to do advanced -operations such as cleaning your dataset, using statistical methods, -or creating beautiful graphics[^inthiscoure]. The best way to become -proficient and efficient at R, as with any other tool, is to use it to -address your actual research questions. As a beginner, it can feel -daunting to have to write a script from scratch, and given that many -people make their code available online, modifying existing code to -suit your purpose might make it easier for you to get started. 
- -[^inthiscoure]: We will introduce most of these (except statistics) - here, but will only manage to scratch the surface of the wealth of - what is possible to do with R. +もし作業ディレクトリにいれば、相対パス `bioc-intro/fig_output/fig1.pdf` か、 +絶対パス `/home/user/bioc-intro/fig_output/fig1.pdf` を使って `fig1.pdf` +ファイルを参照することができます。 + +もし `data` ディレクトリにいたとしたら、相対パス +`../fig_output/fig1.pdf` か、同じ絶対パス +`/home/user/bioc-intro/fig_output/fig1.pdf` を使うことになる。 + +## Rとの対話 + +プログラミングの基本は、私たちが +コンピュータが従うべき命令を書き記し、その +命令に従うようコンピュータに指示することである。 私たちがRで命令を書く、つまり_コード_を書くのは、それが +コンピューターも私たちも理解できる共通言語だからだ。 私たちはその命令 +を_コマンド_と呼び、それらのコマンドを_実行_(英語では _executing_ または _running_)することによって、 +命令に従うようにコンピュータに指示する。 + +Rと対話する主な方法は2つある: +**コンソール**を使う方法と、**スクリプト**( +あなたのコードを含むプレーンテキストファイル)を使う方法である。 コンソールペイン(RStudioでは左下のパネル)は、 +R言語で書かれたコマンドを入力し、 +コンピュータに即座に実行させる場所です。 また、 +実行されたコマンドの結果が表示される場所でもある。 コンソールに直接 +コマンドを入力し、`Enter`を押すことで、それらの +コマンドを実行することができますが、セッションを閉じるとそれらは失われます。 + +コードとワークフローを再現できるようにしたいので、 +スクリプトエディターで必要なコマンドを入力し、 +スクリプトを保存する方がよい。 こうすることで、私たちがしたことの完全な記録が残り、 +誰でも(未来の自分も含めて!) +結果を自分のコンピューターで簡単に再現できる。 ただし、スクリプトに +単にコマンドを入力しただけでは自動的に_実行_されないことに注意してほしい。 +コンソールに送信して実行させる必要がある。 + +RStudio ではスクリプトエディター +から `Ctrl` + `Enter` ショートカット(Mac では `Cmd` + `Return` で +も可)で直接コマンドを実行できます。 `Ctrl` + `Enter`を押すと、スクリプトの現在の行のコマンド(カーソルで +示される)、または現在選択されているテキスト +のすべてのコマンドがコンソールに送られ、実行される。 その他のキーボードショートカットは、RStudio +IDE に関するこの RStudio +チートシートを参照してください。 + +分析のある時点で、 +変数の内容やオブジェクトの構造をチェックしたくなるかもしれないが、必ずしもスクリプトに +その記録を残しておく必要はない。 これらのコマンドはコンソールで直接入力して +実行できる。 RStudio には `Ctrl` + `1` と +`Ctrl` + `2` のショートカットがあり、スクリプトと +コンソールのペイン間をジャンプすることができます。 + +Rがコマンドを受け付ける準備ができたら、Rコンソールに `>` プロンプトが表示される。 +コマンドを受信すると(タイプ、コピーペースト、またはスクリプト +エディターから `Ctrl` + `Enter` を使って送信)、R はそれを実行しようとします。 +準備ができると、結果を表示し、 +新しいコマンドを待つために新しい `>` プロンプトで戻ってきます。 + +コマンドがまだ完成しておらず、Rが +さらなる入力を待っている場合は、コンソールに`+`プロンプトが表示されます。 これは、 +完全なコマンドの入力が終わっていないことを意味する。 これは、 +括弧や引用符を「閉じて」いないからです。つまり、 +左括弧と右括弧の数や、 +開閉引用符の数が同じではないからです。 このようなことが起こり、 +コマンドを入力し終わったと思った場合は、コンソール
+ウィンドウ内をクリックし、`Esc`を押してください。これにより、不完全なコマンドがキャンセルされ、 +`>`プロンプトに戻ります。 + +## コース中やコース終了後にさらに学ぶには? + +このコースで扱う内容は、あなた自身の +研究のためにデータを分析するために R をどのように使うことができるかを、 +初めに体験していただくものです。 しかし、データセットのクリーニング、統計的手法の使用、 +、美しいグラフィックスの作成[^inthiscoure]など、 +の高度な操作を行うには、さらに学ぶ必要がある。 +Rに習熟し、効率的に使えるようになるための最良の方法は、他のツールと同様、Rを使って +実際の研究課題に取り組むことである。 初心者の場合、ゼロからスクリプトを書かなければならないのは、 +困難に感じるかもしれない。 +多くの人が自分のコードをオンラインで公開していることを考えると、 +自分の目的に合うように既存のコードを修正することで、簡単に始めることができるかもしれない。 + +[^inthiscoure]: ここでは、これらのほとんど(統計学を除く) + を紹介するが、Rで可能なこと + の富の表面をかすめることしかできない。 ```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} knitr::include_graphics("fig/kitten-try-things.jpg") ``` -## Seeking help +## 助けを求める -### Use the built-in RStudio help interface to search for more information on R functions +### RStudio 組み込みのヘルプインターフェイスを使用して、R 関数の詳細情報を検索します。 ```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudiohelp.png") ``` -One of the fastest ways to get help, is to use the RStudio help -interface. This panel by default can be found at the lower right hand -panel of RStudio. As seen in the screenshot, by typing the word -"Mean", RStudio tries to also give a number of suggestions that you -might be interested in. The description is then shown in the display -window. 
+RStudio ヘルプ +インターフェイスを使用するのが、ヘルプを得る最も早い方法の1つです。 このパネルはデフォルトで RStudio の右下 +パネルにあります。 スクリーンショットに見られるように、 +"Mean" という単語を入力すると、RStudioは +あなたが興味を持ちそうな候補もいくつか提示します。 説明文は表示 +ウィンドウに表示される。 -### I know the name of the function I want to use, but I'm not sure how to use it +### 使いたい関数の名前はわかっているが、その使い方がわからない。 -If you need help with a specific function, let's say `barplot()`, you -can type: +特定の関数、例えば`barplot()`のヘルプが必要な場合は、 +次のように入力します: ```{r, eval=FALSE, purl=TRUE} -?barplot +?barplot ``` -If you just need to remind yourself of the names of the arguments, you can use: +引数の名前を思い出す必要がある場合は、次のようにすればよい: ```{r, eval=FALSE, purl=TRUE} -args(lm) +args(lm) ``` -### I want to use a function that does X, there must be a function for it but I don't know which one... +### Xを行う関数を使いたい。そのための関数があるはずだが、どれなのかわからない...。 -If you are looking for a function to do a particular task, you can use the -`help.search()` function, which is called by the double question mark `??`. -However, this only looks through the installed packages for help pages with a -match to your search request +特定のタスクを実行する関数を探している場合は、 +`help.search()`関数を使用することができます。この関数は二重の疑問符 `??` で呼び出されます。 +しかし、これはインストールされているパッケージの中から、検索リクエストと +一致するヘルプページを探すだけです。 ```{r, eval=FALSE, purl=TRUE} -??kruskal +??kruskal ``` -If you can't find what you are looking for, you can use -the [rdocumentation.org](https://www.rdocumentation.org) website that searches -through the help files across all packages available. +探しているものが見つからない場合は、 +[rdocumentation.org](https://www.rdocumentation.org)のウェブサイトを使うことができます。このウェブサイトは、利用可能なすべてのパッケージのヘルプファイル +を検索します。 -Finally, a generic Google or internet search "R \<task>" will often either send -you to the appropriate package documentation or a helpful forum where someone -else has already asked your question. +最後に、一般的なグーグルやインターネット検索で "R \<task>" を検索すると、多くの場合、 +適切なパッケージ・ドキュメントにたどり着くか、 +他の誰かがすでに質問している有益なフォーラムにたどり着く。 -### I am stuck...
I get an error message that I don't understand +### 行き詰まってしまった...。 理解できないエラーメッセージが表示される -Start by googling the error message. However, this doesn't always work very well -because often, package developers rely on the error catching provided by R. You -end up with general error messages that might not be very helpful to diagnose a -problem (e.g. "subscript out of bounds"). If the message is very generic, you -might also include the name of the function or package you're using in your -query. +エラーメッセージをググることから始めよう。 ただし、これは必ずしもうまくいくとは限らない。 +というのも、多くの場合、パッケージ開発者はRが提供するエラー・キャッチに依存しているからである。 +その結果、一般的なエラー・メッセージが表示されることになるが、これは +問題を診断するのにあまり役に立たないかもしれない(例えば、"subscript out of bounds")。 メッセージが非常に一般的なものであれば、 +使用している関数やパッケージの名前を +クエリに含めることもできる。 -However, you should check Stack Overflow. Search using the `[r]` tag. Most -questions have already been answered, but the challenge is to use the right -words in the search to find the -answers: +しかし、Stack Overflowをチェックする必要がある。 `[r]`タグを使って検索する。 ほとんどの +質問にはすでに答えが出されているが、 +その答えを見つけるために、検索で適切な +言葉を使うことが課題である: [http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) -The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can -also be dense for people with little programming experience but it is a good -place to understand the underpinnings of the R language. - -The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical -but it is full of useful information. - -### Asking for help - -The key to receiving help from someone is for them to rapidly grasp -your problem. You should make it as easy as possible to pinpoint where -the issue might be. - -Try to use the correct words to describe your problem. For instance, a -package is not the same thing as a library. Most people will -understand what you meant, but others have really strong feelings -about the difference in meaning. The key point is that it can make -things confusing for people trying to help you.
Be as precise as -possible when describing your problem. - -If possible, try to reduce what doesn't work to a simple _reproducible -example_. If you can reproduce the problem using a very small data -frame instead of your 50000 rows and 10000 columns one, provide the -small one with the description of your problem. When appropriate, try -to generalise what you are doing so even people who are not in your -field can understand the question. For instance instead of using a -subset of your real dataset, create a small (3 columns, 5 rows) -generic one. For more information on how to write a reproducible -example see this article by Hadley -Wickham. - -To share an object with someone else, if it's relatively small, you -can use the function `dput()`. It will output R code that can be used -to recreate the exact same object as the one in memory: +[R言語入門](https://cran.r-project.org/doc/manuals/R-intro.pdf) +も、プログラミング経験の少ない人にとっては内容が濃いかもしれないが、R言語の基礎を理解するには良い +場所である。 + +[R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html)は密度が濃く技術的だが、 +有用な情報が満載である。 + +### 助けを求める + +誰かに助けてもらうために重要なのは、相手があなたの問題 +を素早く把握できることだ。 +どこに問題がありそうかを、できるだけ簡単に特定できるようにすべきだ。 + +あなたの問題を説明するために、正しい言葉を使うようにしてください。 例えば、 +パッケージはライブラリとは違う。 ほとんどの人は、 +あなたが言いたかったことを理解するだろうが、意味の違いについて本当に強い感情を持つ人もいる +。 重要なのは、言葉の誤用が +あなたを助けようとする人々を混乱させる可能性があるということだ。 +できるだけ正確に問題を説明してください。 + +可能であれば、うまくいかないことを単純な_再現可能な +例_にまで落とし込むようにする。 もし、あなたが50000行10000列のデータではなく、非常に小さなデータ +フレームを使って問題を再現できるのであれば、 +その小さなデータを問題の説明とともに提供してください。 適切な場合には、 +あなたのやっていることを一般化して、 +あなたの分野に詳しくない人でも質問を理解できるようにする。 例えば、 +実際のデータセットのサブセットを使う代わりに、 +小さな(3列、5行)一般的なものを作成する。 再現可能な +例の書き方については、Hadley +Wickhamによるこの記事を参照のこと。 + +オブジェクトを他の人と共有するには、それが比較的小さければ、 +関数 `dput()` を使うことができる。 +メモリ上のオブジェクトとまったく同じオブジェクトを再作成するために使用できるRコードが出力される: ```{r, results="show", purl=TRUE} -## iris is an example data frame that comes with R and head() is a -## function that returns the first part of the data frame +## irisはRに付属するデータフレームの例であり、 head()はデータフレームの最初の部分を返す +## 関数である dput(head(iris)) ``` -If
the object is larger, provide either the raw file (i.e., your CSV -file) with your script up to the point of the error (and after -removing everything that is not relevant to your -issue). Alternatively, in particular if your question is not related -to a data frame, you can save any R object to a file[^export]: +オブジェクトのサイズが大きい場合は、生ファイル(つまり、CSV +ファイル)と、エラーが発生した時点までのスクリプト(およびの問題に関係ないものをすべて削除した後のファイル)を提供してください。 あるいは、特にあなたの質問が +データフレームに関連していない場合は、任意のRオブジェクトをファイルに保存することができます[^export]: ```{r, eval=FALSE, purl=FALSE} saveRDS(iris, file="/tmp/iris.rds") ``` -The content of this file is however not human readable and cannot be -posted directly on Stack Overflow. Instead, it can be sent to someone -by email who can read it with the `readRDS()` command (here it is -assumed that the downloaded file is in a `Downloads` folder in the -user's home directory): +しかし、このファイルの内容は人間が読めるものではないので、 +Stack Overflowに直接投稿することはできません。 その代わりに、 +。その人は`readRDS()`コマンドでそのファイルを読むことができます(ここでは、 +、ダウンロードされたファイルは +、そのユーザーのホームディレクトリの`Downloads`フォルダにあると仮定しています): ```{r, eval=FALSE, purl=FALSE} some_data <- readRDS(file="~/Downloads/iris.rds") ``` -Last, but certainly not least, **always include the output of `sessionInfo()`** -as it provides critical information about your platform, the versions of R and -the packages that you are using, and other information that can be very helpful -to understand your problem. +最後になりますが、**必ず`sessionInfo()`** +の出力を含めるようにしてください。プラットフォーム、使用しているRと +パッケージのバージョン、その他問題を理解するのに非常に役立つ情報を提供してくれるからです。 ```{r, results="show", purl=TRUE} sessionInfo() ``` -### Where to ask for help? - -- The person sitting next to you during the course. Don't hesitate to - talk to your neighbour during the workshop, compare your answers, - and ask for help. -- Your friendly colleagues: if you know someone with more experience - than you, they might be able and willing to help you. 
-- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if - your question hasn't been answered before and is well crafted, - chances are you will get an answer in less than 5 min. Remember to - follow their guidelines on how to ask a good - question. -- The R-help mailing - list: it is read by a - lot of people (including most of the R core team), a lot of people - post to it, but the tone can be pretty dry, and it is not always - very welcoming to new users. If your question is valid, you are - likely to get an answer very fast but don't expect that it will come - with smiley faces. Also, here more than anywhere else, be sure to - use correct vocabulary (otherwise you might get an answer pointing - to the misuse of your words rather than answering your - question). You will also have more success if your question is about - a base function rather than a specific package. -- If your question is about a specific package, see if there is a - mailing list for it. Usually it's included in the DESCRIPTION file - of the package that can be accessed using - `packageDescription("name-of-package")`. You may also want to try to - email the author of the package directly, or open an issue on the - code repository (e.g., GitHub). -- There are also some topic-specific mailing lists (GIS, - phylogenetics, etc...), the complete list is - [here](https://www.r-project.org/mail.html). - -### More resources - -- The [Posting Guide](https://www.r-project.org/posting-guide.html) for - the R mailing lists. +### どこに助けを求めればいいのか? 
+ +- コース中、あなたの隣に座っている人。 + ワークショップ中に隣の人と話し、自分の答えを比較し、 + 助けを求めることをためらわないでください。 +- 友好的な同僚:もしあなたより経験豊富な人を知っていれば、 + あなたを助けてくれるかもしれない。 +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): + あなたの質問が過去に回答されたことがなく、よく練られたものであれば、 + 5分以内に回答が得られる可能性があります。 + 良い質問の仕方(how to ask a good + question)のガイドラインに従うことを忘れずに。 +- [R-help + メーリングリスト](https://stat.ethz.ch/mailman/listinfo/r-help): + 多くの人に読まれていて(Rコアチームのほとんどを含む)、多くの人が + 投稿していますが、口調はかなり素っ気なく、 + 新しいユーザーを必ずしも歓迎しているとは限りません。 あなたの質問が正当なものであれば、 + すぐに回答が返ってくる可能性が高いが、 + スマイルマーク付きで返ってくるとは思わないこと。 また、ここでは他のどこよりも、 + 正しい語彙を使うようにしましょう(そうしないと、 + 質問に答えるのではなく、 + あなたの言葉の誤用を指摘する答えが返ってくるかもしれません)。 また、質問内容が特定のパッケージではなく、 + ベースとなる関数に関するものであれば、より成功しやすいでしょう。 +- 質問が特定のパッケージに関するものであれば、そのパッケージのメーリングリスト + があるかどうか確認してください。 通常は + `packageDescription("name-of-package")` を使ってアクセスできるパッケージのDESCRIPTIONファイル + に含まれています。 また、 + そのパッケージの作者に直接メールを送ってみたり、 + コードリポジトリ(例:GitHub)にissueを開いてみるのもよいだろう。 +- また、トピックに特化したメーリングリスト(GIS、 + 系統遺伝学など)もある。全リストは + [こちら](https://www.r-project.org/mail.html)。 + +### その他のリソース + +- Rメーリングリストの[投稿ガイド](https://www.r-project.org/posting-guide.html)。 - How to ask for R help - useful guidelines. + 役立つガイドライン。 -- This blog post by Jon - Skeet - has quite comprehensive advice on how to ask programming questions. +- Jon + Skeetによるこのブログ記事には、 + プログラミングの質問の仕方について、かなり包括的なアドバイスがある。 -- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package - is very helpful to create reproducible examples when asking for - help. The rOpenSci community call "How to ask questions so they get - answered" (Github +- [reprex](https://cran.rstudio.com/web/packages/reprex/)パッケージ + は、 + ヘルプを求めるときに、再現可能な例を作成するのに非常に役立ちます。 + rOpenSciコミュニティコール "How to ask questions so they get answered" (Github link and video - recording) includes a presentation of - the reprex package and of its philosophy. + recording) には、 + reprexパッケージとその哲学のプレゼンテーションが含まれています。 -## R packages +## Rパッケージ -### Loading packages +### パッケージの読み込み -As we have seen above, R packages play a fundamental role in R.
The -make use of a package's functionality, assuming it is installed, we -first need to load it to be able to use it. This is done with the -`library()` function. Below, we load `ggplot2`. +上で見てきたように、RパッケージはRの基本的な役割を担っている。 +、パッケージがインストールされていることを前提に、パッケージの機能を利用する。 +、それを利用できるようにするには、まずパッケージをロードする必要がある。 これは +`library()`関数で行う。 以下に `ggplot2` をロードする。 ```{r loadp, eval=FALSE, purl=TRUE} library("ggplot2") ``` -### Installing packages +### パッケージのインストール -The default package repository is The _Comprehensive R Archive -Network_ (CRAN), and any package that is available on CRAN can be -installed with the `install.packages()` function. Below, for example, -we install the `dplyr` package that we will learn about later. +デフォルトのパッケージリポジトリは The _Comprehensive R Archive +Network_ (CRAN) で、CRAN で利用可能なパッケージは `install.packages()` 関数で +インストールできます。 例えば、 +、後で説明する `dplyr` パッケージをインストールする。 ```{r craninstall, eval=FALSE, purl=TRUE} install.packages("dplyr") ``` -This command will install the `dplyr` package as well as all its -dependencies, i.e. all the packages that it relies on to function. +このコマンドは、`dplyr` パッケージと、その +依存パッケージ、つまり、そのパッケージが機能するために依存しているすべてのパッケージをインストールします。 -Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, -namely `BiocManager`, that can be installed from CRAN with +もう一つの主要なRパッケージのリポジトリは、Bioconductorによって管理されている。 [Bioconductorパッケージ](https://bioconductor.org/packages/release/BiocViews.html#___Software) は、専用のパッケージ、 +すなわち `BiocManager` を使用して管理およびインストールされます。 ```{r, eval=FALSE, purl=TRUE} install.packages("BiocManager") ``` -Individual packages such as `SummarizedExperiment` (we will use it -later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be -installed with `BiocManager::install`. 
+`SummarizedExperiment`(後で +使用します)、`DESeq2`(RNA-Seq解析用)、その他BioconductorやCRANにあるパッケージは、`BiocManager::install`で +インストールできます。 ```{r, eval=FALSE, purl=TRUE} BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` -By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. +デフォルトでは、`BiocManager::install()` はインストールされているすべてのパッケージをチェックし、新しいバージョンがあるかどうかも確認します。 もしあれば、それが表示され、`Update all/some/none? [a/s/n]:` と尋ねて、あなたの答えを待ちます。 パッケージのバージョンは最新のものを用意するよう努力すべきですが、実際には、パッケージがロードされる前の新鮮なRセッションでのみパッケージを更新することをお勧めします。 :::::::::::::::::::::::::::::::::::::::: keypoints -- Start using R and RStudio +- RとRStudioを使い始める :::::::::::::::::::::::::::::::::::::::::::::::::: From 84e473fa3868dd9c2c1c6cdd6a2d8aa72e7ba51e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 09:37:58 +0900 Subject: [PATCH 166/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 738 +++++++++---------- 1 file changed, 364 insertions(+), 374 deletions(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index 2387ed393..e0cf3ff3a 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Starting with data +title: データから始める teaching: 30 exercises: 30 --- @@ -13,7 +13,7 @@ exercises: 30 - `data.frame` が何なのか説明してみましょう。 - .csv ファイルからデータ フレームに外部データを読み込みましょう。 - データフレームの内容を要約してみましょう。 -- Describe what a factor is. +- factor とは何か説明してみましょう。
- string と factor を変換してみましょう。 - factor の並び替えとリネームを行ってみましょう。 - 日付をフォーマットしてみましょう。 @@ -23,100 +23,98 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::: questions -- First data analysis in R +- Rによる最初のデータ分析 :::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. - -## Presentation of the gene expression data - -We are going to use part of the data published by Blackmore , _The -effect of upper-respiratory infection on transcriptomic changes in the -CNS_. The goal of the study was to determine the effect of an -upper-respiratory infection on changes in RNA transcription occurring -in the cerebellum and spinal cord post infection. Gender matched eight -week old C57BL/6 mice were inoculated with saline or with Influenza A by -intranasal route and transcriptomic changes in the cerebellum and -spinal cord tissues were evaluated by RNA-seq at days 0 -(non-infected), 4 and 8. - -The dataset is stored as a comma-separated values (CSV) file. Each row -holds information for a single RNA expression measurement, and the first eleven -columns represent: - -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | -| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | -| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. 
| -| mouse | The mouse unique identifier. | - -We are going to use the R function `download.file()` to download the -CSV file that contains the gene expression data, and we will use -`read.csv()` to load into memory the content of the CSV file as an -object of class `data.frame`. Inside the `download.file` command, the -first entry is a character string with the source URL. This source URL -downloads a CSV file from a GitHub repository. The text after the -comma (`"data/rnaseq.csv"`) is the destination of the file on your -local machine. You'll need to have a folder on your machine called -`"data"` where you'll download the file. So this command downloads the -remote file, names it `"rnaseq.csv"` and adds it to a preexisting -folder named `"data"`. +> このエピソードは、Data Carpentriesの_Data Analysis and +> Visualisation in R for Ecologists_レッスンに基づいています。 + +## 遺伝子発現データのプレゼンテーション + +Blackmore _et al._ +(2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5544260/), _The +effect of upper-respiratory infection on transcriptomic changes in +CNS_ によって発表されたデータの一部を使用する予定である。 研究の目的は、 +上部呼吸器感染症が、感染後の小脳と脊髄で +起こるRNA転写の変化に及ぼす影響を調べることであった。 性別を一致させた8匹の +週齢のC57BL/6マウスに、生理食塩水または +鼻腔内経路でインフルエンザAを接種し、0日目 +(非感染)、4日目、8日目に小脳と +脊髄組織におけるトランスクリプトーム変化をRNA-seqで評価した。 + +データセットは、カンマ区切りの値(CSV)ファイルとして保存される。 各行 +は1つのRNA発現測定の情報を持ち、最初の11列 +はそれを表している: + +| コラム | 説明 | +| ---- | ---------------------------------------- | +| 遺伝子 | 測定された遺伝子名 | +| サンプル | 遺伝子発現を測定したサンプル名 | +| 表現 | 遺伝子発現の値 | +| 有機体 | 生物/種 - ここではすべてのデータはマウスに由来する | +| 年齢 | マウスの年齢(ここではすべてのマウスが8週齢であった) | +| セックス | マウスの性別 | +| 感染症 | マウスの感染状態、すなわちA型インフルエンザに感染しているか、感染していないか。 | +| 緊張 | インフルエンザA型。 | +| 時間 | 感染期間(日単位)。 | +| 組織 | 遺伝子発現実験に使用した組織、すなわち小脳または脊髄。 | +| マウス | マウス固有の識別子。 | + +R関数の`download.file()`を使って遺伝子発現データを含む +CSVファイルをダウンロードし、 +`read.csv()` を使ってCSVファイルの内容を +`data.frame`クラスのオブジェクトとしてメモリにロードする。 download.file`コマンドの内部では、 +の最初のエントリーは、ソースURLの文字列である。 このソースURL +はGitHubリポジトリからCSVファイルをダウンロードします。 +、カンマ(`"data/rnaseq.csv"`)の後のテキストは、 +、ローカルマシン上のファイルの保存先です。 
あなたのマシンに +`"data"`というフォルダを用意し、そこにファイルをダウンロードする必要があります。 そこで、このコマンドは +リモートファイルをダウンロードし、`"rnaseq.csv"` という名前を付けて、`"data"\` という名前の +フォルダに追加する。 ```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") ``` -You are now ready to load the data: +これでデータをロードする準備ができた: ```{r, eval=TRUE, purl=TRUE} rna <- read.csv("data/rnaseq.csv") ``` -This statement doesn't produce any output because, as you might -recall, assignments don't display anything. If we want to check that -our data has been loaded, we can see the contents of the data frame by -typing its name: +、代入は何も表示しないからだ。 +、データがロードされたことを確認したい場合は、 +、その名前をタイプすることでデータフレームの中身を見ることができる: ```{r, eval=FALSE} -rna +RNA ``` -Wow... that was a lot of output. At least it means the data loaded -properly. Let's check the top (the first 6 lines) of this data frame -using the function `head()`: +うわぁ...。 多くのアウトプットがあった。 少なくとも、 +。 関数 `head()` を使って、このデータ・フレーム +の先頭(最初の6行)をチェックしてみよう: ```{r, purl=TRUE} head(rna) -## Try also -## View(rna) -``` - -**Note** - -`read.csv()` assumes that fields are delineated by commas, however, in -several countries, the comma is used as a decimal separator and the -semicolon (;) is used as a field delineator. If you want to read in -this type of files in R, you can use the `read.csv2()` function. It -behaves exactly like `read.csv()` but uses different parameters for -the decimal and the field separators. If you are working with another -format, they can be both specified by the user. Check out the help for -`read.csv()` by typing `?read.csv` to learn more. There is also the -`read.delim()` function for reading tab separated data files. It is important to -note that all of these functions are actually wrapper functions for -the main `read.table()` function with different arguments. As such, -the data above could have also been loaded by using `read.table()` -with the separation argument as `,`. 
The code is as follows: +## +## View(rna)も試してみる。 +``` + +**注**\*。 + +read.csv()`は、フィールドがカンマで区切られていると仮定しているが、 +いくつかの国では、カンマは小数の区切り文字として使用され、 +セミコロン(;)はフィールドの区切り文字として使用される。 Rでこの種のファイルを +、 `read.csv2()`関数を使うことができる。 +`read.csv()`と全く同じ動作をするが、 +小数とフィールドのセパレーターに異なるパラメーターを使用する。 別の +フォーマットを使用している場合は、ユーザーが両方指定することができます。 詳しくは、 +`read.csv()`のヘルプを`?read.csv`と入力して確認してください。 また、`read.delim()`関数があり、タブ区切りのデータファイルを読み込むことができる。 +重要なことは、これらの関数はすべて、 +メインの`read.table()` 関数に異なる引数を指定するためのラッパー関数であるということです。 そのため、 +上のデータは、`read.table()`、区切りの引数を`,\` にしてロードすることもできた。 コードは以下の通り: ```{r, eval=TRUE, purl=TRUE} rna <- read.table(file = "data/rnaseq.csv", @@ -124,220 +122,217 @@ rna <- read.table(file = "data/rnaseq.csv", header = TRUE) ``` -The header argument has to be set to TRUE to be able to read the -headers as by default `read.table()` has the header argument set to -FALSE. +デフォルトでは `read.table()` の header 引数は +FALSE に設定されているので、 +のヘッダーを読むためには header 引数を TRUE に設定しなければならない。 -## What are data frames? +## データフレームとは? -Data frames are the _de facto_ data structure for most tabular data, -and what we use for statistics and plotting. +データ・フレームは、ほとんどの表データ、 +、統計やプロットに使われる_事実上の_データ構造である。 -A data frame can be created by hand, but most commonly they are -generated by the functions `read.csv()` or `read.table()`; in other -words, when importing spreadsheets from your hard drive (or the web). +データフレームは手作業で作成することもできますが、最も一般的なのは、関数 `read.csv()` や `read.table()` によって生成される +データフレームです。 -A data frame is the representation of data in the format of a table -where the columns are vectors that all have the same length. Because -columns are vectors, each column must contain a single type of data -(e.g., characters, integers, factors). For example, here is a figure -depicting a data frame comprising a numeric, a character, and a -logical vector. 
+データフレームとは、 +、列がすべて同じ長さのベクトルである表の形式でデータを表現したものである。 +、列はベクトルであるため、各列は1種類のデータ +(文字、整数、因子など)を含まなければならない。 例えば、 +、数値、文字、 +論理ベクトルからなるデータフレームを示す図である。 ![](./fig/data-frame.svg) -We can see this when inspecting the <b>str</b>ucture of a data frame -with the function `str()`: +str()\`という関数で +: ```{r} str(rna) ``` -## Inspecting `data.frame` Objects +## data.frame\` オブジェクトの検査 -We already saw how the functions `head()` and `str()` can be useful to -check the content and the structure of a data frame. Here is a -non-exhaustive list of functions to get a sense of the -content/structure of the data. Let's try them out! +関数 `head()` と `str()` が、 +データフレームの内容と構造をチェックするのに便利であることは、すでに説明した。 以下は、 +データの内容/構造を知るための、 +非網羅的な機能のリストである。 試してみよう! -**Size**: +\*\*サイズ -- `dim(rna)` - returns a vector with the number of rows as the first - element, and the number of columns as the second element (the - **dim**ensions of the object). -- `nrow(rna)` - returns the number of rows. -- `ncol(rna)` - returns the number of columns. +- dim(rna)\` - 行数を最初の + 要素とし、列数を2番目の要素(オブジェクトの + **dim**ensions )とするベクトルを返す。 +- nrow(rna)\` - 行の数を返す。 +- ncol(rna)\` - 列数を返す。 -**Content**: +\*\*内容 -- `head(rna)` - shows the first 6 rows. -- `tail(rna)` - shows the last 6 rows. +- head(rna)\` - 最初の6行を表示する。 +- tail(rna)\` - 最後の6行を表示する。 -**Names**: +**名前**: -- `names(rna)` - returns the column names (synonym of `colnames()` for - `data.frame` objects). -- `rownames(rna)` - returns the row names. +- names(rna)`- 列名を返す(`data.frame`オブジェクトの`colnames()\` と同義)。 +- rownames(rna)\` - 行の名前を返す。 -**Summary**: +**要約**: -- `str(rna)` - structure of the object and information about the - class, length and content of each column. -- `summary(rna)` - summary statistics for each column. +- str(rna)\` - オブジェクトの構造と、 + クラス、各カラムの長さと内容に関する情報。 +- `summary(rna)` - 各カラムの要約統計量。 -Note: most of these functions are "generic", they can be used on other types of -objects besides `data.frame`. 
+注:これらの関数のほとんどは "ジェネリック "であり、`data.frame`以外の +オブジェクトにも使用できます。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジだ: -Based on the output of `str(rna)`, can you answer the following -questions? +str(rna)\`の出力に基づいて、以下の +の質問に答えられるか? -- What is the class of the object `rna`? -- How many rows and how many columns are in this object? +- オブジェクト `rna` のクラスは何ですか? +- このオブジェクトにはいくつの行といくつの列がありますか? -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション -- class: data frame -- how many rows: 66465, how many columns: 11 +- クラス: データ・フレーム +- 行数:66465、列数:11:11 ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -## Indexing and subsetting data frames +## データフレームのインデックス化とサブセット化 -Our `rna` data frame has rows and columns (it has 2 dimensions); if we -want to extract some specific data from it, we need to specify the -"coordinates" we want. Row numbers come first, followed by -column numbers. However, note that different ways of specifying these -coordinates lead to results with different classes. 
+`rna` データフレームには行と列がある(2次元ある)。 +そこから特定のデータを抽出したい場合は、 +「座標」を指定する必要がある。 行番号が最初に来て、 +列番号がそれに続く。 しかし、これらの +座標を指定する方法が異なれば、異なるクラスの結果が得られることに注意されたい。 ```{r, eval=FALSE, purl=TRUE} -# first element in the first column of the data frame (as a vector) +# データフレームの1列目の最初の要素(ベクトルとして) rna[1, 1] -# first element in the 6th column (as a vector) +# 6列目の最初の要素(ベクトルとして) rna[1, 6] -# first column of the data frame (as a vector) +# データフレームの1列目(ベクトルとして) rna[, 1] -# first column of the data frame (as a data.frame) +# データフレームの1列目(data.frameとして) rna[1] -# first three elements in the 7th column (as a vector) +# 7列目の最初の3要素(ベクトルとして) rna[1:3, 7] -# the 3rd row of the data frame (as a data.frame) +# データフレームの3行目(data.frameとして) rna[3, ] -# equivalent to head_rna <- head(rna) +# head_rna <- head(rna) と等価 head_rna <- rna[1:6, ] head_rna ``` -`:` is a special function that creates numeric vectors of integers in -increasing or decreasing order, test `1:10` and `10:1` for -instance. See section @ref(sec:genvec) for details. 
+`:` は、昇順または降順の +整数からなる数値ベクトルを作成する特別な関数である。例えば `1:10` と `10:1` を +試してみよう。 詳しくは@ref(sec:genvec)を参照のこと。 -You can also exclude certain indices of a data frame using the "`-`" sign: +また、「`-`」記号を使ってデータフレームの特定のインデックスを除外することもできる: ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## The whole data frame, except the first column -rna[-c(7:66465), ] ## Equivalent to head(rna) +rna[, -1] ## 最初の列を除いたデータフレーム全体 +rna[-c(7:66465), ] ## head(rna)と等価 ``` -Data frames can be subsetted by calling indices (as shown previously), -but also by calling their column names directly: +データフレームは、インデックス(前に示したように)だけでなく、 +列名を直接呼び出してサブセットすることもできる: ```{r, eval=FALSE, purl=TRUE} -rna["gene"] # Result is a data.frame -rna[, "gene"] # Result is a vector -rna[["gene"]] # Result is a vector -rna$gene # Result is a vector +rna["gene"] # 結果は data.frame +rna[, "gene"] # 結果はベクトル +rna[["gene"]] # 結果はベクトル +rna$gene # 結果はベクトル ``` -In RStudio, you can use the autocompletion feature to get the full and -correct names of the columns. 
+RStudio では、オートコンプリート機能を使用して、列の完全で +正しい名前を取得できます。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -1. Create a `data.frame` (`rna_200`) containing only the data in - row 200 of the `rna` dataset. +1. データセット `rna` の + 200 行目のデータのみを含む `data.frame` (`rna_200`) を作成する。 -2. Notice how `nrow()` gave you the number of rows in a `data.frame`? +2. `nrow()` が `data.frame` の行数を返していたことに気づいただろうか? -- Use that number to pull out just that last row in the initial - `rna` data frame. +- この数字を使って、最初の + `rna` データフレームの最後の行だけを取り出す。 -- Compare that with what you see as the last row using `tail()` to - make sure it's meeting expectations. +- `tail()` で表示される最後の行と比較し、 + 期待どおりであることを確認する。 -- Pull out that last row using `nrow()` instead of the row number. +- 行番号の代わりに `nrow()` を使って最後の行を取り出す。 -- Create a new data frame (`rna_last`) from that last row. +- 最後の行から新しいデータフレーム(`rna_last`)を作成する。 -3. Use `nrow()` to extract the row that is in the middle of the - `rna` dataframe. Store the content of this row in an object - named `rna_middle`. +3. `rna` データフレームの中央にある行を抽出するには `nrow()` を使用する。 この行の内容をオブジェクト + `rna_middle` に格納する。 -4. Combine `nrow()` with the `-` notation above to reproduce the - behavior of `head(rna)`, keeping just the first through 6th - rows of the rna dataset. +4. `nrow()` と上記の `-` 表記を組み合わせて、rna データセットの1行目から6行目までの + 行だけを保持し、`head(rna)` の + 挙動を再現する。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解答 ```{r, purl=TRUE} -## 1. +## 1. rna_200 <- rna[200, ] ## 2. -## Saving `n_rows` to improve readability and reduce duplication +## 可読性を高め、重複を減らすために `n_rows` を保存しておく n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. rna_middle <- rna[n_rows / 2, ] ## 4. -rna_head <- rna[-(7:n_rows), ] +rna_head <- rna[-(7:n_rows), ] ``` ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -## Factors +## 因子 -Factors represent **categorical data**. 
They are stored as integers -associated with labels and they can be ordered or unordered. 
While -factors look (and often behave) like character vectors, they are -actually treated as integer vectors by R. So you need to be very -careful when treating them as strings. +因子は**カテゴリーデータ**を表す。 これらは、 +ラベルに関連付けられた整数として格納され、順序付けされたものであっても、順序付けされていないものであってもよい。 +因子は文字ベクトルのように見える(そしてしばしば振舞う)が、 +実際にはRでは整数ベクトルとして扱われる。そのため、文字列として扱う場合は非常に +注意する必要がある。 -Once created, factors can only contain a pre-defined set of values, -known as _levels_. By default, R always sorts levels in alphabetical -order. For instance, if you have a factor with 2 levels: +いったん作成された因子は、あらかじめ定義された値のセット +(_レベル_として知られている)しか含むことができない。 デフォルトでは、Rは常にレベルをアルファベット順 +でソートする。 例えば、2つのレベルを持つ因子があるとする: ```{r, purl=TRUE} -sex <- factor(c("male", "female", "female", "male", "female")) +sex <- factor(c("male", "female", "female", "male", "female")) ``` -R will assign `1` to the level `"female"` and `2` to the level -`"male"` (because `f` comes before `m`, even though the first element -in this vector is `"male"`). You can see this by using the function -`levels()` and you can find the number of levels using `nlevels()`: +Rは`1`をレベル`"female"`に、`2`をレベル +`"male"`に割り当てる(このベクトルの最初の要素 +が`"male"`であるにもかかわらず、`f`が`m`の前に来るため)。 これは、 +`levels()` という関数を使うことで見ることができ、`nlevels()` を使えばレベル数を知ることができる: ```{r, purl=TRUE} levels(sex) nlevels(sex) ``` -Sometimes, the order of the factors does not matter, other times you -might want to specify the order because it is meaningful (e.g., "low", -"medium", "high"), it improves your visualization, or it is required -by a particular type of analysis. 
Here, one way to reorder our levels -in the `sex` vector would be: +因子の順番が重要でない場合もあるが、 +意味がある(例えば、"low"、 +"medium"、"high")、視覚化が向上する、または特定のタイプの分析で必要である +ため、順番を指定したい場合もある。 ここで、 +`sex`ベクトルでレベルを並べ替える一つの方法は次のようになる: ```{r, purl=TRUE} sex ## current order @@ -345,30 +340,30 @@ sex <- factor(sex, levels = c("male", "female")) sex ## after re-ordering ``` -In R's memory, these factors are represented by integers (1, 2, 3), -but are more informative than integers because factors are self -describing: `"female"`, `"male"` is more descriptive than `1`, -`2`. Which one is "male"? You wouldn't be able to tell just from the -integer data. Factors, on the other hand, have this information built-in. -It is particularly helpful when there are many levels (like the -gene biotype in our example dataset). +Rのメモリ上では、これらの因子は整数(1, 2, 3) +で表現されるが、因子は自己 +記述的であるため、整数よりも情報量が多い。`"female"`、`"male"`は`1`、 +`2`よりも説明的である。 どちらが "male" なのか、 +整数データだけではわからないだろう。 一方、因子はこの情報を内蔵している。 +特に、レベルが多い場合(例のデータセットの +遺伝子バイオタイプのような)に便利である。 -When your data is stored as a factor, you can use the `plot()` -function to get a quick glance at the number of observations -represented by each factor level. Let's look at the number of males -and females in our data. +データが因子として格納されているとき、各因子レベルによって表現されるオブザベーションの数 +を素早く見るために、 `plot()` +関数を使うことができます。 データ中の男性 +と女性の数を見てみよう。 ```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -plot(sex) +plot(sex) ``` -### Converting to character +### 文字への変換 -If you need to convert a factor to a character vector, you use +因子を文字ベクトルに変換する必要がある場合は、 `as.character(x)`. 
```{r, purl=TRUE} -as.character(sex) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -409,10 +404,10 @@ as.character(sex) <!-- vector `year_fct` inside the square brackets --> -### Renaming factors +### 因子の名称変更 -If we want to rename these factor, it is sufficient to change its -levels: +これらの因子の名前を変えたい場合は、 +そのレベルを変更すれば十分である: ```{r, purl=TRUE} levels(sex) @@ -423,13 +418,13 @@ plot(sex) :::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジ: -- Rename "F" and "M" to "Female" and "Male" respectively. +- "F" と "M" の名前をそれぞれ "Female" と "Male" に変更する。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解答 ```{r, eval=TRUE, purl=TRUE} levels(sex) @@ -442,27 +437,27 @@ levels(sex) <- c("Male", "Female") ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジ: -We have seen how data frames are created when using `read.csv()`, but -they can also be created by hand with the `data.frame()` function. -There are a few mistakes in this hand-crafted `data.frame`. Can you -spot and fix them? Don't hesitate to experiment! 
+`read.csv()` を使ってデータフレームを作成する方法を見てきましたが、 +`data.frame()` 関数を使って手作業で作成することもできます。 +この手作りの `data.frame` にはいくつか間違いがある。 +それを見つけて修正することはできますか? ためらわずに実験してみよう! ```{r, eval=FALSE} animal_data <- data.frame( - animal = c(dog, cat, sea cucumber, sea urchin), + animal = c(dog, cat, sea cucumber, sea urchin), feel = c("furry", "squishy", "spiny"), weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解答 -- missing quotations around the names of the animals -- missing one entry in the "feel" column (probably for one of the furry animals) -- missing one comma in the weight column +- 動物の名前の周りに引用符がない +- "feel" 列に1つ記入がない(おそらく毛皮の動物の1つ)。 +- weight 列のコンマが1つ足りない ::::::::::::::::::::::::: @@ -470,39 +465,39 @@ animal_data <- data.frame( ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジ: -Can you predict the class for each of the columns in the following -example? 
+次の +例で、各列のクラスを予測できますか? -Check your guesses using `str(country_climate)`: +`str(country_climate)` を使って推測をチェックする: -- Are they what you expected? Why? Why not? +- 期待通りでしたか? なぜそうなった(あるいは、ならなかった)のでしょうか? -- Try again by adding `stringsAsFactors = TRUE` after the last - variable when creating the data frame. What is happening now? - `stringsAsFactors` can also be set when reading text-based - spreadsheets into R using `read.csv()`. +- データフレームを作成する際に、最後の + 変数の後に `stringsAsFactors = TRUE` を追加してもう一度試してみてください。 今度は何が起きますか? + `stringsAsFactors` は、`read.csv()` を使ってテキストベースの + スプレッドシートをRに読み込むときにも設定できる。 ```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), + climate = c("cold", "hot", "temperate", "hot/temperate"), temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) - ) +) ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解答 ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( country = c("Canada", "Panama", "South Africa", "Australia"), climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), + temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) @@ -513,58 +508,58 @@ str(country_climate) :::::::::::::::::::::::::::::::::::::::::::::::::: -The automatic conversion of data type is sometimes a blessing, sometimes an -annoyance. Be aware that it exists, learn the rules, and double check that data -you import in R are of the correct type within your data frame. If not, use it -to your advantage to detect mistakes that might have been introduced during data -entry (a letter in a column that should only contain numbers for instance). 
+データ型の自動変換は、時に恵みであり、時に +迷惑である。 その存在を認識し、ルールを学び、Rでインポートするデータ +がデータフレーム内で正しい型であることを再確認すること。 そうでない場合は、 +データ入力中に生じたかもしれないミス(例えば、数字しか入っていないはずの列に文字が入っている)を検出するために、 +を活用する。 -Learn more in this RStudio -tutorial +詳しくはRStudio +チュートリアルをご覧ください。 -## Matrices +## マトリックス -Before proceeding, now that we have learnt about data frames, let's -recap package installation and learn about a new data type, namely the -`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and -columns. But the major difference is that all cells in a `matrix` must -be of the same type: `numeric`, `character`, `logical`, ... In that -respect, matrices are closer to a `vector` than a `data.frame`. +先に進む前に、データ・フレームについて学んだので、 +パッケージのインストールを復習し、新しいデータ型、すなわち +`matrix` について学んでみよう。 data.frame`のように、行列は行と +列の2つの次元を持つ。 しかし大きな違いは、`行列`のすべてのセルは +同じ型でなければならないということである:numeric`、`character`、`logical`、... +その点で、行列は `data.frame` よりも `vector` に近い。 -The default constructor for a matrix is `matrix`. It takes a vector of -values to populate the matrix and the number of row and/or -columns[^ncol]. The values are sorted along the columns, as illustrated -below. +行列のデフォルトコンストラクタは `matrix` である。 行列を構成するための +の値のベクトルと、行および/または +の列数[^ncol]を取る。 下の図( +)のように、値は列に沿ってソートされる。 ```{r mat1, purl=TRUE} m <- matrix(1:9, ncol = 3, nrow = 3) m ``` -[^ncol]: Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. +[^ncol]: 行数か列数のどちらかだけで十分で、もう一方は値の長さから推測できる。 値と行/列の数が合わない場合に何が起こるか試してみてください。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジだ: -Using the function `installed.packages()`, create a `character` matrix -containing the information about all packages currently installed on -your computer. Explore it. 
+`installed.packages()` という関数を使って、 +あなたのコンピューターに現在インストールされているすべてのパッケージの情報を含む `character` 行列 +を作成します。 中身を調べてみよう。 -::::::::::::::: solution +::::::::::::::: solution -## Solution: +## 解答 ```{r pkg_sln, eval=FALSE, purl=TRUE} -## create the matrix +## 行列を作成する ip <- installed.packages() head(ip) -## try also View(ip) -## number of package +## View(ip) も試してみる +## パッケージの数 nrow(ip) -## names of all installed packages +## インストールされている全てのパッケージの名前 rownames(ip) -## type of information we have about each package +## 各パッケージに関する情報の種類 colnames(ip) ``` @@ -572,25 +567,24 @@ colnames(ip) :::::::::::::::::::::::::::::::::::::::::::::::::: -It is often useful to create large random data matrices as test -data. The exercise below asks you to create such a matrix with random -data drawn from a normal distribution of mean 0 and standard deviation -1, which can be done with the `rnorm()` function. +テストデータとして、大規模なランダムデータ行列を作成することはしばしば有用である。 以下の練習問題は、平均0、標準偏差 +1の正規分布から無作為に +データを抽出して、そのような行列を作成するものです。これは `rnorm()` 関数で行うことができます。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジ: -Construct a matrix of dimension 1000 by 3 of normally distributed data -(mean 0, standard deviation 1) +正規分布データ +(平均0、標準偏差1)からなる1000行×3列の行列を作る。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解答 ```{r rnormmat_sln, purl=TRUE} set.seed(123) -m <- matrix(rnorm(3000), ncol = 3) +m <- matrix(rnorm(3000), ncol = 3) dim(m) head(m) ``` @@ -599,186 +593,182 @@ head(m) :::::::::::::::::::::::::::::::::::::::::::::::::: -## Formatting Dates +## 日付の書式設定 -One of the most common issues that new (and experienced!) R users have -is converting date and time information into a variable that is -appropriate and usable during analyses. 
+R の初心者(そして経験者も!)が抱える最も一般的な問題の1つは、 +日付と時刻の情報を、 +分析で適切に使用できる変数に変換することである。 -### Note on dates in spreadsheet programs +### 表計算ソフトの日付に関する注意 -Dates in spreadsheets are generally stored in a single column. 
While -this seems the most natural way to record dates, it actually is not -best practice. A spreadsheet application will display the dates in a -seemingly correct way (to a human observer) but how it actually -handles and stores the dates may be problematic. It is often much -safer to store dates with YEAR, MONTH and DAY in separate columns or -as YEAR and DAY-OF-YEAR in separate columns. +スプレッドシートの日付は通常、1つの列に格納される。 +これが日付を記録する最も自然な方法のように思えるが、実際には +ベストプラクティスではない。 スプレッドシート・アプリケーションは、 +一見正しい方法で日付を表示する(人間の観察者には)。しかし、実際に +どのように日付を処理し、保存するかには問題があるかもしれない。 YEAR、MONTH、DAYを別々のカラムに、または +、YEARとDAY-OF-YEARを別々のカラムに保存した方が、 +より安全な場合が多い。 -Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, -Gnumeric, ... have different (and often incompatible) ways of encoding -dates (even for the same program between versions and operating -systems). Additionally, Excel can turn things that aren't dates into -dates -(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, -OCT4. So if you're avoiding the date format overall, it's easier to -identify these issues. +LibreOffice、Microsoft Excel、OpenOffice、 +Gnumericなどの表計算プログラム。 は、 +日付のエンコード方法が異なる(そしてしばしば互換性がない)(同じプログラムであっても、バージョンやオペレーティング +システム間で)。 さらに、エクセルは日付でないものを +日付に変える +(@Zeeberg:2004)ことができる。例えば、MAR1、DEC1、 +OCT4のような名前や識別子である。 そのため、全体的に日付フォーマットを避けているのであれば、 +、こうした問題を特定しやすくなる。 -The Dates as +Data CarpentryレッスンのDates as data -section of the Data Carpentry lesson provides additional insights -about pitfalls of dates with spreadsheets. +セクションでは、スプレッドシートを使った日付の落とし穴について、さらなる洞察 +を提供しています。 -We are going to use the `ymd()` function from the package -**`lubridate`** (which belongs to the **`tidyverse`**; learn more -[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed -as part of the **`tidyverse`** installation. When you load the -**`tidyverse`** (`library(tidyverse)`), the core packages (the -packages used in most data analyses) get loaded. 
**`lubridate`** -however does not belong to the core tidyverse, so you have to load it -explicitly with `library(lubridate)`. +**lubridate`** パッケージの `ymd()` 関数を使用します (**tidyverse`** に属します。詳しくは +[こちら](https://www.tidyverse.org/))。 . \*\*lubridate`**は**tidyverse`\*\*のインストールの一部として +。 +**`tidyverse`** (`library(tidyverse)`) をロードすると、コアパッケージ (ほとんどのデータ分析で使用される +パッケージ) がロードされます。 **`lubridate`** +しかし、コアTidyverseには属さないので、 +`library(lubridate)`で明示的にロードする必要があります。 -Start by loading the required package: +必要なパッケージをロードすることから始める: ```{r loadlibridate, message=FALSE, purl=TRUE} library("lubridate") ``` -`ymd()` takes a vector representing year, month, and day, and converts -it to a `Date` vector. `Date` is a class of data recognized by R as -being a date and can be manipulated as such. The argument that the -function requires is flexible, but, as a best practice, is a character -vector formatted as "YYYY-MM-DD". +ymd()`は年、月、日を表すベクトルを受け取り、 +`Date`ベクトルに変換する。 Date`はRが +、日付であると認識するデータのクラスであり、そのように操作することができる。 +関数が必要とする引数は柔軟であるが、ベストプラクティスとしては、"YYYY-MM-DD "としてフォーマットされた文字 +ベクトルである。 -Let's create a date object and inspect the structure: +日付オブジェクトを作成し、構造を調べてみよう: ```{r, purl=TRUE} my_date <- ymd("2015-01-01") str(my_date) ``` -Now let's paste the year, month, and day separately - we get the same result: +では、年、月、日を別々に貼り付けてみよう: ```{r, purl=TRUE} -# sep indicates the character to use to separate each component +# sep は各コンポーネントを区切るために使う文字を示す my_date <- ymd(paste("2015", "1", "1", sep = "-")) str(my_date) ``` -Let's now familiarise ourselves with a typical date manipulation -pipeline. The small data below has stored dates in different `year`, -`month` and `day` columns. 
+それでは、典型的な日付操作 +のパイプラインに慣れておこう。 以下の小さなデータには、別々の `year`、 +`month`、`day` 列に日付が格納されている。 ```{r, purl=TRUE} x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), - day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` -Now we apply this function to the `x` dataset. We first create a -character vector from the `year`, `month`, and `day` columns of `x` -using `paste()`: +次に、この関数を `x` データセットに適用する。 まず、`paste()` を使って、`x` +の `year`、`month`、`day` 列から +文字ベクトルを作る: ```{r, purl=TRUE} paste(x$year, x$month, x$day, sep = "-") ``` -This character vector can be used as the argument for `ymd()`: +この文字ベクトルは `ymd()` の引数として使うことができる: ```{r, purl=TRUE} ymd(paste(x$year, x$month, x$day, sep = "-")) ``` -The resulting `Date` vector can be added to `x` as a new column called `date`: +出来上がった `Date` ベクトルは `x` に `date` という新しいカラムとして追加することができる: ```{r, purl=TRUE} x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) -str(x) # notice the new column, with 'date' as the class +str(x) # 'date' をクラスとする新しいカラムに注目。 ``` -Let's make sure everything worked correctly. One way to inspect the -new column is to use `summary()`: +すべてが正しく機能していることを確認しよう。 +新しいカラムを検査する一つの方法は、`summary()`を使うことである: ```{r, purl=TRUE} summary(x$date) ``` -Note that `ymd()` expects to have the year, month and day, in that -order. If you have for instance day, month and year, you would need -`dmy()`. +`ymd()` は、年、月、日が +この順番で並んでいることを期待している。 例えば、日、月、年の順であれば、 +`dmy()` が必要になる。 ```{r, purl=TRUE} dmy(paste(x$day, x$month, x$year, sep = "-")) ``` -`lubdridate` has many functions to address all date variations. 
+`lubridate` は、あらゆる日付のバリエーションに対応する多くの関数を持っている。 -## Summary of R objects +## Rオブジェクトの概要 -So far, we have seen several types of R object varying in the number -of dimensions and whether they could store a single or multiple data -types: +これまで、次元数や、 +格納できるデータ型が +単一か複数かによって異なる、いくつかのタイプのRオブジェクトを見てきた: -- **`vector`**: one dimension (they have a length), single type of data. -- **`matrix`**: two dimensions, single type of data. -- **`data.frame`**: two dimensions, one type per column. +- **`vector`**:1次元(長さがある)、1種類のデータ。 +- **`matrix`**:2次元、1種類のデータ。 +- **`data.frame`**:2次元、列ごとに1つの型。 -## Lists +## リスト -A data type that we haven't seen yet, but that is useful to know, and -follows from the summary that we have just seen are lists: +まだ見ていないが、知っておくと便利で、 +先ほどのまとめから自然に続くデータ型がリストである: -- **`list`**: one dimension, every item can be of a different data - type. +- **`list`**: 1つの次元で、各項目は異なるデータ + 型にすることができる。 -Below, let's create a list containing a vector of numbers, characters, -a matrix, a dataframe and another list: +以下では、数値のベクトル、文字のベクトル、 +行列、データフレーム、そして別のリストを含むリストを作ってみよう: ```{r list0, purl=TRUE} l <- list(1:10, ## numeric letters, ## character installed.packages(), ## a matrix cars, ## a data.frame - list(1, 2, 3)) ## a list + list(1, 2, 3)) ## リスト length(l) str(l) ``` -List subsetting is done using `[]` to subset a new sub-list or `[[]]` -to extract a single element of that list (using indices or names, if -the list is named). 
+リストのサブセットは `[]` を使って新しいサブリストを取り出すか、`[[]]` +を使ってそのリストの単一要素を取り出す( +リストに名前がついている場合は、インデックスか名前を使う)。 ```{r, purl=TRUE} -l[[1]] ## first element -l[1:2] ## a list of length 2 -l[1] ## a list of length 1 +l[[1]] ## 最初の要素 +l[1:2] ## 長さ 2 のリスト +l[1] ## 長さ 1 のリスト ``` -## Exporting and saving tabular data {#sec:exportandsave} +## 表形式データのエクスポートと保存 {#sec:exportandsave} -We have seen how to read a text-based spreadsheet into R using the -`read.table` family of functions. 
To export a `data.frame` to a -text-based spreadsheet, we can use the `write.table` set of functions -(`write.csv`, `write.delim`, ...). They all take the variable to be -exported and the file to be exported to. For example, to export the -`rna` data to the `my_rna.csv` file in the `data_output` -directory, we would execute: +`read.table` ファミリーの関数を使って、テキストベースのスプレッドシートをRに読み込む方法を見てきた。 data.frame`を +テキストベースのスプレッドシートにエクスポートするには、 +関数の `write.table` セット(`write.csv`, `write.delim`, ...)を使用します。 これらはすべて、 +エクスポートする変数と、エクスポートするファイルを指定する。 例えば、 +`rna`のデータを`data_output`ディレクトリの`my_rna.csv\` ファイルにエクスポートするには、次のように実行する: ```{r, eval=FALSE, purl=TRUE} write.csv(rna, file = "data_output/my_rna.csv") ``` -This new csv file can now be shared with other collaborators who -aren't familiar with R. Note that even though there are commas in some of -the fields in the `data.frame` (see for example the "product" column), R will -by default surround each field with quotes, and thus we will be able to -read it back into R correctly, despite also using commas as column -separators. 
+この新しいcsvファイルは、 +、Rに精通していない他の共同研究者と共有することができます。`data.frame`のフィールドの一部(例えば、"product "列を参照)にカンマがあるにもかかわらず、Rはデフォルトで +、各フィールドを引用符で囲みます。したがって、 +、列の区切り文字としてカンマを使用しているにもかかわらず、 +、Rに正しく読み込むことができます。 -:::::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント -- Tabular data in R +- Rでの表形式データ :::::::::::::::::::::::::::::::::::::::::::::::::: From 4560ad92554726ee7afaed8d2b2c08a56e80a2b8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 13 May 2024 09:38:03 +0900 Subject: [PATCH 167/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 820 ++++++++++++++++---------------- 1 file changed, 402 insertions(+), 418 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 0af50f431..acf596daa 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Manipulating and analysing data with dplyr +title: dplyrによるデータの操作と分析 teaching: 75 exercises: 75 --- @@ -10,18 +10,17 @@ exercises: 75 ::::::::::::::::::::::::::::::::::::::: 目的 -- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. -- Describe several of their functions that are extremely useful to - manipulate data. -- Describe the concept of a wide and a long table format, and see - how to reshape a data frame from one format to the other one. -- Demonstrate how to join tables. 
+- dplyr`** と **tidyr`\*\* パッケージの目的を説明する。 +- データを操作するのに非常に便利な関数をいくつか説明する。 +- ワイド表形式とロング表形式の概念を説明し、 + 、データ・フレームを一方の形式から他方の形式に変更する方法を見る。 +- テーブルの結合方法を示す。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- Data analysis in R using the tidyverse meta-package +- tidyverseメタパッケージを用いたRでのデータ分析 :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -31,159 +30,155 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "data/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> このエピソードは、Data Carpentriesの_Data Analysis and +> Visualisation in R for Ecologists_レッスンに基づいています。 -## Data manipulation using **`dplyr`** and **`tidyr`** +## dplyr`**と**tidyr`\*\*を使ったデータ操作 -Bracket subsetting is handy, but it can be cumbersome and difficult to -read, especially for complicated operations. +ブラケット・サブセットは便利だが、煩雑で +、特に複雑な操作では読みにくい。 -Some packages can greatly facilitate our task when we manipulate data. -Packages in R are basically sets of additional functions that let you -do more stuff. The functions we've been using so far, like `str()` or -`data.frame()`, come built into R; Loading packages can give you access to other -specific functions. Before you use a package for the first time you need to install -it on your machine, and then you should import it in every subsequent -R session when you need it. +いくつかのパッケージは、データを操作する際に私たちの作業を大いに助けてくれる。 +Rのパッケージは基本的に、 +、より多くのことができるようにする追加関数のセットである。 これまで使ってきた `str()` や +`data.frame()` などの関数は、Rに組み込まれています。パッケージをロードすることで、その他の +固有の関数にアクセスできるようになります。 初めてパッケージを使用する前に、 +をマシンにインストールする必要がある。その後、 +R セッションでパッケージが必要になったら、毎回インポートする必要がある。 -- The package **`dplyr`** provides powerful tools for data manipulation tasks. - It is built to work directly with data frames, with many manipulation tasks - optimised. 
+- dplyr\`\*\* パッケージは、データ操作タスクのための強力なツールを提供します。 + データフレームを直接操作できるように構築されており、多くの操作タスクが + に最適化されている。 -- As we will see latter on, sometimes we want a data frame to be reshaped to be able - to do some specific analyses or for visualisation. The package **`tidyr`** addresses - this common problem of reshaping data and provides tools for manipulating - data in a tidy way. +- 後述するように、 + 、特定の分析や視覚化を行うために、データフレームの形を変えたいことがある。 tidyr\`\*\*パッケージは、 + 、データの形を変えるというこの一般的な問題に対処し、 + データを整然と操作するためのツールを提供する。 -To learn more about **`dplyr`** and **`tidyr`** after the workshop, -you may want to check out this handy data transformation with +ワークショップの後、\*\*dplyr`**と**tidyr`\*\*についてもっと知りたい方は、 +、こちらのhandy data transformation with +をご覧ください。 -and this one about -. +- tidyverse`**パッケージは "umbrella-package "であり、 + 、データ解析のためのいくつかの便利なパッケージがインストールされます。 + には、**tidyr`\*\*, **dplyr`**, **ggplot2`**, \*\*tibble\`\*\*などがあります。 + これらのパッケージは、データを操作したり対話したりするのに役立ちます。 + サブセット化、変換、 + ビジュアライズなど、データを使ってさまざまなことができる。 -- The **`tidyverse`** package is an "umbrella-package" that installs - several useful packages for data analysis which work well together, - such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. - These packages help us to work and interact with the data. - They allow us to do many things with your data, such as subsetting, transforming, - visualising, etc. - -If you did the set up, you should have already installed the tidyverse package. -Check to see if you have it by trying to load in from the library: +セットアップを行ったのであれば、すでにtidyverseパッケージがインストールされているはずです。 +ライブラリから読み込んでみて、それがあるかどうか確認してください: ```{r, message=FALSE, purl=TRUE} -## load the tidyverse packages, incl. dplyr +## dplyr を含む tidyverse パッケージをロード library("tidyverse") ``` -If you got an error message `there is no package called ‘tidyverse’` then you have not -installed the package yet for this version of R. 
To install the **`tidyverse`** package type: +tidyverse\`\*\* パッケージをインストールするには、以下のようにタイプしてください: ```{r, eval=FALSE, purl=TRUE} BiocManager::install("tidyverse") ``` -If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! +もし、\*\*tidyverse`**パッケージをインストールしなければならなかったなら、上記の`library()\`コマンドを使って、このRセッションでロードすることを忘れないでください! -## Loading data with tidyverse +## tidyverseでデータをロードする -Instead of `read.csv()`, we will read in our data using the `read_csv()` -function (notice the `_` instead of the `.`), from the tidyverse package -**`readr`**. +read.csv()`の代わりに、tidyverseパッケージ **readr`\*\*の `read_csv()` +関数(`.`の代わりに`_`があることに注意)を使ってデータを読み込みます。 ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## view the data +## データを見る rna ``` -Notice that the class of the data is now referred to as a "tibble". +データのクラスが "tibble "と呼ばれていることに注目してほしい。 -Tibbles tweak some of the behaviors of the data frame objects we introduced in the -previously. The data structure is very similar to a data frame. For our purposes -the only differences are that: +Tibblesは、以前 +で紹介したデータ・フレーム・オブジェクトの動作の一部を微調整している。 データ構造はデータフレームによく似ている。 +我々の目的にとって、唯一の違いはそれだ: -1. It displays the data type of each column under its name. - Note that \<`dbl`\> is a data type defined to hold numeric values with - decimal points. +1. 各列のデータ型が列名の下に表示される。 + <`dbl`\> は + の小数点を持つ数値を保持するために定義されたデータ型である。 -2. It only prints the first few rows of data and only as many columns as fit on - one screen. +2. 
これは、データの最初の数行と、 + 1画面に収まるだけの列数だけを印刷する。 -We are now going to learn some of the most common **`dplyr`** functions: +これから、最も一般的な **dplyr\`** 関数のいくつかを学びます: -- `select()`: subset columns -- `filter()`: subset rows on conditions -- `mutate()`: create new columns by using information from other columns -- `group_by()` and `summarise()`: create summary statistics on grouped data -- `arrange()`: sort results -- `count()`: count discrete values +- select()\`: カラムのサブセット +- `filter()`: 条件で行をサブセットする。 +- mutate()\`: 他のカラムの情報を使って新しいカラムを作成する。 +- group_by()`と`summarise()\`: グループ化されたデータの要約統計量を作成する。 +- arrange()\`:結果の並べ替え +- count()\`: 離散値を数える -## Selecting columns and filtering rows +## 列の選択と行のフィルタリング -To select columns of a data frame, use `select()`. The first argument -to this function is the data frame (`rna`), and the subsequent -arguments are the columns to keep. +データフレームの列を選択するには `select()` を使う。 この関数の最初の引数 +はデータフレーム (`rna`) で、続く +の引数は保持する列です。 ```{r, purl=TRUE} select(rna, gene, sample, tissue, expression) ``` -To select all columns _except_ certain ones, put a "-" in front of -the variable to exclude it. +特定の列を除く\*すべての列を選択するには、 +その変数の前に"-"を付けて除外する。 ```{r, purl=TRUE} select(rna, -tissue, -organism) ``` -This will select all the variables in `rna` except `tissue` -and `organism`. +これは `rna` の中の、 +`tissue` と `organism` 以外のすべての変数を選択する。 -To choose rows based on a specific criteria, use `filter()`: +特定の条件に基づいて行を選択するには、`filter()` を使用する: ```{r, purl=TRUE} filter(rna, sex == "Male") filter(rna, sex == "Male" & infection == "NonInfected") ``` -Now let's imagine we are interested in the human homologs of the mouse -genes analysed in this dataset. This information can be found in the -last column of the `rna` tibble, named -`hsapiens_homolog_associated_gene_name`. To visualise it easily, we -will create a new table containing just the 2 columns `gene` and -`hsapiens_homolog_associated_gene_name`. 
+ここで、このデータセットで解析されたマウス +遺伝子のヒトホモログに興味があるとしよう。 この情報は、 +`hsapiens_homolog_associated_gene_name` という名前の `rna` tibbleの +最後のカラムにある。 簡単に視覚化するために、 +、2つの列`gene`と +`hsapiens_homolog_associated_gene_name`だけを含む新しいテーブルを作成する。 ```{r} genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) genes ``` -Some mouse genes have no human homologs. These can be retrieved using -`filter()` and the `is.na()` function, that determines whether -something is an `NA`. +マウス遺伝子の中にはヒトにホモログがないものもある。 これらは、 +`filter()` と、 +何かが `NA` かどうかを判定する `is.na()` 関数を使って取得することができる。 ```{r, purl=TRUE} filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` -If we want to keep only mouse genes that have a human homolog, we can -insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an -`NA`. +ヒトのホモログを持つマウス遺伝子だけを保持したい場合、 +、結果を否定する"!"記号を挿入することができる。したがって、 +、hsapiens_homolog_associated_gene_name _is not_ an +`NA` となるすべての行を求めることになる。 ```{r, purl=TRUE} filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` -## Pipes +## パイプ -What if you want to select and filter at the same time? There are three -ways to do this: use intermediate steps, nested functions, or pipes. +選択とフィルタを同時に行いたい場合は? これを行うには、 +、中間ステップ、ネストされた関数、パイプの3つの方法がある。 -With intermediate steps, you create a temporary data frame and use -that as input to the next function, like this: +中間ステップでは、一時的なデータフレームを作成し、 +、次の関数の入力として使用する: ```{r, purl=TRUE} rna2 <- filter(rna, sex == "Male") @@ -191,40 +186,39 @@ rna3 <- select(rna2, gene, sample, tissue, expression) rna3 ``` -This is readable, but can clutter up your workspace with lots of -intermediate objects that you have to name individually. With multiple -steps, that can be hard to keep track of. +これは読みやすいが、 +、個別に名前を付けなければならない中間オブジェクトがたくさんあるため、ワークスペースが散らかる可能性がある。 複数の +、それを把握するのは難しいかもしれない。 -You can also nest functions (i.e. 
one function inside of another), -like this: +関数を入れ子にする(ある関数を別の関数の中に入れる)こともできる: ```{r, purl=TRUE} -rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) rna3 ``` -This is handy, but can be difficult to read if too many functions are nested, as -R evaluates the expression from the inside out (in this case, filtering, then selecting). +これは便利だが、 +Rは式を内側から外側へと評価する(この場合、フィルタリングしてから選択する)ため、関数が入れ子になりすぎると読みにくくなることがある。 -The last option, _pipes_, are a recent addition to R. Pipes let you take -the output of one function and send it directly to the next, which is useful -when you need to do many things to the same dataset. +最後のオプションである _パイプ_ は、Rに最近追加されたものである。パイプを使うと、ある関数の出力を +次の関数に直接送ることができる。これは、同じデータセットに対して多くの処理を行う必要がある場合に +便利である。 -Pipes in R look like `%>%` (made available via the **`magrittr`** -package) or `|>` (through base R). If you use RStudio, you can type -the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you -have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you -have a Mac. +R のパイプは `%>%` (**`magrittr`** +パッケージで利用可能) または `|>` (ベース R で利用可能) のように書きます。 RStudioを使用する場合は、 +PCでは <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd>、 +Macでは <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> でパイプを +入力できます。 -In the above code, we use the pipe to send the `rna` dataset first -through `filter()` to keep rows where `sex` is Male, then through -`select()` to keep only the `gene`, `sample`, `tissue`, and -`expression`columns. +上記のコードでは、パイプを使って `rna` データセットをまず +`filter()` に通して `sex` が Male である行を残し、次に +`select()` に通して `gene`、`sample`、`tissue`、 +`expression` の列だけを残している。 -The pipe `%>%` takes the object on its left and passes it directly as -the first argument to the function on its right, we don't need to -explicitly include the data frame as an argument to the `filter()` and -`select()` functions any more. 
+パイプ `%>%` はその左側にあるオブジェクトを受け取り、 +その右側にある関数の最初の引数として直接渡します。 +`filter()` と +`select()` 関数の引数として明示的にデータフレームを含める必要はもうありません。 ```{r, purl=TRUE} rna %>% @@ -232,20 +226,20 @@ rna %>% select(gene, sample, tissue, expression) ``` -Some may find it helpful to read the pipe like the word "then". For instance, -in the above example, we took the data frame `rna`, _then_ we `filter`ed -for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, -`tissue`, and `expression`. +パイプを "then "のように読むことが役に立つと思う人もいるだろう。 例えば、 +上の例では、データフレーム `rna` を取得し、`sex=="Male"` の行を +で `フィルター`し、`gene`, `sample`, +`tissue`, `expression` の列を `選択` した。 -The **`dplyr`** functions by themselves are somewhat simple, but by -combining them into linear workflows with the pipe, we can accomplish -more complex manipulations of data frames. +dplyr\`\*\*関数はそれ自体ではやや単純だが、 +、パイプを使った線形ワークフローに組み合わせることで、 +、データフレームのより複雑な操作を行うことができる。 -If we want to create a new object with this smaller version of the data, we -can assign it a new name: +この小さいバージョンのデータで新しいオブジェクトを作りたい場合、 +、新しい名前を割り当てることができる: ```{r, purl=TRUE} -rna3 <- rna %>% +rna3<- rna %>% filter(sex == "Male") %>% select(gene, sample, tissue, expression) @@ -254,15 +248,15 @@ rna3 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジだ: -Using pipes, subset the `rna` data to keep observations in female mice at time 0, -where the gene has an expression higher than 50000, and retain only the columns -`gene`, `sample`, `time`, `expression` and `age`. +パイプを使用して、時間0、 +、遺伝子の発現が50000より高い雌マウスのオブザベーションを保持するように`rna`データをサブセットし、 +`gene`、`sample`、`time`、`expression`、`age`の列のみを保持する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r} rna %>% @@ -276,13 +270,13 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::::: -## Mutate +## ミューテート -Frequently you'll want to create new columns based on the values of existing -columns, for example to do unit conversions, or to find the ratio of values in two -columns. 
For this we'll use `mutate()`. +例えば、単位変換をしたり、2つの +列の値の比率を求めたりするために、既存の +列の値に基づいて新しい列を作成したいことがよくあります。 これには `mutate()` を使う。 -To create a new column of time in hours: +時間単位の新しい列を作成する: ```{r, purl=TRUE} rna %>% @@ -290,7 +284,7 @@ rna %>% select(time, time_hours) ``` -You can also create a second new column based on the first new column within the same call of `mutate()`: +また、`mutate()`の同じ呼び出しの中で、最初の新しいカラムに基づいて2番目の新しいカラムを作成することもできる: ```{r, purl=TRUE} rna %>% @@ -301,21 +295,21 @@ rna %>% ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Create a new data frame from the `rna` data that meets the following -criteria: contains only the `gene`, `chromosome_name`, -`phenotype_description`, `sample`, and `expression` columns. The expression -values should be log-transformed. This data frame must -only contain genes located on sex chromosomes, associated with a -phenotype\_description, and with a log expression higher than 5. +以下の +条件を満たす `rna` データから新しいデータフレームを作成する: `gene`、`chromosome_name`、 +`phenotype_description`、`sample`、`expression` 列のみを含む。 +の値は対数変換する。 このデータフレームは、 +、性染色体に位置し、 +phenotype_descriptionに関連し、log expressionが5より高い遺伝子のみを含んでいなければならない。 -**Hint**: think about how the commands should be ordered to produce -this data frame! +**ヒント**:このデータフレームを +、どのようにコマンドを並べるべきか考えてみよう! -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, eval=TRUE, purl=TRUE} rna %>% @@ -330,60 +324,59 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::::: -## Split-apply-combine data analysis +## 分割-適用-結合データ分析 -Many data analysis tasks can be approached using the -_split-apply-combine_ paradigm: split the data into groups, apply some -analysis to each group, and then combine the results. **`dplyr`** -makes this very easy through the use of the `group_by()` function. 
+多くのデータ分析タスクは、 +_split-apply-combine_ パラダイムを使ってアプローチすることができる:データをグループに分割し、各グループに何らかの +分析を適用し、その結果を組み合わせる。 **`dplyr`** +は `group_by()` 関数を使って、これを非常に簡単にしている。 ```{r} rna %>% - group_by(gene) + group_by(gene) ``` -The `group_by()` function doesn't perform any data processing, it -groups the data into subsets: in the example above, our initial -`tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$gene))` groups based on the `gene` variable. +`group_by()` 関数はデータ処理を行わず、 +データをサブセットにグループ化する。上の例では、 +`r nrow(rna)` オブザベーションからなる最初の `tibble` は、`gene` 変数に基づいて `r length(unique(rna$gene))` グループに分割される。 -We could similarly decide to group the tibble by the samples: +同様に、ティブルをサンプルごとにグループ分けすることもできる: ```{r} rna %>% group_by(sample) ``` -Here our initial `tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$sample))` groups based on the `sample` variable. +ここで、`r nrow(rna)` オブザベーションからなる最初の `tibble` は、`sample` 変数に基づいて、 +`r length(unique(rna$sample))` グループに分割される。 -Once the data has been grouped, subsequent operations will be -applied on each group independently. +いったんデータがグループ化されると、その後の操作は各グループに独立して +適用される。 -### The `summarise()` function +### `summarise()` 関数 -`group_by()` is often used together with `summarise()`, which -collapses each group into a single-row summary of that group. +`group_by()` は `summarise()` と一緒に使われることが多く、 +`summarise()` は各グループをそのグループの1行の要約に折りたたむ。 -`group_by()` takes as arguments the column names that contain the -**categorical** variables for which you want to calculate the summary -statistics. 
So to compute the mean `expression` by gene: +`group_by()` は、 +要約統計量を計算したい **カテゴリー** 変数を含むカラム名を引数として +取る。 そこで、遺伝子ごとの平均 `expression` を計算するには: ```{r} rna %>% - group_by(gene) %>% + group_by(gene) %>% summarise(mean_expression = mean(expression)) ``` -We could also want to calculate the mean expression levels of all genes in each sample: +また、各サンプルの全遺伝子の平均発現量を計算することもできる: ```{r} rna %>% - group_by(sample) %>% + group_by(sample) %>% summarise(mean_expression = mean(expression)) ``` -But we can can also group by multiple columns: +さらに、複数の列でグループ化することもできる: ```{r} rna %>% @@ -391,26 +384,26 @@ rna %>% summarise(mean_expression = mean(expression)) ``` -Once the data is grouped, you can also summarise multiple variables at the same -time (and not necessarily on the same variable). For instance, we could add a -column indicating the median `expression` by gene and by condition: +いったんデータがグループ化されると、複数の変数を同時に +要約することもできる(必ずしも同じ変数に対してでなくてもよい)。 例えば、遺伝子別、条件別の `expression` の中央値を示す +列を追加することができる: ```{r, purl=TRUE} rna %>% - group_by(gene, infection, time) %>% - summarise(mean_expression = mean(expression), + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), median_expression = median(expression)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -Calculate the mean expression level of gene "Dok3" by timepoints. +遺伝子 "Dok3" のタイムポイントごとの平均発現量を計算する。 -::::::::::::::: solution +:::::::::::::::::::: solution -## Solution +## ソリューション ```{r, purl=TRUE} rna %>% @@ -423,47 +416,47 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::::: -### Counting +### カウント -When working with data, we often want to know the number of observations found -for each factor or combination of factors. For this task, **`dplyr`** provides -`count()`. 
For example, if we wanted to count the number of rows of data for -each infected and non-infected samples, we would do: +データで作業するとき、我々はしばしば、各因子または因子の組み合わせについて +見つかったオブザベーションの数を知りたい。 このタスクのために、**`dplyr`** は +`count()` を提供している。 例えば、感染したサンプルと感染していないサンプルそれぞれについて、 +データの行数をカウントしたい場合、次のようにする: ```{r, purl=TRUE} rna %>% - count(infection) + count(infection) ``` -The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: +`count()` 関数は、すでに見た操作、つまり変数でグループ化し、そのグループ内のオブザベーションの数を数えて要約する操作の省略記法です。 言い換えれば、`rna %>% count(infection)` は次のものと等価である: ```{r, purl=TRUE} rna %>% - group_by(infection) %>% + group_by(infection) %>% summarise(n = n()) ``` -The previous example shows the use of `count()` to count the number of rows/observations -for _one_ factor (i.e., `infection`). -If we wanted to count a _combination of factors_, such as `infection` and `time`, -we would specify the first and the second factor as the arguments of `count()`: +先ほどの例では、`count()` を使って、_1つの_ 因子(つまり `infection`)について +行数/オブザベーション数を数えている。 +もし、`infection` と `time` のような _因子の組み合わせ_ をカウントしたいのであれば、 +`count()` の引数として1つ目と2つ目の因子を指定する: ```{r, purl=TRUE} rna %>% - count(infection, time) + count(infection, time) ``` -which is equivalent to this: +これは次のものと等価である: ```{r, purl=TRUE} rna %>% - group_by(infection, time) %>% + group_by(infection, time) %>% summarise(n = n()) ``` -It is sometimes useful to sort the result to facilitate the comparisons. -We can use `arrange()` to sort the table. 
-For instance, we might want to arrange the table above by time: +比較を容易にするために、結果を並べ替えると便利なことがある。 +`arrange()` を使って表を並べ替えることができる。 +例えば、上の表を時間順に並べたいとする: ```{r, purl=TRUE} rna %>% @@ -471,7 +464,7 @@ rna %>% arrange(time) ``` -or by counts: +あるいは件数で: ```{r, purl=TRUE} rna %>% @@ -479,7 +472,7 @@ rna %>% arrange(n) ``` -To sort in descending order, we need to add the `desc()` function: +降順にソートするには、`desc()` 関数を追加する必要がある: ```{r, purl=TRUE} rna %>% @@ -489,25 +482,25 @@ rna %>% ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## チャレンジ -1. How many genes were analysed in each sample? -2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? -3. Pick one sample and evaluate the number of genes by biotype. -4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. +1. 各サンプルで分析された遺伝子の数は? +2. `group_by()` と `summarise()` を使用して、各サンプルのシーケンス深度(全カウントの合計)を評価する。 シーケンス深度が最も高いサンプルはどれですか? +3. サンプルを1つ選び、バイオタイプ別に遺伝子数を評価する。 +4. "abnormal DNA methylation" という表現型の記述に関連する遺伝子を特定し、時間0、時間4、時間8における平均発現量(対数)を計算する。 -::::::::::::::: solution +:::::::::::::::::::: solution -## Solution +## ソリューション ```{r} -## 1. +## 1. rna %>% count(sample) ## 2. rna %>% group_by(sample) %>% - summarise(seq_depth = sum(expression)) %>% + summarise(seq_depth = sum(expression)) %>% arrange(desc(seq_depth)) ## 3. rna %>% @@ -518,7 +511,7 @@ rna %>% rna %>% filter(phenotype_description == "abnormal DNA methylation") %>% group_by(gene, time) %>% - summarise(mean_expression = mean(log(expression))) %>% + summarise(mean_expression = mean(log(expression))) %>% arrange() ``` @@ -526,31 +519,30 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::::: -## Reshaping data +## データの再構築 -In the `rna` tibble, the rows contain expression values (the unit) that are -associated with a combination of 2 other variables: `gene` and `sample`. 
+rna`tibble の行には、`gene`と`sample\` という2つの変数の組み合わせに関連付けられた発現値(単位)が格納されている。 -All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). -The variables that don't change with genes or with samples will have the same value in all the rows. +その他の列はすべて、 +(生物、年齢、性別、...)のいずれかを記述する変数に対応している。 または遺伝子(gene_biotype, ENTREZ_ID, product, ...)。 +遺伝子やサンプルによって変化しない変数は、すべての行で同じ値を持つ。 ```{r} rna %>% arrange(gene) ``` -This structure is called a `long-format`, as one column contains all the values, -and other column(s) list(s) the context of the value. +この構造は`long-format`と呼ばれ、1つのカラムにはすべての値、 +、もう1つのカラムには値のコンテキストが列挙されている。 -In certain cases, the `long-format` is not really "human-readable", and another format, -a `wide-format` is preferred, as a more compact way of representing the data. -This is typically the case with gene expression values that scientists are used to -look as matrices, were rows represent genes and columns represent samples. +場合によっては、`long-format`は実際には "human-readable "ではなく、別のフォーマット、 +`wide-format`がよりコンパクトにデータを表現する方法として好まれる。 +これは通常、科学者が +、行が遺伝子、列がサンプルを表す行列として見るのに慣れている遺伝子発現値の場合である。 -In this format, it would therefore become straightforward -to explore the relationship between the gene expression levels within, and -between, the samples. +このフォーマットでは、 +、サンプル内の遺伝子発現レベルとサンプル間の遺伝子発現レベル +の関係を調べることができる。 ```{r, echo=FALSE} rna %>% @@ -559,73 +551,72 @@ rna %>% values_from = expression) ``` -To convert the gene expression values from `rna` into a wide-format, -we need to create a new table where the values of the `sample` column would -become the names of column variables. +rna`の遺伝子発現値をワイドフォーマットに変換するには、 +、`sample\`カラムの値が +、カラム変数の名前になる新しいテーブルを作成する必要がある。 -The key point here is that we are still following -a tidy data structure, but we have **reshaped** the data according to -the observations of interest: expression levels per gene instead -of recording them per gene and per sample. 
+ここでの重要なポイントは、我々はまだ +整然としたデータ構造に従っているが、 +興味のある観察に従ってデータを**整形**したということである:遺伝子ごと、サンプルごとに記録する代わりに、遺伝子ごとの発現レベルを +記録している。 -The opposite transformation would be to transform column names into -values of a new variable. +逆の変換は、列名を新しい変数の値に +変換することである。 -We can do both these of transformations with two `tidyr` functions, -`pivot_longer()` and `pivot_wider()` (see -[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for -details). +`pivot_longer()` と `pivot_wider()` の2つの `tidyr` 関数を使って、これらの変換を行うことができます(詳細は +[こちら](https://tidyr.tidyverse.org/dev/articles/pivot.html) を +参照してください)。 -### Pivoting the data into a wider format +### より広いフォーマットへのデータのピボット -Let's select the first 3 columns of `rna` and use `pivot_wider()` -to transform the data into a wide-format. +`rna` の最初の3列を選択し、`pivot_wider()` +を使ってデータをワイドフォーマットに変換してみよう。 ```{r, purl=TRUE} -rna_exp <- rna %>% +rna_exp <- rna %>% select(gene, sample, expression) rna_exp ``` -`pivot_wider` takes three main arguments: +`pivot_wider` は主に3つの引数を取る: -1. the data to be transformed; -2. the `names_from` : the column whose values will become new column - names; -3. the `values_from`: the column whose values will fill the new - columns. +1. 変換されるデータ; +2. `names_from`: その値が新しいカラムの + 名前になるカラム; +3. `values_from`: その値が新しいカラムを + 埋めるカラム。 -\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +```{r, fig.cap="`rna` データのワイドピボット。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -```` +``` -```{r, purl=TRUE} -rna_wide <- rna_exp %>% +```{r, purl=TRUE} +rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` -Note that by default, the `pivot_wider()` function will add `NA` for missing values. +デフォルトでは、`pivot_wider()` 関数は欠損値に対して `NA` を追加することに注意してください。 -Let's imagine that for some reason, we had some missing expression values for some -genes in certain samples. 
In the following fictive example, the gene Cyp2d22 has only -one expression value, in GSM2545338 sample. +何らかの理由で、あるサンプルで +、いくつかの遺伝子の発現値が欠落していたとしよう。 以下の架空の例では、遺伝子Cyp2d22の発現値はGSM2545338サンプルの +。 ```{r, purl=TRUE} -rna_with_missing_values <- rna %>% +rna_with_missing_values<- rna %>% select(gene, sample, expression) %>% filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% arrange(sample) %>% - filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) rna_with_missing_values ``` -By default, the `pivot_wider()` function will add `NA` for missing -values. This can be parameterised with the `values_fill` argument of -the `pivot_wider()` function. +デフォルトでは、`pivot_wider()`関数は、 +の値が見つからない場合に `NA` を追加する。 これは、 +`pivot_wider()` 関数の `values_fill` 引数でパラメータ化できる。 ```{r, purl=TRUE} rna_with_missing_values %>% @@ -638,70 +629,70 @@ rna_with_missing_values %>% values_fill = 0) ``` -### Pivoting data into a longer format +### データを長いフォーマットにピボットする -In the opposite situation we are using the column names and turning them into -a pair of new variables. One variable represents the column names as -values, and the other variable contains the values previously -associated with the column names. +逆の状況では、カラム名を使い、 +、新しい変数のペアに変えている。 一方の変数はカラム名を +の値で表し、もう一方の変数にはカラム名に関連付けられている以前の値 +が格納されている。 -`pivot_longer()` takes four main arguments: +pivot_longer()\`は主に4つの引数を取る: -1. the data to be transformed; -2. the `names_to`: the new column name we wish to create and populate with the - current column names; -3. the `values_to`: the new column name we wish to create and populate with - current values; -4. the names of the columns to be used to populate the `names_to` and - `values_to` variables (or to drop). +1. 変換されるデータ; +2. names_to\`: + の現在のカラム名で作成したい新しいカラム名; +3. value_to\`: 作成したい新しいカラム名で、 + の現在の値を格納する; +4. 
変数 `names_to` と + `values_to` に格納する(または削除する)列の名前。 -\`\`\`{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +```{r, fig.cap="`rna` データのロングピボット。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") -```` +``` -To recreate `rna_long` from `rna_wide` we would create a key -called `sample` and value called `expression` and use all columns -except `gene` for the key variable. Here we drop `gene` column -with a minus sign. +`rna_wide` から `rna_long` を再作成するには、 +`sample` というキーと `expression` という値を作成し、キー変数には `gene` 以外のすべてのカラムを +使用する。ここでは、`gene` カラムを +マイナス記号で除外する。 -Notice how the new variable names are to be quoted here. +ここで、新しい変数名が引用符で囲まれていることに注目してください。 -```{r} -rna_long <- rna_wide %>% +```{r} +rna_long <- rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", -gene) rna_long -```` +``` -We could also have used a specification for what columns to -include. This can be useful if you have a large number of identifying -columns, and it's easier to specify what to gather than what to leave -alone. Here the `starts_with()` function can help to retrieve sample -names without having to list them all! -Another possibility would be to use the `:` operator! +また、 +どのカラムを含めるかを指定することもできた。 これは、 +識別用のカラムが多数あり、 +そのままにしておくものよりも、集めるものを指定する方が簡単な場合に便利である。 ここで、`starts_with()` 関数を使えば、 +サンプル名をすべて列挙することなく取得できる! +もう一つの可能性は `:` 演算子を使うことである! ```{r} rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", - cols = starts_with("GSM")) + cols = starts_with("GSM")) rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", GSM2545336:GSM2545380) ``` -Note that if we had missing values in the wide-format, the `NA` would be -included in the new long format. 
+ワイドフォーマットで欠損値があった場合、新しいロングフォーマットでは`NA`が +。 -Remember our previous fictive tibble containing missing values: +前回の欠損値を含む架空のティブルを思い出してほしい: ```{r} rna_with_missing_values -wide_with_NA <- rna_with_missing_values %>% +wide_with_NA<- rna_with_missing_values %>% pivot_wider(names_from = sample, values_from = expression) wide_with_NA @@ -712,23 +703,23 @@ wide_with_NA %>% -gene) ``` -Pivoting to wider and longer formats can be a useful way to balance out a dataset -so every replicate has the same composition. +より幅の広い、より長いフォーマットへの移行は、データセットのバランスをとるのに有効な方法である。 +、どの複製も同じ構成になる。 ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## 質問 -Starting from the rna table, use the `pivot_wider()` function to create -a wide-format table giving the gene expression levels in each mouse. -Then use the `pivot_longer()` function to restore a long-format table. +rnaテーブルから始めて、`pivot_wider()`関数を使用して、 +、各マウスの遺伝子発現レベルを示すワイドフォーマットのテーブルを作成する。 +そして、`pivot_longer()`関数を使って、ロングフォーマットの表を復元する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, answer=TRUE, purl=TRUE} -rna1 <- rna %>% +rna1<- rna %>% select(gene, mouse, expression) %>% pivot_wider(names_from = mouse, values_from = expression) rna1 @@ -743,48 +734,46 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## 質問 -Subset genes located on X and Y chromosomes from the `rna` data frame and -spread the data frame with `sex` as columns, `chromosome_name` as -rows, and the mean expression of genes located in each chromosome as the values, -as in the following tibble: +rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\` を +行、各染色体に位置する遺伝子の平均発現量を値として、 +以下のようにデータフレームを広げる: ```{r, echo=FALSE, message=FALSE} knitr::include_graphics("fig/Exercise_pivot_W.png") ``` -You will need to summarise before reshaping! +整形する前にまとめる必要がある! 
-::::::::::::::: solution +:::::::::::::::::::: solution -## Solution +## ソリューション -Let's first calculate the mean expression level of X and Y linked genes from -male and female samples... +まず、 +男性と女性のサンプルから、XとYの連鎖遺伝子の平均発現量を計算してみよう... ```{r} rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% - summarise(mean = mean(expression)) + summarise(mean = mean(expression)) ``` -And pivot the table to wide format +そして、表をワイドフォーマットにピボットする ```{r, answer=TRUE, purl=TRUE} -rna_1 <- rna %>% +rna_1 <- rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% - summarise(mean = mean(expression)) %>% + summarise(mean = mean(expression)) %>% pivot_wider(names_from = sex, values_from = mean) rna_1 ``` -Now take that data frame and transform it with `pivot_longer()` so -each row is a unique `chromosome_name` by `gender` combination. +各行が一意な `chromosome_name` と `gender` の組み合わせになるように、このデータフレームを `pivot_longer()` で変換する。 ```{r, answer=TRUE, purl=TRUE} rna_1 %>% @@ -800,17 +789,17 @@ rna_1 %>% ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## 質問 -Use the `rna` dataset to create an expression matrix where each row -represents the mean expression levels of genes and columns represent -the different timepoints. +`rna`データセットを使って、 +各行が遺伝子の平均発現量を表し、 +各列が異なるタイムポイントを表す発現行列を作成する。 -::::::::::::::: solution +:::::::::::::::::::: solution -## Solution +## ソリューション -Let's first calculate the mean expression by gene and by time +まず、遺伝子別、時間別の平均発現量を計算してみよう。 ```{r} rna %>% @@ -818,10 +807,10 @@ rna %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot\_wider() function +pivot_wider()関数を使用する前に ```{r} -rna_time <- rna %>% +rna_time <- rna %>% group_by(gene, time) %>% summarise(mean_exp = mean(expression)) %>% pivot_wider(names_from = time, @@ -829,20 +818,20 @@ rna_time <- rna %>% rna_time ``` -Notice that this generates a tibble with some column names starting by a number.
-If we wanted to select the column corresponding to the timepoints, -we could not use the column names directly... What happens when we select the column 4? +これにより、数字で始まるカラム名を持つティブルが生成されることに注意。 +タイムポイントに対応するカラムを選択したい場合、 +カラム名を直接使うことはできない。 列4を選択するとどうなるか? ```{r} rna %>% group_by(gene, time) %>% - summarise(mean_exp = mean(expression)) %>% + summarise(mean_exp = mean(expression)) %>% pivot_wider(names_from = time, values_from = mean_exp) %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\\`" +タイムポイント4を選択するには、バッククォート "\\`" でカラム名を囲まなければならない。 ```{r} rna %>% @@ -853,13 +842,12 @@ rna %>% select(gene, `4`) ``` -Another possibility would be to rename the column, -choosing a name that doesn't start by a number : +もう一つの方法は、数字で始まらない名前を選んでカラム名を変更することである: ```{r} rna %>% group_by(gene, time) %>% - summarise(mean_exp = mean(expression)) %>% + summarise(mean_exp = mean(expression)) %>% pivot_wider(names_from = time, values_from = mean_exp) %>% rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% @@ -872,31 +860,31 @@ rna %>% ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## 質問 -Use the previous data frame containing mean expression levels per timepoint and create -a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes -between timepoint 8 and timepoint 4. -Convert this table into a long-format table gathering the fold-changes calculated.
+タイムポイントごとの平均発現レベルを含む前のデータフレームを使用し、 +、タイムポイント8とタイムポイント0の間のfold-changes、およびタイムポイント8とタイムポイント4の間のfold-changes +を含む新しい列を作成する。 +この表を、計算されたフォールド・チェンジを集めたロングフォーマットの表に変換する。 -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション -Starting from the rna\_time tibble: +rna_time tibbleから開始する: ```{r} rna_time ``` -Calculate fold-changes: +フォールドチェンジを計算する: ```{r} rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot\_longer() function: +そして、pivot_longer()関数を使用する: ```{r} rna_time %>% @@ -910,41 +898,40 @@ rna_time %>% :::::::::::::::::::::::::::::::::::::::::::::::::: -## Joining tables +## テーブルの結合 -In many real life situations, data are spread across multiple tables. -Usually this occurs because different types of information are -collected from different sources. +実生活の多くの場面で、データは複数のテーブルにまたがっている。 +通常このようなことが起こるのは、異なる情報源から異なるタイプの情報が +収集されるからである。 -It may be desirable for some analyses to combine data from two or more -tables into a single data frame based on a column that would be common -to all the tables. +分析によっては、2つ以上のテーブル( +)のデータを、すべてのテーブルに共通するカラム( +)に基づいて1つのデータフレームにまとめることが望ましい場合がある。 -The `dplyr` package provides a set of join functions for combining two -data frames based on matches within specified columns. Here, we -provide a short introduction to joins. For further reading, please -refer to the chapter about table -joins. The -Data Transformation Cheat -Sheet -also provides a short overview on table joins. +dplyr\` パッケージは、指定されたカラム内のマッチに基づいて、2つの +データフレームを結合するための結合関数のセットを提供する。 ここでは、 +、結合について簡単に紹介する。 詳しくは、 +テーブル +ジョインの章を参照されたい。 +データ変換チート +シート +、テーブル結合に関する簡単な概要も提供している。 -We are going to illustrate join using a small table, `rna_mini` that -we will create by subsetting the original `rna` table, keeping only 3 -columns and 10 lines. 
+、元の`rna`テーブルをサブセットして作成し、 +、3つのカラムと10行だけを残す。 ```{r} -rna_mini <- rna %>% +rna_mini<- rna %>% select(gene, sample, expression) %>% head(10) rna_mini ``` -The second table, `annot1`, contains 2 columns, gene and -gene\_description. You can either +2番目のテーブル`annot1`には、遺伝子と +gene_descriptionの2つのカラムがある。 [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) -by clicking on the link and then moving it to the `data/` folder, or -you can use the R code below to download it directly to the folder. +リンクをクリックして`data/`フォルダに移動するか、 +以下のRコードを使って直接フォルダにダウンロードすることができる。 ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", @@ -953,23 +940,22 @@ annot1 <- read_csv(file = "data/annot1.csv") annot1 ``` -We now want to join these two tables into a single one containing all -variables using the `full_join()` function from the `dplyr` package. The -function will automatically find the common variable to match columns -from the first and second table. In this case, `gene` is the common -variable. Such variables are called keys. Keys are used to match -observations across different tables. +ここで、`dplyr` パッケージの `full_join()` 関数を使用して、これら2つのテーブルを、すべての +変数を含む1つのテーブルに結合したいと思います。 +関数は、最初のテーブルと2番目のテーブルの列 +に一致する共通変数を自動的に見つける。 この場合、`gene`は共通の +。 このような変数をキーと呼ぶ。 キーは、 +オブザベーションを異なるテーブル間でマッチさせるために使用される。 ```{r} full_join(rna_mini, annot1) ``` -In real life, gene annotations are sometimes labelled differently. +実生活では、遺伝子アノテーションのラベルが異なることがある。 -The `annot2` table is exactly the same than `annot1` except that the -variable containing gene names is labelled differently. Again, either -[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) -yourself and move it to `data/` or use the R code below. 
+annot2`テーブルは、遺伝子名を含む +変数のラベルが異なる以外は、`annot1`と全く同じである。 この場合も、 [download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +、自分で`data/\`に移動するか、以下のRコードを使う。 ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", @@ -978,70 +964,68 @@ annot2 <- read_csv(file = "data/annot2.csv") annot2 ``` -In case none of the variable names match, we can set manually the -variables to use for the matching. These variables can be set using -the `by` argument, as shown below with `rna_mini` and `annot2` tables. +どの変数名も一致しない場合、マッチングに使用する +変数を手動で設定することができる。 これらの変数は、`rna_mini` と `annot2` テーブルを使用して以下に示すように、 +`by` 引数を使用して設定することができる。 ```{r} full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) ``` -As can be seen above, the variable name of the first table is retained -in the joined one. +上で見たように、最初のテーブルの変数名は、結合されたテーブルでも +。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## チャレンジだ: -Download the `annot3` table by clicking -[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) -and put the table in your data/ repository. Using the `full_join()` -function, join tables `rna_mini` and `annot3`. What has happened for -genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_ ? +[こちら](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +をクリックして `annot3` テーブルをダウンロードし、そのテーブルをあなたの data/ リポジトリに置いてください。 full_join()`関数を使用して、テーブル`rna_mini`と`annot3\` を結合する。 +、遺伝子_Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_、_mt-Tl1_はどうなったのか? -::::::::::::::: solution +:::::::::::::::::::: 解決策 -## Solution +## ソリューション ```{r, message=FALSE} annot3 <- read_csv("data/annot3.csv") full_join(rna_mini, annot3) ``` -Genes _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, -_mt-Rnr2_, and _mt-Tl1_ are only present in `annot3` table. Their respective values for the -variables of the table have been encoded as missing. 
+遺伝子_Klk6_は`rna_mini`にのみ存在し、遺伝子_mt-Tf_、_mt-Rnr1_、_mt-Tv_、 +_mt-Rnr2_、_mt-Tl1_は`annot3`テーブルにのみ存在する。 表の +変数のそれぞれの値は、欠損として符号化されている。 ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -## Exporting data +## データのエクスポート -Now that you have learned how to use `dplyr` to extract information from -or summarise your raw data, you may want to export these new data sets to share -them with your collaborators or for archival. +dplyr\`を使って、 +から情報を抽出したり、生データを要約したりする方法を学んだので、これらの新しいデータセットをエクスポートして、 +を共同研究者と共有したり、アーカイブしたりしたいと思うかもしれない。 -Similar to the `read_csv()` function used for reading CSV files into R, there is -a `write_csv()` function that generates CSV files from data frames. +RにCSVファイルを読み込むために使用される `read_csv()` 関数と同様に、 +、データフレームからCSVファイルを生成する `write_csv()` 関数があります。 -Before using `write_csv()`, we are going to create a new folder, `data_output`, -in our working directory that will store this generated dataset. We don't want -to write generated datasets in the same directory as our raw data. -It's good practice to keep them separate. The `data` folder should only contain -the raw, unaltered data, and should be left alone to make sure we don't delete -or modify it. In contrast, our script will generate the contents of the `data_output` -directory, so even if the files it contains are deleted, we can always -re-generate them. +write_csv()`を使う前に、生成されたデータセットを格納する新しいフォルダ `data_output` +を作業ディレクトリに作成する。 +、生成されたデータセットを生データと同じディレクトリに書き込みたくない。 +別々にするのは良い習慣だ。 data`フォルダーには、 +、変更されていない生のデータだけを入れておく。 +、削除したり変更したりしないように、そのままにしておく。 対照的に、このスクリプトは`data_output` +ディレクトリの内容を生成するので、そこに含まれるファイルが削除されても、 +再生成することができる。 -Let's use `write_csv()` to save the rna\_wide table that we have created previously. 
+write_csv()\`を使用して、以前に作成したrna_wideテーブルを保存しよう。 ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: keypoints +::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント -- Tabular data in R using the tidyverse meta-package +- tidyverseメタパッケージを使用したRでの表形式データ :::::::::::::::::::::::::::::::::::::::::::::::::: From bf75b271dbb84800c64da0abab9da5d679ebd613 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:29:42 +0900 Subject: [PATCH 168/334] New translations 10-data-organisation.md (Spanish) --- locale/es/episodes/10-data-organisation.Rmd | 1158 +++++++++---------- 1 file changed, 579 insertions(+), 579 deletions(-) diff --git a/locale/es/episodes/10-data-organisation.Rmd b/locale/es/episodes/10-data-organisation.Rmd index de1b53e0e..5d672156a 100644 --- a/locale/es/episodes/10-data-organisation.Rmd +++ b/locale/es/episodes/10-data-organisation.Rmd @@ -10,294 +10,294 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objetivos -- Learn about spreadsheets, their strengths and weaknesses. -- How do we format data in spreadsheets for effective data use? -- Learn about common spreadsheet errors and how to correct them. -- Organise your data according to tidy data principles. -- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. +- Aprenda sobre las hojas de cálculo, sus fortalezas y debilidades. +- ¿Cómo damos formato a los datos en hojas de cálculo para un uso eficaz de los datos? +- Obtenga información sobre los errores comunes de las hojas de cálculo y cómo corregirlos. +- Organice sus datos de acuerdo con los principios de datos ordenados. +- Obtenga información sobre los formatos de hojas de cálculo basados en texto, como los formatos separados por comas (CSV) o separados por tabulaciones (TSV). 
:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- How to organise tabular data? +- ¿Cómo organizar datos tabulares? :::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Este episodio se basa en la lección _Análisis de datos y +> Visualización en R para ecologistas_ de Data Carpentries. -## Spreadsheet programs +## Programas de hoja de cálculo -**Question** +**Pregunta** -- What are basic principles for using spreadsheets for good data - organization? +- ¿Cuáles son los principios básicos para usar hojas de cálculo para una buena organización de datos + ? -**Objective** +**Objetivo** -- Describe best practices for organizing data so computers can make - the best use of datasets. +- Describir las mejores prácticas para organizar datos para que las computadoras puedan hacer + el mejor uso de los conjuntos de datos. -**Keypoint** +**Punto clave** -- Good data organization is the foundation of any research project. +- Una buena organización de los datos es la base de cualquier proyecto de investigación. -Good data organization is the foundation of your research -project. Most researchers have data or do data entry in -spreadsheets. Spreadsheet programs are very useful graphical -interfaces for designing data tables and handling very basic data -quality control functions. See also @Broman:2018. +Una buena organización de los datos es la base de su proyecto de investigación +. La mayoría de los investigadores tienen datos o los ingresan en +hojas de cálculo. Los programas de hojas de cálculo son interfaces gráficas +muy útiles para diseñar tablas de datos y manejar datos muy básicos +funciones de control de calidad. Ver también @Broman:2018. -### Spreadsheet outline +### Esquema de hoja de cálculo -Spreadsheets are good for data entry. Therefore we have a lot of data -in spreadsheets. 
Much of your time as a researcher will be spent in -this 'data wrangling' stage. It's not the most fun, but it's -necessary. We'll teach you how to think about data organization and -some practices for more effective data wrangling. +Las hojas de cálculo son buenas para la entrada de datos. Por lo tanto, tenemos muchos datos +en hojas de cálculo. Gran parte de su tiempo como investigador lo dedicará a +esta etapa de "disputación de datos". No es lo más divertido, pero es +necesario. Le enseñaremos cómo pensar en la organización de datos y +algunas prácticas para una gestión de datos más eficaz. -### What this lesson will not teach you +### Lo que esta lección no te enseñará -- How to do _statistics_ in a spreadsheet -- How to do _plotting_ in a spreadsheet -- How to _write code_ in spreadsheet programs +- Cómo hacer _estadísticas_ en una hoja de cálculo +- Cómo hacer _trazar_ en una hoja de cálculo +- Cómo _escribir código_ en programas de hojas de cálculo -If you're looking to do this, a good reference is Head First +Si desea hacer esto, una buena referencia es Head First Excel, -published by O'Reilly. +publicado por O'Reilly. -### Why aren't we teaching data analysis in spreadsheets +### ¿Por qué no enseñamos análisis de datos en hojas de cálculo? -- Data analysis in spreadsheets usually requires a lot of manual - work. If you want to change a parameter or run an analysis with a - new dataset, you usually have to redo everything by hand. (We do - know that you can create macros, but see the next point.) +- El análisis de datos en hojas de cálculo suele requerir mucho trabajo manual + . Si desea cambiar un parámetro o ejecutar un análisis con un + nuevo conjunto de datos, normalmente tendrá que rehacer todo a mano. (Sabemos + que puedes crear macros, pero mira el siguiente punto). 
-- It is also difficult to track or reproduce statistical or plotting - analyses done in spreadsheet programs when you want to go back to - your work or someone asks for details of your analysis. +- También es difícil rastrear o reproducir análisis estadísticos o gráficos + realizados en programas de hojas de cálculo cuando desea volver a + su trabajo o alguien solicita detalles de su análisis. -Many spreadsheet programs are available. Since most participants -utilise Excel as their primary spreadsheet program, this lesson will -make use of Excel examples. A free spreadsheet program that can also -be used is LibreOffice. Commands may differ a bit between programs, -but the general idea is the same. +Hay muchos programas de hojas de cálculo disponibles. Dado que la mayoría de los participantes +utilizan Excel como su programa principal de hoja de cálculo, esta lección +utilizará ejemplos de Excel. Un programa de hoja de cálculo gratuito que también se puede utilizar +es LibreOffice. Los comandos pueden diferir un poco entre programas, +pero la idea general es la misma. -Spreadsheet programs encompass a lot of the things we need to be able -to do as researchers. We can use them for: +Los programas de hojas de cálculo abarcan muchas de las cosas que necesitamos poder hacer +como investigadores. Podemos usarlos para: -- Data entry -- Organizing data -- Subsetting and sorting data -- Statistics -- Plotting +- Entrada de datos +- Organizar datos +- Subcomponer y ordenar datos +- Estadísticas +- Graficado -Spreadsheet programs use tables to represent and display data. Data -formatted as tables is also the main theme of this chapter, and we -will see how to organise data into tables in a standardised way to -ensure efficient downstream analysis. +Los programas de hojas de cálculo utilizan tablas para representar y mostrar datos. 
Los datos +formateados como tablas también son el tema principal de este capítulo, y +veremos cómo organizar los datos en tablas de una manera estandarizada para +garantizar un análisis posterior eficiente. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Discuss the following points with your neighbour +## Desafío: discute los siguientes puntos con tu vecino. -- Have you used spreadsheets, in your research, courses, - or at home? -- What kind of operations do you do in spreadsheets? -- Which ones do you think spreadsheets are good for? -- Have you accidentally done something in a spreadsheet program that made you - frustrated or sad? +- ¿Has utilizado hojas de cálculo en tus investigaciones, cursos, + o en casa? +- ¿Qué tipo de operaciones hacéis en hojas de cálculo? +- ¿Para cuáles crees que son buenas las hojas de cálculo? +- ¿Has hecho accidentalmente algo en un programa de hoja de cálculo que te hizo + frustrado o triste? :::::::::::::::::::::::::::::::::::::::::::::::::: -### Problems with spreadsheets +### Problemas con las hojas de cálculo -Spreadsheets are good for data entry, but in reality we tend to -use spreadsheet programs for much more than data entry. We use them -to create data tables for publications, to generate summary -statistics, and make figures. +Las hojas de cálculo son buenas para ingresar datos, pero en realidad tendemos a +usar programas de hojas de cálculo para mucho más que ingresar datos. Los usamos +para crear tablas de datos para publicaciones, para generar estadísticas resumidas +y hacer cifras. Generating tables for publications in a spreadsheet is not optimal - often, when formatting a data table for publication, we're reporting key summary statistics in a way that is not really meant to be read as data, and often involves special formatting -(merging cells, creating borders, making it pretty). We advise you to -do this sort of operation within your document editing software. 
+(merging cells, creating borders, making it pretty). Le recomendamos que +realice este tipo de operación dentro de su software de edición de documentos. -The latter two applications, generating statistics and figures, should -be used with caution: because of the graphical, drag and drop nature of -spreadsheet programs, it can be very difficult, if not impossible, to -replicate your steps (much less retrace anyone else's), particularly if your -stats or figures require you to do more complex calculations. Furthermore, -in doing calculations in a spreadsheet, it's easy to accidentally apply a -slightly different formula to multiple adjacent cells. When using a -command-line based statistics program like R or SAS, it's practically -impossible to apply a calculation to one observation in your -dataset but not another unless you're doing it on purpose. +Las dos últimas aplicaciones, que generan estadísticas y cifras, deben usarse +con precaución: debido a la naturaleza gráfica, de arrastrar y soltar, de los programas de hojas de cálculo +, puede ser muy difícil, si no imposible, +replica tus pasos (y mucho menos vuelve sobre los de cualquier otra persona), especialmente si tus estadísticas o cifras de +requieren que hagas cálculos más complejos. Además, +al hacer cálculos en una hoja de cálculo, es fácil aplicar accidentalmente una fórmula +ligeramente diferente a varias celdas adyacentes. Cuando se utiliza un programa de estadísticas basado en línea de comandos +como R o SAS, es prácticamente +imposible aplicar un cálculo a una observación en su conjunto de datos +pero no a otra, a menos que esté haciendo a propósito. -### Using spreadsheets for data entry and cleaning +### Uso de hojas de cálculo para la entrada y limpieza de datos. 
-In this lesson, we will assume that you are most likely using Excel as -your primary spreadsheet program - there are others (gnumeric, Calc -from OpenOffice), and their functionality is similar, but Excel seems -to be the program most used by biologists and biomedical researchers. +En esta lección, asumiremos que lo más probable es que esté utilizando Excel como +su programa de hoja de cálculo principal; hay otros (gnumeric, Calc +de OpenOffice) y su funcionalidad es similar, pero Excel parece +ser el programa más utilizado por biólogos e investigadores biomédicos. -In this lesson we're going to talk about: +En esta lección vamos a hablar de: -1. Formatting data tables in spreadsheets -2. Formatting problems -3. Exporting data +1. Formatear tablas de datos en hojas de cálculo +2. Problemas de formato +3. Exportar datos -## Formatting data tables in spreadsheets +## Formatear tablas de datos en hojas de cálculo -**Questions** +**Preguntas** -- How do we format data in spreadsheets for effective data use? +- ¿Cómo damos formato a los datos en hojas de cálculo para un uso eficaz de los datos? -**Objectives** +**Objetivos** -- Describe best practices for data entry and formatting in - spreadsheets. +- Describir las mejores prácticas para la entrada y formato de datos en + hojas de cálculo. -- Apply best practices to arrange variables and observations in a - spreadsheet. +- Aplique las mejores prácticas para organizar variables y observaciones en una hoja de cálculo + . -**Keypoints** +**Puntos clave** -- Never modify your raw data. Always make a copy before making any - changes. +- Nunca modifique sus datos sin procesar. Siempre haga una copia antes de realizar + cambios. -- Keep track of all of the steps you take to clean your data in a - plain text file. +- Realice un seguimiento de todos los pasos que sigue para limpiar sus datos en un + archivo de texto sin formato. -- Organise your data according to tidy data principles. 
+- Organice sus datos de acuerdo con los principios de datos ordenados. -The most common mistake made is treating spreadsheet programs like lab -notebooks, that is, relying on context, notes in the margin, spatial -layout of data and fields to convey information. As humans, we can -(usually) interpret these things, but computers don't view information -the same way, and unless we explain to the computer what every single -thing means (and that can be hard!), it will not be able to see how -our data fits together. +El error más común que se comete es tratar los programas de hojas de cálculo como cuadernos de laboratorio +, es decir, confiar en el contexto, las notas al margen, el diseño espacial +de los datos y los campos para transmitir información. Como humanos, podemos +(generalmente) interpretar estas cosas, pero las computadoras no ven la información +de la misma manera, y a menos que le expliquemos a la computadora qué significa cada cosa +(¡y eso puede ser difícil!), no podrá ver cómo +encajan nuestros datos. -Using the power of computers, we can manage and analyse data in much -more effective and faster ways, but to use that power, we have to set -up our data for the computer to be able to understand it (and -computers are very literal). +Usando el poder de las computadoras, podemos administrar y analizar datos de maneras mucho +más efectivas y rápidas, pero para usar ese poder, tenemos que configurar +nuestros datos para que la computadora pueda entenderlo (y +las computadoras son muy literales). -This is why it's extremely important to set up well-formatted tables -from the outset - before you even start entering data from your very -first preliminary experiment. Data organization is the foundation of -your research project. It can make it easier or harder to work with -your data throughout your analysis, so it's worth thinking about when -you're doing your data entry or setting up your experiment. 
You can -set things up in different ways in spreadsheets, but some of these -choices can limit your ability to work with the data in other programs -or have the you-of-6-months-from-now or your collaborator work with -the data. +Por eso es extremadamente importante configurar tablas bien formateadas +desde el principio, incluso antes de comenzar a ingresar datos de su +primer experimento preliminar. La organización de datos es la base de +su proyecto de investigación. Puede hacer que sea más fácil o más difícil trabajar con +sus datos a lo largo de su análisis, por lo que vale la pena pensar en cuándo +está ingresando datos o configurando su experimento. Puedes +configurar las cosas de diferentes maneras en hojas de cálculo, pero algunas de estas +opciones pueden limitar tu capacidad para trabajar con los datos en otros programas +o tener la tuya de- Dentro de 6 meses o tu colaborador trabajará con +los datos. -**Note:** the best layouts/formats (as well as software and -interfaces) for data entry and data analysis might be different. It is -important to take this into account, and ideally automate the -conversion from one to another. +**Nota:** los mejores diseños/formatos (así como el software y las +interfaces) para la entrada y el análisis de datos pueden ser diferentes. Es +importante tener esto en cuenta e idealmente automatizar la +conversión de uno a otro. -### Keeping track of your analyses +### Seguimiento de sus análisis -When you're working with spreadsheets, during data clean up or -analyses, it's very easy to end up with a spreadsheet that looks very -different from the one you started with. In order to be able to -reproduce your analyses or figure out what you did when a reviewer or -instructor asks for a different analysis, you should +Cuando trabajas con hojas de cálculo, durante la limpieza de datos o +análisis, es muy fácil terminar con una hoja de cálculo que se ve muy +diferente de la que tenías al principio. 
Para poder +reproducir tus análisis o descubrir qué hiciste cuando un revisor o +instructor te pide un análisis diferente, debes -- create a new file with your cleaned or analysed data. Don't modify - the original dataset, or you will never know where you started! +- cree un nuevo archivo con sus datos limpios o analizados. ¡No modifiques + el conjunto de datos original o nunca sabrás por dónde empezaste! -- keep track of the steps you took in your clean up or analysis. You - should track these steps as you would any step in an experiment. We - recommend that you do this in a plain text file stored in the same - folder as the data file. +- Lleve un registro de los pasos que siguió en su limpieza o análisis. Usted + debe realizar un seguimiento de estos pasos como lo haría con cualquier paso de un experimento. + recomendamos que haga esto en un archivo de texto sin formato almacenado en la misma carpeta + que el archivo de datos. -This might be an example of a spreadsheet setup: +Este podría ser un ejemplo de configuración de una hoja de cálculo: ![](fig/spreadsheet-setup-updated.png) -Put these principles in to practice today during your exercises. +Pon estos principios en práctica hoy durante tus ejercicios. -While versioning is out of scope for this course, you can look at the -Carpentries lesson on -['Git'](https://swcarpentry.github.io/git-novice/) to learn how to -maintain **version control** over your data. See also this blog -post for a quick tutorial or -@Perez-Riverol:2016 for a more research-oriented use-case. +Si bien el control de versiones está fuera del alcance de este curso, puede consultar la lección +Carpentries en +['Git'](https://swcarpentry.github.io/git-novice/) para aprenda cómo +mantener **control de versiones** sobre sus datos. Vea también este blog +post para un tutorial rápido o +@Perez-Riverol:2016 para un tutorial más orientado a la investigación. caso de uso. 
-### Structuring data in spreadsheets +### Estructurar datos en hojas de cálculo. -The cardinal rules of using spreadsheet programs for data: +Las reglas fundamentales para el uso de programas de hojas de cálculo para datos: -1. Put all your variables in columns - the thing you're measuring, - like 'weight' or 'temperature'. -2. Put each observation in its own row. -3. Don't combine multiple pieces of information in one cell. Sometimes - it just seems like one thing, but think if that's the only way - you'll want to be able to use or sort that data. -4. Leave the raw data raw - don't change it! -5. Export the cleaned data to a text-based format like CSV - (comma-separated values) format. This ensures that anyone can use - the data, and is required by most data repositories. +1. Coloque todas sus variables en columnas: lo que está midiendo, + como 'peso' o 'temperatura'. +2. Coloque cada observación en su propia fila. +3. No combine varias piezas de información en una celda. A veces + simplemente parece una cosa, pero piensa si esa es la única manera + en la que querrás poder usar u ordenar esos datos. +4. Deje los datos sin procesar, ¡no los cambie! +5. Exporte los datos limpios a un formato basado en texto como CSV + (valores separados por comas). Esto garantiza que cualquiera pueda usar + los datos y es requerido por la mayoría de los repositorios de datos. -For instance, we have data from patients that visited several -hospitals in Brussels, Belgium. They recorded the date of the visit, -the hospital, the patients' gender, weight and blood group. +Por ejemplo, tenemos datos de pacientes que visitaron varios +hospitales en Bruselas, Bélgica. Registraron la fecha de la visita, +el hospital, el sexo, el peso y el grupo sanguíneo de los pacientes. 
-If we were to keep track of the data like this: +Si tuviéramos que realizar un seguimiento de los datos como este: ![](fig/multiple-info.png) -the problem is that the ABO and Rhesus groups are in the same `Blood` -type column. So, if they wanted to look at all observations of the A -group or look at weight distributions by ABO group, it would be tricky -to do this using this data setup. If instead we put the ABO and Rhesus -groups in different columns, you can see that it would be much easier. +el problema es que los grupos ABO y Rhesus están en la misma columna de tipo `Blood` +. Entonces, si quisieran ver todas las observaciones del grupo A +o ver las distribuciones de peso por grupo ABO, sería complicado +hacerlo usando esta configuración de datos. Si en cambio ponemos los grupos ABO y Rhesus +en columnas diferentes, puedes ver que sería mucho más fácil. ![](fig/single-info.png) -An important rule when setting up a datasheet, is that **columns are -used for variables** and **rows are used for observations**: +Una regla importante al configurar una hoja de datos es que **las columnas se usan +para las variables** y **las filas se usan para las observaciones**: -- columns are variables -- rows are observations -- cells are individual values +- las columnas son variables +- las filas son observaciones +- las celdas son valores individuales ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: We're going to take a messy dataset and describe how we would clean it up. +## Desafío: tomaremos un conjunto de datos desordenado y describiremos cómo lo limpiaríamos. -1. Download a messy dataset by clicking - [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). +1. Descargue un conjunto de datos desordenado haciendo clic en + [aquí](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). -2. Open up the data in a spreadsheet program. +2. Abra los datos en un programa de hoja de cálculo. -3. 
You can see that there are two tabs. The data contains various - clinical variables recorded in various hospitals in Brussels during - the first and second COVID-19 waves in 2020. As you can see, the - data have been recorded differently during the March and November - waves. Now you're the person in charge of this project and you want - to be able to start analyzing the data. +3. Puedes ver que hay dos pestañas. Los datos contienen varias + variables clínicas registradas en varios hospitales de Bruselas durante + la primera y segunda oleada de COVID-19 en 2020. Como puedes ver, los datos + se registraron de manera diferente durante las oleadas + de marzo y noviembre. Ahora eres la persona a cargo de este proyecto y quieres + poder comenzar a analizar los datos. -4. With the person next to you, identify what is wrong with this - spreadsheet. Also discuss the steps you would need to take to clean - up first and second wave tabs, and to put them all together in one - spreadsheet. +4. Con la persona a tu lado, identifica qué está mal en esta hoja de + cálculo. También analiza los pasos que necesitarías seguir para limpiar + las pestañas de la primera y segunda oleada, y juntarlas todas en una + hoja de cálculo. -**Important:** Do not forget our first piece of advice: to create a -new file (or tab) for the cleaned data, never modify your original -(raw) data. +**Importante:** No olvides nuestro primer consejo: crea un +nuevo archivo (o pestaña) para los datos limpios y nunca modifiques tus datos originales +(sin procesar). :::::::::::::::::::::::::::::::::::::::::::::::::: -After you go through this exercise, we'll discuss as a group what was -wrong with this data and how you would fix it. +Después de realizar este ejercicio, discutiremos en grupo qué +estaba mal con estos datos y cómo solucionarlo. <!-- - Take about 10 minutes to work on this exercise. --> @@ -319,45 +319,45 @@ wrong with this data and how you would fix it.
::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Once you have tidied up the data, answer the following questions: +## Desafío: una vez que haya ordenado los datos, responda las siguientes preguntas: -- How many men and women took part in the study? -- How many A, AB, and B types have been tested? -- As above, but disregarding the contaminated samples? -- How many Rhesus + and - have been tested? -- How many universal donors (O-) have been tested? -- What is the average weight of AB men? -- How many samples have been tested in the different hospitals? +- ¿Cuántos hombres y mujeres participaron en el estudio? +- ¿Cuántos tipos A, AB y B se han probado? +- ¿Como arriba, pero sin tener en cuenta las muestras contaminadas? +- ¿Cuántos Rhesus + y - se han probado? +- ¿Cuántos donantes universales (O-) se han probado? +- ¿Cuál es el peso promedio de los hombres AB? +- ¿Cuántas muestras se han analizado en los diferentes hospitales? :::::::::::::::::::::::::::::::::::::::::::::::::: -An **excellent reference**, in particular with regard to R scripting -is the _Tidy Data_ paper @Wickham:2014. +Una **excelente referencia**, en particular con respecto a las secuencias de comandos R +es el artículo _Tidy Data_ @Wickham:2014. -## Common spreadsheet errors +## Errores comunes en las hojas de cálculo -**Questions** +**Preguntas** -- What are some common challenges with formatting data in spreadsheets - and how can we avoid them? +- ¿Cuáles son algunos de los desafíos comunes al formatear datos en hojas de cálculo + y cómo podemos evitarlos? -**Objectives** +**Objetivos** -- Recognise and resolve common spreadsheet formatting problems. +- Reconocer y resolver problemas comunes de formato de hojas de cálculo. -**Keypoints** +**Puntos clave** -- Avoid using multiple tables within one spreadsheet. -- Avoid spreading data across multiple tabs. -- Record zeros as zeros. -- Use an appropriate null value to record missing data. 
-- Don't use formatting to convey information or to make your spreadsheet look pretty. -- Place comments in a separate column. -- Record units in column headers. -- Include only one piece of information in a cell. -- Avoid spaces, numbers and special characters in column headers. -- Avoid special characters in your data. -- Record metadata in a separate plain text file. +- Evite el uso de varias tablas dentro de una hoja de cálculo. +- Evite distribuir datos en varias pestañas. +- Registre los ceros como ceros. +- Utilice un valor nulo apropiado para registrar los datos faltantes. +- No utilices el formato para transmitir información o hacer que tu hoja de cálculo luzca bonita. +- Coloque los comentarios en una columna separada. +- Registre las unidades en los encabezados de las columnas. +- Incluya solo una pieza de información en una celda. +- Evite espacios, números y caracteres especiales en los encabezados de las columnas. +- Evite caracteres especiales en sus datos. +- Registre los metadatos en un archivo de texto sin formato separado. <!-- This lesson is meant to be used as a reference for discussion as --> @@ -367,376 +367,376 @@ is the _Tidy Data_ paper @Wickham:2014. <!-- refer to responses to the exercise in the previous lesson. --> -There are a few potential errors to be on the lookout for in your own -data as well as data from collaborators or the Internet. If you are -aware of the errors and the possible negative effect on downstream -data analysis and result interpretation, it might motivate yourself -and your project members to try and avoid them. Making small changes -to the way you format your data in spreadsheets, can have a great -impact on efficiency and reliability when it comes to data cleaning -and analysis. 
- -- [Using multiple tables](#tables) -- [Using multiple tabs](#tabs) -- [Not filling in zeros](#zeros) -- [Using problematic null values](#null) -- [Using formatting to convey information](#formatting) -- [Using formatting to make the data sheet look pretty](#formatting_pretty) -- [Placing comments or units in cells](#units) -- [Entering more than one piece of information in a cell](#info) -- [Using problematic field names](#field_name) -- [Using special characters in data](#special) -- [Inclusion of metadata in data table](#metadata) - -### Using multiple tables {#tables} - -A common strategy is creating multiple data tables within one -spreadsheet. This confuses the computer, so don't do this! When you -create multiple tables within one spreadsheet, you're drawing false -associations between things for the computer, which sees each row as -an observation. You're also potentially using the same field name in -multiple places, which will make it harder to clean your data up into -a usable form. The example below depicts the problem: +Hay algunos errores potenciales a los que debe prestar atención en sus propios +datos, así como en los datos de sus colaboradores o de Internet. Si es +consciente de los errores y del posible efecto negativo en el análisis de datos y la interpretación de +resultados posteriores, eso podría motivarlo a usted +y a los miembros de su proyecto a intentar evitarlos. Hacer pequeños cambios +en la forma en que formatea sus datos en las hojas de cálculo puede tener un gran +impacto en la eficiencia y confiabilidad cuando se trata de limpieza y análisis de datos +.
+ +- [Usando varias tablas](#tables) +- [Usar varias pestañas](#tabs) +- [Sin completar ceros](#zeros) +- [Usando valores nulos problemáticos](#null) +- [Usar formato para transmitir información](#formatting) +- [Usar formato para que la hoja de datos se vea bonita](#formatting_pretty) +- [Colocar comentarios o unidades en celdas](#units) +- [Ingresar más de un dato en una celda](#info) +- [Usando nombres de campos problemáticos](#field_name) +- [Usar caracteres especiales en los datos](#special) +- [Inclusión de metadatos en tabla de datos](#metadata) + +### Usando múltiples tablas {#tables} + +Una estrategia común es crear varias tablas de datos dentro de una +hoja de cálculo. Esto confunde a la computadora, ¡así que no hagas esto! Cuando +creas varias tablas dentro de una hoja de cálculo, estás estableciendo +asociaciones falsas entre cosas para la computadora, que ve cada fila como +una observación. También estás potencialmente usando el mismo nombre de campo en +varios lugares, lo que hará que sea más difícil limpiar tus datos y darles +una forma utilizable. El siguiente ejemplo muestra el problema: ![](fig/2_datasheet_example.jpg) -In the example above, the computer will see (for example) row 4 and -assume that all columns A-AF refer to the same sample. This row -actually represents four distinct samples (sample 1 for each of four -different collection dates - May 29th, June 12th, June 19th, and June -26th), as well as some calculated summary statistics (an average (avr) -and standard error of measurement (SEM)) for two of those -samples. Other rows are similarly problematic. +En el ejemplo anterior, la computadora verá (por ejemplo) la fila 4 y +asumirá que todas las columnas A-AF se refieren a la misma muestra.
Esta fila +en realidad representa cuatro muestras distintas (muestra 1 para cada una de las cuatro +fechas de recolección diferentes: 29 de mayo, 12 de junio, 19 de junio y +26 de junio), así como algunas estadísticas resumidas calculadas (un promedio (avr) +y un error estándar de medición (SEM)) para dos de esas +muestras. Otras filas son igualmente problemáticas. -### Using multiple tabs {#tabs} +### Usando múltiples pestañas {#tabs} -But what about workbook tabs? That seems like an easy way to organise -data, right? Well, yes and no. When you create extra tabs, you fail to -allow the computer to see connections in the data that are there (you -have to introduce spreadsheet application-specific functions or -scripting to ensure this connection). Say, for instance, you make a -separate tab for each day you take a measurement. +Pero ¿qué pasa con las pestañas del libro de trabajo? Parece una manera fácil de organizar +datos, ¿verdad? Bueno, sí y no. Cuando crea pestañas adicionales, no +permite que la computadora vea las conexiones en los datos que están allí (usted +tiene que introducir funciones específicas de la aplicación de hoja de cálculo o +secuencias de comandos para garantizar esta conexión). Digamos, por ejemplo, que creas una +pestaña separada para cada día que tomas una medición. -This isn't good practice for two reasons: +Esta no es una buena práctica por dos razones: -1. you are more likely to accidentally add inconsistencies to your - data if each time you take a measurement, you start recording data - in a new tab, and +1. es más probable que accidentalmente agregue inconsistencias a sus + datos si cada vez que toma una medición, comienza a registrar datos + en una nueva pestaña, y 2. even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyse the data because you will have to combine these data into a single - datatable. 
You will have to explicitly tell the computer how to - combine tabs - and if the tabs are inconsistently formatted, you - might even have to do it manually. - -The next time you're entering data, and you go to create another tab -or table, ask yourself if you could avoid adding this tab by adding -another column to your original spreadsheet. We used multiple tabs in -our example of a messy data file, but now you've seen how you can -reorganise your data to consolidate across tabs. - -Your data sheet might get very long over the course of the -experiment. This makes it harder to enter data if you can't see your -headers at the top of the spreadsheet. But don't repeat your header -row. These can easily get mixed into the data, leading to problems -down the road. Instead you can freeze the column -headers -so that they remain visible even when you have a spreadsheet with many -rows. - -### Not filling in zeros {#zeros} - -It might be that when you're measuring something, it's usually a zero, -say the number of times a rabbit is observed in the survey. Why bother -writing in the number zero in that column, when it's mostly zeros? - -However, there's a difference between a zero and a blank cell in a -spreadsheet. To the computer, a zero is actually data. You measured or -counted it. A blank cell means that it wasn't measured and the -computer will interpret it as an unknown value (also known as a null -or missing value). - -The spreadsheets or statistical programs will likely misinterpret -blank cells that you intend to be zeros. By not entering the value of -your observation, you are telling your computer to represent that data -as unknown or missing (null). This can cause problems with subsequent -calculations or analyses. For example, the average of a set of numbers -which includes a single null value is always null (because the -computer can't guess the value of the missing observations). 
Because -of this, it's very important to record zeros as zeros and truly -missing data as nulls. - -### Using problematic null values {#null} - -**Example**: using -999 or other numerical values (or zero) to -represent missing data. - -**Solutions**: - -There are a few reasons why null values get represented differently -within a dataset. Sometimes confusing null values are automatically -recorded from the measuring device. If that's the case, there's not -much you can do, but it can be addressed in data cleaning with a tool -like -[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) -before analysis. Other times different null values are used to convey -different reasons why the data isn't there. This is important -information to capture, but is in effect using one column to capture -two pieces of information. Like for using formatting to convey + datatable. Tendrás que decirle explícitamente a la computadora cómo + combinar pestañas, y si las pestañas tienen un formato inconsistente, es posible que + incluso tengas que hacerlo manualmente. + +La próxima vez que ingrese datos y vaya a crear otra pestaña +o tabla, pregúntese si podría evitar agregar esta pestaña agregando +otra columna a su hoja de cálculo original. Usamos varias pestañas en +nuestro ejemplo de un archivo de datos desordenado, pero ahora has visto cómo puedes +reorganizar tus datos para consolidarlos entre pestañas. + +Su hoja de datos puede volverse muy larga durante el transcurso del experimento +. Esto hace que sea más difícil ingresar datos si no puedes ver los encabezados +en la parte superior de la hoja de cálculo. Pero no repita la fila del encabezado +. Estos pueden mezclarse fácilmente con los datos, generando problemas +en el futuro. 
En su lugar, puede +[congelar los encabezados de la columna](https://support.office.com/en-ca/article/Freeze-column-headings-for-easy-scrolling-57ccce0c-cf85-4725-9579-c5d13106ca6a) +para que permanezcan visibles incluso cuando tengas una hoja de cálculo con muchas +filas. + +### No completar ceros {#zeros} + +Puede ser que cuando estás midiendo algo, generalmente sea un cero, +digamos, la cantidad de veces que se observa un conejo en la encuesta. ¿Por qué molestarse +en escribir el número cero en esa columna, cuando en su mayoría son ceros? + +Sin embargo, existe una diferencia entre un cero y una celda en blanco en una hoja de +cálculo. Para la computadora, un cero son en realidad datos. Lo mediste o +lo contaste. Una celda en blanco significa que no se midió y la computadora +la interpretará como un valor desconocido (también conocido como nulo +o valor faltante). + +Es probable que las hojas de cálculo o los programas estadísticos malinterpreten +celdas en blanco que usted pretende que sean ceros. Al no ingresar el valor de +su observación, le está diciendo a su computadora que represente esos datos +como desconocidos o faltantes (nulos). Esto puede causar problemas con +cálculos o análisis posteriores. Por ejemplo, el promedio de un conjunto de números +que incluye un único valor nulo siempre es nulo (porque la computadora +no puede adivinar el valor de las observaciones faltantes). Debido a +esto, es muy importante registrar los ceros como ceros y los datos +realmente faltantes como nulos. + +### Usar valores nulos problemáticos {#null} + +**Ejemplo**: usar -999 u otros valores numéricos (o cero) para +representar datos faltantes. + +**Soluciones**: + +Hay algunas razones por las que los valores nulos se representan de manera diferente +dentro de un conjunto de datos. A veces, los valores nulos confusos se registran automáticamente +desde el dispositivo de medición.
Si ese es el caso, no hay +mucho que puedas hacer, pero se puede abordar en la limpieza de datos con una herramienta +como +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +antes del análisis. Otras veces se utilizan diferentes valores nulos para transmitir +diferentes razones por las que los datos no están ahí. Esta es +información importante para capturar, pero en realidad se utiliza una columna para capturar +dos piezas de información. Like for using formatting to convey information it would be good here to create a new column like 'data\_missing' and use that column to capture the different reasons. -Whatever the reason, it's a problem if unknown or missing data is -recorded as -999, 999, or 0. +Cualquiera sea el motivo, es un problema si datos desconocidos o faltantes se registran +como -999, 999 o 0. -Many statistical programs will not recognise that these are intended -to represent missing (null) values. How these values are interpreted -will depend on the software you use to analyse your data. It is -essential to use a clearly defined and consistent null indicator. +Muchos programas estadísticos no reconocerán que estos +pretenden representar valores faltantes (nulos). La forma en que se interpreten estos valores +dependerá del software que utilice para analizar sus datos. Es +esencial utilizar un indicador nulo claramente definido y consistente. -Blanks (most applications) and NA (for R) are good -choices. @White:2013 explain good choices for indicating null values -for different software applications in their article: +Los espacios en blanco (la mayoría de las aplicaciones) y NA (para R) son buenas opciones +.
@White:2013 explica buenas opciones para indicar valores nulos +para diferentes aplicaciones de software en su artículo: ![](fig/3_white_table_1.jpg) -### Using formatting to convey information {#formatting} +### Usar formato para transmitir información {#formatting} -**Example**: highlighting cells, rows or columns that should be -excluded from an analysis, leaving blank rows to indicate -separations in data. +**Ejemplo**: resaltar celdas, filas o columnas que deben excluirse +de un análisis, dejando filas en blanco para indicar +separaciones en los datos. -![](fig/formatting.png) +![](fig/formatting.png) -**Solution**: create a new field to encode which data should be -excluded. +**Solución**: cree un nuevo campo para codificar qué datos deben +excluirse. ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting\_pretty} +### Usar formato para que la hoja de datos se vea bonita {#formatting\_pretty} -**Example**: merging cells. +**Ejemplo**: fusionar celdas. -**Solution**: If you're not careful, formatting a worksheet to be more -aesthetically pleasing can compromise your computer's ability to see -associations in the data. Merged cells will make your data unreadable -by statistics software. Consider restructuring your data in such a way -that you will not need to merge cells to organise your data. +**Solución**: Si no tiene cuidado, formatear una hoja de cálculo para que sea más +estéticamente agradable puede comprometer la capacidad de su computadora para ver +asociaciones en los datos. Las celdas combinadas harán que sus datos sean ilegibles +para el software de estadísticas. Considere reestructurar sus datos de tal manera +que no necesite fusionar celdas para organizar sus datos. -### Placing comments or units in cells {#units} +### Colocar comentarios o unidades en celdas {#units} -Most analysis software can't see Excel or LibreOffice comments, and -would be confused by comments placed within your data cells.
As -described above for formatting, create another field if you need to -add notes to cells. Similarly, don't include units in cells: ideally, -all the measurements you place in one column should be in the same -unit, but if for some reason they aren't, create another field and -specify the units the cell is in. +La mayoría del software de análisis no puede ver los comentarios de Excel o LibreOffice, y +se confundiría con los comentarios colocados dentro de sus celdas de datos. Como +se describió anteriormente para el formato, cree otro campo si necesita +agregar notas a las celdas. De manera similar, no incluya unidades en las celdas: idealmente, +todas las medidas que coloque en una columna deberían estar en la misma +unidad, pero si por alguna razón no lo están, cree otro campo y +especifique las unidades en las que se encuentra la celda. -### Entering more than one piece of information in a cell {#info} +### Ingresar más de un dato en una celda {#info} -**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +**Ejemplo**: Registrar los grupos ABO y Rhesus en una celda, como A+, B+, A-, ... -**Solution**: Don't include more than one piece of information in a -cell. This will limit the ways in which you can analyse your data. If -you need both these measurements, design your data sheet to include -this information. For example, include one column for the ABO group and -one for the Rhesus group. - -### Using problematic field names {#field\_name} - -Choose descriptive field names, but be careful not to include spaces, -numbers, or special characters of any kind. Spaces can be -misinterpreted by parsers that use whitespace as delimiters and some -programs don't like field names that are text strings that start with -numbers. - -Underscores (`_`) are a good alternative to spaces. Consider writing -names in camel case (like this: ExampleFileName) to improve -readability.
Remember that abbreviations that make sense at the moment -may not be so obvious in 6 months, but don't overdo it with names that -are excessively long. Including the units in the field names avoids -confusion and enables others to readily interpret your fields. - -**Examples** - -| Good Name | Good Alternative | Avoid | -| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | -| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | -| Precipitation\_mm | Precipitation | precmm | -| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell\_type | CellType | Cell Type | -| Observation\_01 | first\_observation | 1st Obs | - -### Using special characters in data {#special} - -**Example**: You treat your spreadsheet program as a word processor -when writing notes, for example copying data directly from Word or -other applications. - -**Solution**: This is a common strategy. For example, when writing -longer text in a cell, people often include line breaks, em-dashes, -etc. in their spreadsheet. Also, when copying data in from -applications such as Word, formatting and fancy non-standard -characters (such as left- and right-aligned quotation marks) are -included. When exporting this data into a coding/statistical -environment or into a relational database, dangerous things may occur, -such as lines being cut in half and encoding errors being thrown. - -General best practice is to avoid adding characters such as newlines, -tabs, and vertical tabs. In other words, treat a text cell as if it -were a simple web form that can only contain text and spaces. - -### Inclusion of metadata in data table {#metadata} - -**Example**: You add a legend at the top or bottom of your data table -explaining column meaning, units, exceptions, etc. - -**Solution**: Recording data about your data ("metadata") is -essential. 
You may be on intimate terms with your dataset while you +**Solución**: No incluya más de un dato en una +celda. Esto limitará las formas en que puede analizar sus datos. Si +necesita ambas medidas, diseñe su hoja de datos para incluir +esta información. Por ejemplo, incluya una columna para el grupo ABO y +una para el grupo Rhesus. + +### Usar nombres de campos problemáticos {#field\_name} + +Elija nombres de campos descriptivos, pero tenga cuidado de no incluir espacios, números +ni caracteres especiales de ningún tipo. Los espacios pueden ser +mal interpretados por analizadores que usan espacios en blanco como delimitadores y a algunos programas +no les gustan los nombres de campos que son cadenas de texto que comienzan con +números. + +Los guiones bajos (`_`) son una buena alternativa a los espacios. Considere escribir +nombres en "camel case" (así: ExampleFileName) para mejorar la +legibilidad. Recuerde que las abreviaturas que tienen sentido en este momento +pueden no ser tan obvias en 6 meses, pero no se exceda con nombres que +sean excesivamente largos. Incluir las unidades en los nombres de los campos evita +confusión y permite que otros interpreten fácilmente sus campos. + +**Ejemplos** + +| Buen nombre | Buena alternativa | Evitar | +| -------------------------------------------------------- | ------------------------------------------ | ------------------------------------------ | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w.
| +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs | + +### Usar caracteres especiales en datos {#special} + +**Ejemplo**: Tratas tu programa de hoja de cálculo como un procesador de textos +cuando escribes notas, por ejemplo, copiando datos directamente desde Word u +otras aplicaciones. + +**Solución**: Esta es una estrategia común. Por ejemplo, al escribir +texto más largo en una celda, las personas suelen incluir saltos de línea, rayas, +etc. en su hoja de cálculo. Además, al copiar datos desde +aplicaciones como Word, se incluyen +formatos y caracteres elegantes no estándar +(como comillas alineadas a la izquierda y a la derecha). Al exportar estos datos a un entorno de codificación/estadístico +o a una base de datos relacional, pueden ocurrir cosas peligrosas, +como líneas cortadas por la mitad y errores de codificación. + +La mejor práctica general es evitar agregar caracteres como saltos de línea, tabulaciones +y tabulaciones verticales. En otras palabras, trate una celda de texto como si +fuera un formulario web simple que solo puede contener texto y espacios. + +### Inclusión de metadatos en tabla de datos {#metadata} + +**Ejemplo**: Agrega una leyenda en la parte superior o inferior de su tabla de datos +explicando el significado de las columnas, las unidades, las excepciones, etc. + +**Solución**: Registrar datos sobre sus datos ("metadatos") es +esencial. You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember that the variable "sglmemgp" means single member of group, for example, or the exact algorithm you used to transform a variable or create a derived one, after a few months, a year, or more are slim.
-As well, there are many reasons other people may want to examine or -use your data - to understand your findings, to verify your findings, -to review your submitted publication, to replicate your results, to -design a similar study, or even to archive your data for access and -re-use by others. While digital data by definition are -machine-readable, understanding their meaning is a job for human -beings. The importance of documenting your data during the collection -and analysis phase of your research cannot be overestimated, -especially if your research is going to be part of the scholarly -record. - -However, metadata should not be contained in the data file -itself. Unlike a table in a paper or a supplemental file, metadata (in -the form of legends) should not be included in a data file since this -information is not data, and including it can disrupt how computer -programs interpret your data file. Rather, metadata should be stored -as a separate file in the same directory as your data file, preferably -in plain text format with a name that clearly associates it with your -data file. Because metadata files are free text format, they also +Además, hay muchas razones por las que otras personas pueden querer examinar o +usar sus datos: para comprender sus hallazgos, para verificar sus hallazgos, +para revisar la publicación enviada, para replicar sus resultados, para +diseñar un estudio similar, o incluso archivar sus datos para que otros puedan acceder a ellos y +reutilizarlos. Si bien los datos digitales, por definición, son +legibles por máquina, comprender su significado es una tarea para los seres +humanos. No se puede sobrestimar la importancia de documentar sus datos durante la fase de recopilación +y análisis de su investigación, +especialmente si su investigación va a ser parte del registro +académico. + +Sin embargo, los metadatos no deben estar contenidos en el propio archivo de datos +.
A diferencia de una tabla en un documento o un archivo complementario, los metadatos (en +la forma de leyendas) no deben incluirse en un archivo de datos ya que esta información +no son datos, y su inclusión puede alterar la forma en que los programas +de computadora interpretan su archivo de datos. Más bien, los metadatos deben almacenarse +como un archivo separado en el mismo directorio que su archivo de datos, preferiblemente +en formato de texto sin formato con un nombre que lo asocie claramente con su archivo de +datos. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file. -Additionally, file or database level metadata describes how files that -make up the dataset relate to each other; what format they are in; and -whether they supercede or are superceded by previous files. A -folder-level readme.txt file is the classic way of accounting for all -the files and folders in a project. +Además, los metadatos a nivel de archivo o base de datos describen cómo los archivos que +componen el conjunto de datos se relacionan entre sí; en qué formato están; y +si reemplazan o son reemplazados por archivos anteriores. Un archivo readme.txt a nivel de carpeta +es la forma clásica de contabilizar todos +los archivos y carpetas de un proyecto. -(Text on metadata adapted from the online course Research Data -[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, -University of Edinburgh. MANTRA is licensed under a Creative Commons +(Texto sobre metadatos adaptado del curso en línea Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) de EDINA y Data Library, +Universidad de Edimburgo. MANTRA tiene una licencia Creative Commons Attribution 4.0 International -License.) +.)
-## Exporting data +## Exportar datos -**Question** +**Pregunta** -- How can we export data from spreadsheets in a way that is useful for - downstream applications? +- ¿Cómo podemos exportar datos de hojas de cálculo de una manera que sea útil para + aplicaciones posteriores? -**Objectives** +**Objetivos** -- Store spreadsheet data in universal file formats. -- Export data from a spreadsheet to a CSV file. +- Almacene datos de hojas de cálculo en formatos de archivo universales. +- Exporte datos de una hoja de cálculo a un archivo CSV. -**Keypoints** +**Puntos clave** -- Data stored in common spreadsheet formats will often not be read - correctly into data analysis software, introducing errors into your - data. +- Los datos almacenados en formatos comunes de hojas de cálculo a menudo no se leerán + correctamente en el software de análisis de datos, lo que introducirá errores en sus + datos. -- Exporting data from spreadsheets to formats like CSV or TSV puts it - in a format that can be used consistently by most programs. +- Exportar datos de hojas de cálculo a formatos como CSV o TSV los coloca + en un formato que la mayoría de los programas pueden usar de manera consistente. -Storing the data you're going to work with for your analyses in Excel -default file format (`*.xls` or `*.xlsx` - depending on the Excel -version) isn't a good idea. Why? +Almacenar los datos con los que va a trabajar para sus análisis en el formato de archivo predeterminado de Excel +(`*.xls` o `*.xlsx`, dependiendo de la versión de +Excel) no es una buena idea. ¿Por qué? -- Because it is a proprietary format, and it is possible that in the - future, technology won't exist (or will become sufficiently rare) to - make it inconvenient, if not impossible, to open the file. +- Porque es un formato propietario, y es posible que en el + futuro la tecnología para abrirlo no exista (o se vuelva lo suficientemente rara), lo que + haría inconveniente, si no imposible, abrir el archivo.
-- Other spreadsheet software may not be able to open files saved in a - proprietary Excel format. +- Es posible que otro software de hoja de cálculo no pueda abrir archivos guardados en un formato propietario de + Excel. -- Different versions of Excel may handle data differently, leading to - inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) - is a well-documented example of inconsistencies in data storage. +- Las diferentes versiones de Excel pueden manejar los datos de manera diferente, lo que genera + inconsistencias. [Fechas](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + es un ejemplo bien documentado de inconsistencias en el almacenamiento de datos. -- Finally, more journals and grant agencies are requiring you to - deposit your data in a data repository, and most of them don't - accept Excel format. It needs to be in one of the formats discussed - below. +- Finalmente, cada vez más revistas y agencias de subvenciones exigen que + deposite sus datos en un repositorio de datos, y la mayoría de ellos no + aceptan el formato Excel. Debe estar en uno de los formatos que se analizan + a continuación. -- The above points also apply to other formats such as open data - formats used by LibreOffice / Open Office. These formats are not - static and do not get parsed the same way by different software - packages. +- Los puntos anteriores también se aplican a otros formatos, como los formatos de datos abiertos + utilizados por LibreOffice/Open Office. Estos formatos no son + estáticos y no se analizan de la misma manera por diferentes paquetes de + software. -Storing data in a universal, open, and static format will help deal -with this problem. Try tab-delimited (tab separated values or TSV) or -comma-delimited (comma separated values or CSV). CSV files are plain -text files where the columns are separated by commas, hence 'comma -separated values' or CSV.
The advantage of a CSV file over an -Excel/SPSS/etc. file is that we can open and read a CSV file using -just about any software, including plain text editors like TextEdit or -NotePad. Data in a CSV file can also be easily imported into other -formats and environments, such as SQLite and R. We're not tied to a -certain version of a certain expensive program when we work with CSV -files, so it's a good format to work with for maximum portability and -endurance. Most spreadsheet programs can save to delimited text -formats like CSV easily, although they may give you a warning during -the file export. +Almacenar datos en un formato universal, abierto y estático ayudará a solucionar +este problema. Pruebe el formato delimitado por tabulaciones (valores separados por tabulaciones o TSV) o +delimitado por comas (valores separados por comas o CSV). Los archivos CSV son archivos de texto simple +donde las columnas están separadas por comas, de ahí 'valores separados por +comas' o CSV. La ventaja de un archivo CSV sobre un archivo de +Excel/SPSS/etc. es que podemos abrir y leer un archivo CSV usando +casi cualquier software, incluidos editores de texto sin formato como TextEdit o +NotePad. Los datos en un archivo CSV también se pueden importar fácilmente a otros +formatos y entornos, como SQLite y R. No estamos atados a una +determinada versión de un determinado programa costoso cuando trabajamos con archivos +CSV, por lo que es un buen formato para lograr la máxima portabilidad y +durabilidad. La mayoría de los programas de hojas de cálculo pueden guardar fácilmente en formatos de texto delimitado +como CSV, aunque pueden darle una advertencia durante +la exportación del archivo. -To save a file you have opened in Excel in CSV format: +Para guardar un archivo que ha abierto en Excel en formato CSV: -1. From the top menu select 'File' and 'Save as'. -2. In the 'Format' field, from the list, select 'Comma Separated - Values' (`*.csv`). -3.
Double check the file name and the location where you want to save - it and hit 'Save'. +1. En el menú superior, seleccione "Archivo" y "Guardar como". +2. En el campo 'Formato', de la lista, seleccione 'Valores + separados por comas' (`*.csv`). +3. Verifique dos veces el nombre del archivo y la ubicación donde desea guardarlo + y presione 'Guardar'. -An important note for backwards compatibility: you can open CSV files -in Excel! +Una nota importante para la compatibilidad con versiones anteriores: ¡puedes abrir archivos CSV +en Excel! ```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/excel-to-csv.png") ``` -**A note on R and `xls`**: There are R packages that can read `xls` -files (as well as Google spreadsheets). It is even possible to access -different worksheets in the `xls` documents. +**Una nota sobre R y `xls`**: Hay paquetes de R que pueden leer archivos `xls` +(así como hojas de cálculo de Google). Incluso es posible acceder a +diferentes hojas de trabajo en los documentos `xls`. -**But** +**Pero** -- some of these only work on Windows. -- this equates to replacing a (simple but manual) export to `csv` with - additional complexity/dependencies in the data analysis R code. -- data formatting best practice still apply. -- Is there really a good reason why `csv` (or similar) is not - adequate? +- algunos de estos sólo funcionan en Windows. +- esto equivale a reemplazar una exportación (simple pero manual) a `csv` con + complejidad/dependencias adicionales en el código R de análisis de datos. +- Las mejores prácticas de formato de datos aún se aplican. +- ¿Existe realmente una buena razón por la cual `csv` (o similar) no es + adecuado? -### Caveats on commas +### Advertencias sobre las comas -In some datasets, the data values themselves may include commas -(,). 
In that case, the software which you use (including Excel) will -most likely incorrectly display the data in columns. This is because -the commas which are a part of the data values will be interpreted as -delimiters. +En algunos conjuntos de datos, los propios valores de los datos pueden incluir comas +(,). En ese caso, lo más probable es que el software que utilice (incluido Excel) +muestre incorrectamente los datos en columnas. Esto se debe a que +las comas que forman parte de los valores de datos se interpretarán como +delimitadores. -For example, our data might look like this: +Por ejemplo, nuestros datos podrían verse así: ``` species_id,genus,species,taxa @@ -746,79 +746,79 @@ AS,Ammodramus,savannarum,Bird BA,Baiomys,taylori,Rodent ``` -In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the -value for `taxa` includes a comma (`Rodent, not censused`). If we try -to read the above into Excel (or other spreadsheet program), we will -get something like this: +En el registro `AH,Ammospermophilus,harrisi,Rodent, not censused` el valor +para `taxa` incluye una coma (`Rodent, not censused`). Si intentamos +leer lo anterior en Excel (u otro programa de hoja de cálculo), +obtendremos algo como esto: ```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("fig/csv-mistake.png") ``` -The value for `taxa` was split into two columns (instead of being put -in one column `D`). This can propagate to a number of further -errors. For example, the extra column will be interpreted as a column -with many missing values (and without a proper header). In addition to -that, the value in column `D` for the record in row 3 (so the one -where the value for 'taxa' contained the comma) is now incorrect. +El valor de `taxa` se dividió en dos columnas (en lugar de colocarse +en una columna `D`). Esto puede propagarse a una serie de errores +adicionales.
Por ejemplo, la columna adicional se interpretará como una columna +con muchos valores faltantes (y sin un encabezado adecuado). Además de +eso, el valor en la columna `D` para el registro en la fila 3 (es decir, aquel +en el que el valor de 'taxa' contenía la coma) ahora es incorrecto. -If you want to store your data in `csv` format and expect that your -data values may contain commas, you can avoid the problem discussed -above by putting the values in quotes (""). Applying this rule, our -data might look like this: +Si desea almacenar sus datos en formato `csv` y espera que sus valores de datos +puedan contener comas, puede evitar el problema discutido +anteriormente poniendo los valores entre comillas (""). Aplicando esta regla, nuestros datos +podrían verse así: ``` -species_id,genus,species,taxa +species_id,genus,species,taxa "AB","Amphispiza","bilineata","Bird" -"AH","Ammospermophilus","harrisi","Rodent, not censused" -"AS","Ammodramus","savannarum","Bird" -"BA","Baiomys","taylori","Rodent" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" ``` -Now opening this file as a `csv` in Excel will not lead to an extra -column, because Excel will only use commas that fall outside of -quotation marks as delimiting characters. - -Alternatively, if you are working with data that contains commas, you -likely will need to use another delimiter when working in a -spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and -working with TSV files. TSV files can be exported from spreadsheet -programs in the same way as CSV files. - -[^decsep]: This is particularly relevant in European - countries where the comma is used as a decimal - separator. In such cases, the default value separator in a - csv file will be the semi-colon (;), or values will be - systematically quoted.
- -If you are working with an already existing dataset in which the data -values are not included in "" but which have commas as both delimiters -and parts of data values, you are potentially facing a major problem -with data cleaning. If the dataset you're dealing with contains -hundreds or thousands of records, cleaning them up manually (by either -removing commas from the data values or putting the values into -quotes - "") is not only going to take hours and hours but may -potentially end up with you accidentally introducing many errors. - -Cleaning up datasets is one of the major problems in many scientific -disciplines. The approach almost always depends on the particular -context. However, it is a good practice to clean the data in an -automated fashion, for example by writing and running a script. The -Python and R lessons will give you the basis for developing skills to -build relevant scripts. - -## Summary +Ahora, abrir este archivo como `csv` en Excel no generará una columna +adicional, porque Excel solo usará comas que queden fuera de las comillas +como caracteres delimitadores. + +Alternativamente, si está trabajando con datos que contienen comas, +probablemente necesitará usar otro delimitador cuando trabaje en una hoja de +cálculo[^decsep]. En este caso, considere usar tabulaciones como delimitador y +trabajar con archivos TSV. Los archivos TSV se pueden exportar desde programas de hoja de cálculo +de la misma manera que los archivos CSV. + +[^decsep]: Esto es particularmente relevante en los países europeos + donde la coma se usa como separador + decimal. En tales casos, el separador de valores predeterminado en un archivo csv + será el punto y coma (;), o los valores se entrecomillarán + sistemáticamente. + +Si está trabajando con un conjunto de datos ya existente en el que los valores de datos +no están incluidos en "" pero que tienen comas como delimitadores +y partes de los valores de datos, potencialmente se enfrenta a un problema importante
+con la limpieza de datos. Si el conjunto de datos con el que está tratando contiene +cientos o miles de registros, limpiarlos manualmente (ya sea +eliminando comas de los valores de datos o poniendo los valores entre +comillas - "") no solo llevará horas y horas, sino que +potencialmente terminará introduciendo accidentalmente muchos errores. + +La limpieza de conjuntos de datos es uno de los principales problemas en muchas disciplinas +científicas. El enfoque casi siempre depende del contexto +particular. Sin embargo, es una buena práctica limpiar los datos de forma +automatizada, por ejemplo escribiendo y ejecutando un script. Las +lecciones de Python y R le brindarán la base para desarrollar habilidades para +crear scripts relevantes. + +## Resumen ```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} knitr::include_graphics("fig/analysis.png") ``` -A typical data analysis workflow is illustrated in the figure above, -where data is repeatedly transformed, visualised, and modelled. This -iteration is repeated multiple times until the data is understood. In -many real-life cases, however, most time is spent cleaning up and -preparing the data, rather than actually analysing and understanding -it. +En la figura anterior +se ilustra un flujo de trabajo de análisis de datos típico, donde los datos se transforman, visualizan y modelan repetidamente. Esta iteración +se repite varias veces hasta que se comprenden los datos. Sin embargo, en +muchos casos de la vida real, la mayor parte del tiempo se dedica a limpiar y +preparar los datos, en lugar de analizarlos y +comprenderlos. An agile data analysis workflow, with several fast iterations of the transform/visualise/model cycle is only feasible if the data is @@ -827,6 +827,6 @@ without having to look at it and/or fix it.
:::::::::::::::::::::::::::::::::::::::: keypoints -- Good data organization is the foundation of any research project. +- Una buena organización de los datos es la base de cualquier proyecto de investigación. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From 08c203721c2f369e248c2cc5b1cfcc8dd3ace912 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:29:51 +0900 Subject: [PATCH 169/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 920 ++++++++++++++-------------- 1 file changed, 460 insertions(+), 460 deletions(-) diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index 9edb0bc1e..5edbfded3 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: R and RStudio +title: R y RStudio teaching: 30 exercises: 0 --- @@ -10,329 +10,329 @@ exercises: 0 ::::::::::::::::::::::::::::::::::::::: objetivos -- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. -- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. -- Use the built-in RStudio help interface to search for more information on R functions. -- Demonstrate how to provide sufficient information for troubleshooting with the R user community. +- Describa el propósito de los paneles RStudio Script, Consola, Entorno y Gráficos. +- Organice archivos y directorios para un conjunto de análisis como un proyecto de R y comprenda el propósito del directorio de trabajo. +- Utilice la interfaz de ayuda integrada de RStudio para buscar más información sobre las funciones de R. +- Demuestre cómo proporcionar suficiente información para la resolución de problemas con la comunidad de usuarios de R. 
-:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What are R and RStudio? +- ¿Qué son R y RStudio? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Este episodio se basa en la lección _Análisis de datos y +> Visualización en R para ecologistas_ de Data Carpentries. -## What is R? What is RStudio? +## ¿Qué es R? ¿Qué es RStudio? -The term [R](https://www.r-project.org/) is used to refer to the -_programming language_, the _environment for statistical computing_ -and _the software_ that interprets the scripts written using it. +El término [R](https://www.r-project.org/) se utiliza para referirse al +_lenguaje de programación_, el _entorno para la computación estadística_ +y _el software_ que interpreta los scripts escritos con él. -[RStudio](https://rstudio.com) is currently a very popular way to not -only write your R scripts but also to interact with the R -software[^plainr]. To function correctly, RStudio needs R and -therefore both need to be installed on your computer. +[RStudio](https://rstudio.com) es actualmente una forma muy popular de +no sólo escribir tus scripts R sino también de interactuar con el software R +[^plainr]. Para funcionar correctamente, RStudio necesita R y +, por lo que ambos deben estar instalados en su computadora. -[^plainr]: As opposed to using R directly from the command line - console. There exist other software that interface and integrate - with R, but RStudio is particularly well suited for beginners - while providing numerous very advanced features. +[^plainr]: A diferencia de usar R directamente desde la línea de comando + consola. 
Existe otro software que interactúa e integra + con R, pero RStudio es particularmente adecuado para principiantes + y al mismo tiempo proporciona numerosas funciones muy avanzadas. -The RStudio IDE Cheat -Sheet -provides much more information than will be covered here, but can be -useful to learn keyboard shortcuts and discover new features. +La Hoja de trucos de RStudio IDE -## Why learn R? +proporciona mucha más información de la que se cubrirá aquí. pero puede ser +útil para aprender atajos de teclado y descubrir nuevas funciones. -### R does not involve lots of pointing and clicking, and that's a good thing +## ¿Por qué aprender R? + +### R no implica mucho apuntar y hacer clic, y eso es bueno The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of -written commands, and that's a good thing! So, if you want to redo -your analysis because you collected more data, you don't have to -remember which button you clicked in which order to obtain your -results; you just have to run your script again. +written commands, and that's a good thing! Por lo tanto, si desea rehacer +su análisis porque recopiló más datos, no tiene que +recordar en qué botón hizo clic y en qué orden para obtener sus +resultados; solo tienes que ejecutar tu script nuevamente. -Working with scripts makes the steps you used in your analysis clear, -and the code you write can be inspected by someone else who can give -you feedback and spot mistakes. +Trabajar con scripts aclara los pasos que utilizó en su análisis, +y el código que escribe puede ser inspeccionado por otra persona que puede brindarle +comentarios y detectar errores. -Working with scripts forces you to have a deeper understanding of what -you are doing, and facilitates your learning and comprehension of the -methods you use. 
+Trabajar con scripts te obliga a tener una comprensión más profunda de lo +que estás haciendo y facilita tu aprendizaje y comprensión de los +métodos que utilizas. -### R code is great for reproducibility +### El código R es excelente para la reproducibilidad -Reproducibility means that someone else (including your future self) can -obtain the same results from the same dataset when using the same -analysis code. +Reproducibilidad significa que otra persona (incluido su yo futuro) puede +obtener los mismos resultados del mismo conjunto de datos cuando usa el mismo código de análisis +. -R integrates with other tools to generate manuscripts or reports from your -code. If you collect more data, or fix a mistake in your dataset, the -figures and the statistical tests in your manuscript or report are updated -automatically. +R se integra con otras herramientas para generar manuscritos o informes a partir de su código +. Si recopila más datos o corrige un error en su conjunto de datos, las +cifras y las pruebas estadísticas de su manuscrito o informe se actualizan +automáticamente. -An increasing number of journals and funding agencies expect analyses -to be reproducible, so knowing R will give you an edge with these -requirements. +Un número cada vez mayor de revistas y agencias de financiación esperan que los análisis +sean reproducibles, por lo que conocer R le dará una ventaja con estos requisitos de +. -### R is interdisciplinary and extensible +### R es interdisciplinario y extensible. -With 10000+ packages[^whatarepkgs] that can be installed to extend its -capabilities, R provides a framework that allows you to combine -statistical approaches from many scientific disciplines to best suit -the analytical framework you need to analyse your data. For instance, -R has packages for image analysis, GIS, time series, population -genetics, and a lot more. 
+Con más de 10000 paquetes[^whatarepkgs] que se pueden instalar para ampliar sus +capacidades, R proporciona un marco que le permite combinar +enfoques estadísticos de muchas disciplinas científicas para adaptarse mejor a +el marco analítico que necesita para analizar sus datos. Por ejemplo, +R tiene paquetes para análisis de imágenes, SIG, series temporales, genética de poblaciones +y mucho más. -[^whatarepkgs]: i.e. add-ons that confer R with new functionality, - such as bioinformatics data analysis. +[^whatarepkgs]: es decir, complementos que confieren a R nuevas funciones, + , como el análisis de datos bioinformáticos. ```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/cran.png") ``` -### R works on data of all shapes and sizes +### R trabaja con datos de todas las formas y tamaños -The skills you learn with R scale easily with the size of your -dataset. Whether your dataset has hundreds or millions of lines, it -won't make much difference to you. +Las habilidades que aprende con R se escalan fácilmente con el tamaño de su conjunto de datos +. Ya sea que su conjunto de datos tenga cientos o millones de líneas, +no hará mucha diferencia para usted. -R is designed for data analysis. It comes with special data structures -and data types that make handling of missing data and statistical -factors convenient. +R está diseñado para el análisis de datos. Viene con estructuras de datos especiales +y tipos de datos que hacen conveniente el manejo de datos faltantes y factores estadísticos +. -R can connect to spreadsheets, databases, and many other data formats, -on your computer or on the web. +R puede conectarse a hojas de cálculo, bases de datos y muchos otros formatos de datos, +en su computadora o en la web. 
-### R produces high-quality graphics +### R produce gráficos de alta calidad -The plotting functionalities in R are extensive, and allow you to adjust -any aspect of your graph to convey most effectively the message from -your data. +Las funcionalidades de trazado en R son extensas y le permiten ajustar +cualquier aspecto de su gráfico para transmitir de manera más efectiva el mensaje de +sus datos. -### R has a large and welcoming community +### R tiene una comunidad grande y acogedora -Thousands of people use R daily. Many of them are willing to help you -through mailing lists and websites such as Stack -Overflow, or on the RStudio -community. These broad user communities -extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). +Miles de personas utilizan R a diario. Muchos de ellos están dispuestos a ayudarte +a través de listas de correo y sitios web como Stack +Overflow, o en RStudio +comunidad. Estas amplias comunidades de usuarios +se extienden a áreas especializadas como la bioinformática. Uno de esos subconjuntos de la comunidad R es [Bioconductor](https://bioconductor.org/), un proyecto científico para el análisis y la comprensión "de datos de ensayos biológicos actuales y emergentes". Este taller fue desarrollado por miembros de la comunidad Bioconductor; Para obtener más información sobre Bioconductor, consulte el taller complementario ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). 
-### Not only is R free, but it is also open-source and cross-platform +### R no sólo es gratuito, sino que también es de código abierto y multiplataforma -Anyone can inspect the source code to see how R works. Because of this -transparency, there is less chance for mistakes, and if you (or -someone else) find some, you can report and fix bugs. +Cualquiera puede inspeccionar el código fuente para ver cómo funciona R. Debido a esta +transparencia, hay menos posibilidades de que haya errores, y si usted (u +otra persona) encuentra algunos, puede informar y corregir errores. -## Knowing your way around RStudio +## Conociendo RStudio -Let's start by learning about [RStudio](https://www.rstudio.com/), -which is an Integrated Development Environment (IDE) for working with +Comencemos aprendiendo sobre [RStudio](https://www.rstudio.com/), +que es un entorno de desarrollo integrado (IDE) para trabajar con R. -The RStudio IDE open-source product is free under the Affero General -Public License (AGPL) v3. -The RStudio IDE is also available with a commercial license and -priority email support from Posit, Inc. +El producto de código abierto RStudio IDE es gratuito bajo la [Licencia Pública General Affero +(AGPL) v3](https://www.gnu.org/licenses/agpl-3.0.en.html). +El IDE de RStudio también está disponible con una licencia comercial y +soporte prioritario por correo electrónico de Posit, Inc. -We will use the RStudio IDE to write code, navigate the files on our -computer, inspect the variables we are going to create, and visualise -the plots we will generate.
RStudio también se puede utilizar para otras cosas +(por ejemplo, control de versiones, desarrollo de paquetes, escritura de aplicaciones Shiny) que +no cubriremos durante el taller. ```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/rstudio-screenshot.png") ``` -The RStudio window is divided into 4 "Panes": - -- the **Source** for your scripts and documents (top-left, in the - default layout) -- your **Environment/History** (top-right), -- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and -- the R **Console** (bottom-left). - -The placement of these panes and their content can be customised (see -menu, `Tools -> Global Options -> Pane Layout`). - -One of the advantages of using RStudio is that all the information you -need to write code is available in a single window. Additionally, with -many shortcuts, **autocompletion**, and **highlighting** for the major -file types you use while developing in R, RStudio will make typing -easier and less error-prone. - -## Getting set up - -It is good practice to keep a set of related data, analyses, and text -self-contained in a single folder, called the **working -directory**. All of the scripts within this folder can then use -**relative paths** to files that indicate where inside the project a -file is located (as opposed to absolute paths, which point to where a -file is on a specific computer). Working this way makes it a lot -easier to move your project around on your computer and share it with -others without worrying about whether or not the underlying scripts -will still work. 
- -RStudio provides a helpful set of tools to do this through its "Projects" -interface, which not only creates a working directory for you, but also remembers -its location (allowing you to quickly navigate to it) and optionally preserves -custom settings and open files to make it easier to resume work after a -break. Go through the steps for creating an "R Project" for this -tutorial below. - -1. Start RStudio. -2. Under the `File` menu, click on `New project`. Choose `New directory`, then - `New project`. -3. Enter a name for this new folder (or "directory"), and choose a - convenient location for it. This will be your **working directory** - for this session (or whole course) (e.g., `bioc-intro`). -4. Click on `Create project`. -5. (Optional) Set Preferences to 'Never' save workspace in RStudio. - -RStudio's default preferences generally work well, but saving a workspace to -.RData can be cumbersome, especially if you are working with larger datasets. -To turn that off, go to Tools --> 'Global Options' and select the 'Never' option -for 'Save workspace to .RData' on exit. +La ventana de RStudio está dividida en 4 "Paneles": + +- la **Fuente** de sus scripts y documentos (arriba a la izquierda, en el diseño + predeterminado) +- su **Entorno/Historial** (arriba a la derecha), +- sus **Archivos/Gráficos/Paquetes/Ayuda/Visor** (abajo a la derecha), y +- la **Consola** de R (abajo a la izquierda). + +La ubicación de estos paneles y su contenido se pueden personalizar (consulte el +menú `Tools -> Global Options -> Pane Layout`). + +Una de las ventajas de usar RStudio es que toda la información que +necesita para escribir código está disponible en una sola ventana. Además, con +muchos atajos, **autocompletado** y **resaltado** para los principales tipos de archivos +que utiliza mientras desarrolla en R, RStudio hará que escribir +sea más fácil y menos propenso a errores.
+ +## Preparándose + +Es una buena práctica mantener un conjunto de datos, análisis y texto relacionados +autocontenidos en una sola carpeta, llamada **directorio de +trabajo**. Todos los scripts dentro de esta carpeta pueden usar +**rutas relativas** a archivos que indican dónde dentro del proyecto se encuentra un archivo +(a diferencia de las rutas absolutas, que apuntan a +dónde se encuentra un +archivo en una computadora específica). Trabajar de esta manera hace que sea mucho +más fácil mover su proyecto en su computadora y compartirlo con +otras personas sin preocuparse de si los scripts subyacentes +seguirán funcionando o no. + +RStudio proporciona un útil conjunto de herramientas para hacer esto a través de su interfaz +"Proyectos", que no solo crea un directorio de trabajo para usted, sino que también recuerda +su ubicación (lo que le permite navegar rápidamente hasta él) y, opcionalmente, conserva +configuraciones personalizadas y archivos abiertos para que sea más fácil reanudar el trabajo después de un +descanso. Siga los pasos para crear un "Proyecto R" para este +tutorial a continuación. + +1. Inicie RStudio. +2. En el menú `File`, haga clic en `New project`. Elija `New directory`, luego + `New project`. +3. Ingrese un nombre para esta nueva carpeta (o "directorio") y elija una + ubicación conveniente para ella. Este será su **directorio de trabajo** + para esta sesión (o el curso completo) (por ejemplo, `bioc-intro`). +4. Haga clic en `Create project`. +5. (Opcional) Configure las preferencias de RStudio para no guardar nunca ('Never') el espacio de trabajo. + +Las preferencias predeterminadas de RStudio generalmente funcionan bien, pero guardar un espacio de trabajo en +.RData puede ser engorroso, especialmente si trabaja con conjuntos de datos más grandes. +Para desactivarlo, vaya a Tools --> 'Global Options' y seleccione la opción 'Never' +para 'Save workspace to .RData' on exit.
```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudio-preferences.png") ``` -To avoid character encoding issues between Windows and other operating -systems, we are -going to set UTF-8 by default: +Para evitar problemas de codificación de caracteres entre Windows y otros sistemas operativos +, vamos +a configure UTF-8 de forma predeterminada: ```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/utf8.png") ``` -### Organizing your working directory - -Using a consistent folder structure across your projects will help keep things -organised, and will also make it easy to find/file things in the future. This -can be especially helpful when you have multiple projects. In general, you may -create directories (folders) for **scripts**, **data**, and **documents**. - -- **`data/`** Use this folder to store your raw data and intermediate - datasets you may create for the need of a particular analysis. For - the sake of transparency and - [provenance](https://en.wikipedia.org/wiki/Provenance), you should - _always_ keep a copy of your raw data accessible and do as much of - your data cleanup and preprocessing programmatically (i.e., with - scripts, rather than manually) as possible. Separating raw data - from processed data is also a good idea. For example, you could - have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept - separate from a `data/processed/tree.survey.csv` file generated by - the `scripts/01.preprocess.tree_survey.R` script. -- **`documents/`** This would be a place to keep outlines, drafts, - and other text. 
-- **`scripts/`** (or `src`) This would be the location to keep your R - scripts for different analyses or plotting, and potentially a - separate folder for your functions (more on that later). - -You may want additional directories or subdirectories depending on -your project needs, but these should form the backbone of your working -directory. +### Organizando su directorio de trabajo + +Usar una estructura de carpetas coherente en todos tus proyectos te ayudará a mantener las cosas +organizadas y también facilitará la búsqueda y el archivo de cosas en el futuro. Esto +puede resultar especialmente útil cuando tienes varios proyectos. En general, puedes +crear directorios (carpetas) para **scripts**, **datos** y **documentos**. + +- **`data/`** Utilice esta carpeta para almacenar sus datos sin procesar y conjuntos de datos + intermedios que pueda crear para la necesidad de un análisis particular. En + aras de la transparencia y la + [procedencia](https://en.wikipedia.org/wiki/Provenance), debe + guardar _siempre_ una copia accesible de sus datos sin procesar y hacer la mayor cantidad posible de + limpieza y preprocesamiento de sus datos mediante programación (es decir, con + scripts, en lugar de manualmente). También es una buena idea separar los datos sin procesar + de los datos procesados. Por ejemplo, podría + tener los archivos `data/raw/tree_survey.plot1.txt` y `...plot2.txt` mantenidos + separados de `data/processed/tree.survey.csv` generado por + el script `scripts/01.preprocess.tree_survey.R`. +- **`documents/`** Este sería un lugar para guardar esquemas, borradores, + y otro texto. +- **`scripts/`** (o `src`) Esta sería la ubicación para guardar tus scripts R + para diferentes análisis o trazados, y potencialmente una carpeta + separada para tus funciones (más sobre eso más adelante). 
+ +Es posible que desee directorios o subdirectorios adicionales dependiendo de +las necesidades de su proyecto, pero estos deberían formar la columna vertebral de su directorio de +trabajo. ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/working-directory-structure.png") +knitr::include_graphics("fig/working-directory-structure.png") ``` -For this course, we will need a `data/` folder to store our raw data, -and we will use `data_output/` for when we learn how to export data as -CSV files, and `fig_output/` folder for the figures that we will save. +Para este curso, necesitaremos una carpeta `data/` para almacenar nuestros datos sin procesar, +y usaremos `data_output/` cuando aprendamos a exportar datos como +archivos CSV, y la carpeta `fig_output/` para las figuras que guardaremos. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: create your project directory structure +## Desafío: crea la estructura de directorios de tu proyecto -Under the `Files` tab on the right of the screen, click on `New Folder` and -create a folder named `data` within your newly created working directory -(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at -your R console.) Repeat these operations to create a `data_output/` and a -`fig_output` folders. +En la pestaña `Archivos` a la derecha de la pantalla, haga clic en `Nueva carpeta` y +cree una carpeta llamada `data` dentro de su directorio de trabajo recién creado +(por ejemplo, `~/bioc-intro/data`). (Como alternativa, escriba `dir.create("data")` en +su consola R). Repita estas operaciones para crear las carpetas `data_output/` y +`fig_output`. 
-:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -We are going to keep the script in the root of our working directory -because we are only going to use one file and it will make things -easier. +Mantendremos el script en la raíz de nuestro directorio de trabajo +porque solo usaremos un archivo y facilitará las +cosas. -Your working directory should now look like this: +Su directorio de trabajo ahora debería verse así: ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` -**Project management** is also applicable to bioinformatics projects, -of course[^bioindatascience]. William Noble (@Noble:2009) proposes the -following directory structure: - -[^bioindatascience]: In this course, we consider bioinformatics as - data science applied to biological or bio-medical data. - -> Directory names are in large typeface, and filenames are in smaller -> typeface. Only a subset of the files are shown here. Note that the -> dates are formatted `<year>-<month>-<day>` so that they can be -> sorted in chronological order. The source code `src/ms-analysis.c` -> is compiled to create `bin/ms-analysis` and is documented in -> `doc/ms-analysis.html`. The `README` files in the data directories -> specify who downloaded the data files from what URL on what -> date. The driver script `results/2009-01-15/runall` automatically -> generates the three subdirectories split1, split2, and split3, -> corresponding to three cross-validation splits. The -> `bin/parse-sqt.py` script is called by both of the `runall` driver -> scripts. +**La gestión de proyectos** también es aplicable a proyectos de bioinformática, +por supuesto[^bioindatascience]. 
William Noble (@Noble:2009) propone la +siguiente estructura de directorios: + +[^bioindatascience]: En este curso, consideramos la bioinformática como + ciencia de datos aplicada a datos biológicos o biomédicos. + +> Los nombres de los directorios están en tipo de letra grande y los nombres de archivos están en tipo de letra +> más pequeño. Aquí solo se muestra un subconjunto de los archivos. Tenga en cuenta que las fechas +> tienen el formato `<year>-<month>-<day>` para que puedan ordenarse +> en orden cronológico. El código fuente `src/ms-analysis.c` +> se compila para crear `bin/ms-analysis` y está documentado en +> `doc/ms-analysis.html`. Los archivos `README` en los directorios de datos +> especifican quién descargó los archivos de datos de qué URL en qué fecha +> . El script del controlador `results/2009-01-15/runall` genera automáticamente +> los tres subdirectorios split1, split2 y split3, +> correspondientes a tres divisiones de validación cruzada. El script +> `bin/parse-sqt.py` es llamado por ambos scripts del controlador `runall` +> . ```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} knitr::include_graphics("fig/noble-bioinfo-project.png") ``` -The most important aspect of a well defined and well documented -project directory is to enable someone unfamiliar with the -project[^futureself] to +El aspecto más importante de un directorio de proyecto +bien definido y bien documentado es permitir que alguien que no esté familiarizado con el proyecto +[^futureself] pueda -1. understand what the project is about, what data are available, what - analyses were run, and what results were produced and, most - importantly to +1. comprender de qué se trata el proyecto, qué datos están disponibles, qué + análisis se realizaron y qué resultados se produjeron y, lo más importante -2. repeat the analysis over again - with new data, or changing some - analysis parameters. +2. 
repita el análisis nuevamente, con nuevos datos o cambiando algunos + parámetros de análisis. -[^futureself]: That someone could be, and very likely will be your - future self, a couple of months or years after the analyses were - run. +[^futureself]: Ese alguien podría ser, y muy probablemente será tu + yo futuro, un par de meses o años después de que se realizaron + los análisis. -### The working directory +### El directorio de trabajo -The working directory is an important concept to understand. It is the -place from where R will be looking for and saving the files. When you -write code for your project, it should refer to files in relation to -the root of your working directory and only need files within this -structure. +El directorio de trabajo es un concepto importante que hay que comprender. Es el +lugar desde donde R buscará y guardará los archivos. Cuando +escribe código para su proyecto, debe hacer referencia a archivos en relación con +la raíz de su directorio de trabajo y solo necesita archivos dentro de esta estructura +. -Using RStudio projects makes this easy and ensures that your working -directory is set properly. If you need to check it, you can use +El uso de proyectos de RStudio hace que esto sea fácil y garantiza que su directorio de trabajo +esté configurado correctamente. Si necesita verificarlo, puede usar `getwd()`. If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser where your working directory should be, and clicking on the blue gear icon `More`, and select `Set As Working Directory`. -Alternatively you can use `setwd("/path/to/working/directory")` to -reset your working directory. However, your scripts should not include -this line because it will fail on someone else's computer. +Alternativamente, puede usar `setwd("/path/to/working/directory")` para +restablecer su directorio de trabajo. 
Sin embargo, sus scripts no deben incluir +esta línea porque fallará en la computadora de otra persona. -**Example** +**Ejemplo** -The schema below represents the working directory `bioc-intro` with the -`data` and `fig_output` sub-directories, and 2 files in the latter: +El siguiente esquema representa el directorio de trabajo `bioc-intro` con los subdirectorios +`data` y `fig_output`, y 2 archivos en este último: ``` bioc-intro/data/ @@ -340,155 +340,155 @@ bioc-intro/data/ /fig_output/fig2.png ``` -If we were in the working directory, we could refer to the `fig1.pdf` -file using the relative path `bioc-intro/fig_output/fig1.pdf` or the -absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. - -If we were in the `data` directory, we would use the relative path -`../fig_output/fig1.pdf` or the same absolute path -`/home/user/bioc-intro/fig_output/fig1.pdf`. - -## Interacting with R - -The basis of programming is that we write down instructions for the -computer to follow, and then we tell the computer to follow those -instructions. We write, or _code_, instructions in R because it is a -common language that both the computer and we can understand. We call -the instructions _commands_ and we tell the computer to follow the -instructions by _executing_ (also called _running_) those commands. - -There are two main ways of interacting with R: by using the -**console** or by using **scripts** (plain text files that contain -your code). The console pane (in RStudio, the bottom left panel) is -the place where commands written in the R language can be typed and -executed immediately by the computer. It is also where the results -will be shown for commands that have been executed. You can type -commands directly into the console and press `Enter` to execute those -commands, but they will be forgotten when you close the session. 
- -Because we want our code and workflow to be reproducible, it is better -to type the commands we want in the script editor, and save the -script. This way, there is a complete record of what we did, and -anyone (including our future selves!) can easily replicate the -results on their computer. Note, however, that merely typing the commands -in the script does not automatically _run_ them - they still need to -be sent to the console for execution. - -RStudio allows you to execute commands directly from the script editor -by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will -work, too). The command on the current line in the script (indicated -by the cursor) or all of the commands in the currently selected text -will be sent to the console and executed when you press `Ctrl` + -`Enter`. You can find other keyboard shortcuts in this RStudio -cheatsheet about the RStudio -IDE. - -At some point in your analysis you may want to check the content of a -variable or the structure of an object, without necessarily keeping a -record of it in your script. You can type these commands and execute -them directly in the console. RStudio provides the `Ctrl` + `1` and -`Ctrl` + `2` shortcuts allow you to jump between the script and the -console panes. - -If R is ready to accept commands, the R console shows a `>` prompt. If +Si estuviéramos en el directorio de trabajo, podríamos referirnos al archivo `fig1.pdf` +usando la ruta relativa `bioc-intro/fig_output/fig1.pdf` o la ruta absoluta +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +Si estuviéramos en el directorio `data`, usaríamos la ruta relativa +`../fig_output/fig1.pdf` o la misma ruta absoluta +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interactuando con R + +La base de la programación es que escribimos instrucciones para que las siga la +computadora, y luego le decimos a la computadora que siga esas +instrucciones. 
Escribimos, o _codificamos_, instrucciones en R porque es un +lenguaje común que tanto la computadora como nosotros podemos entender. Llamamos +a las instrucciones _comandos_ y le decimos a la computadora que siga las instrucciones +_ejecutando_ (también llamado _corriendo_) esos comandos. + +Hay dos formas principales de interactuar con R: usando la +**consola** o usando **scripts** (archivos de texto sin formato que contienen +su código). El panel de la consola (en RStudio, el panel inferior izquierdo) es +el lugar donde los comandos escritos en lenguaje R se pueden escribir y +ejecutar inmediatamente en la computadora. También es donde se mostrarán los resultados +de los comandos que se han ejecutado. Puede escribir +comandos directamente en la consola y presionar `Enter` para ejecutar esos +comandos, pero se olvidarán cuando cierre la sesión. + +Como queremos que nuestro código y flujo de trabajo sean reproducibles, es mejor +escribir los comandos que queremos en el editor de scripts y guardar el +script. De esta manera, hay un registro completo de lo que hicimos, y +cualquiera (¡incluido nuestro yo futuro!) puede replicar fácilmente los +resultados en su computadora. Sin embargo, tenga en cuenta que simplemente escribir los comandos +en el script no los _ejecuta_ automáticamente; aún deben enviarse +a la consola para su ejecución. + +RStudio le permite ejecutar comandos directamente desde el editor de scripts +usando el acceso directo `Ctrl` + `Enter` (en Mac, `Cmd` + `Return` también +funcionará). El comando en la línea actual en el script (indicado +por el cursor) o todos los comandos en el texto actualmente seleccionado +se enviarán a la consola y se ejecutarán cuando presione `Ctrl` + +`Enter`. Puede encontrar otros atajos de teclado en esta [hoja +de referencia sobre el IDE de +RStudio](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf). 
+ +En algún momento de su análisis, es posible que desee verificar el contenido de una variable +o la estructura de un objeto, sin necesariamente mantener un registro +del mismo en su secuencia de comandos. Puede escribir estos comandos y ejecutarlos +directamente en la consola. RStudio proporciona los atajos `Ctrl` + `1` y +`Ctrl` + `2` que le permiten saltar entre el script y los paneles de la consola +. + +Si R está listo para aceptar comandos, la consola de R muestra un mensaje `>`. If it receives a command (by typing, copy-pasting or sending from the script editor using `Ctrl` + `Enter`), R will try to execute it, and when ready, will show the results and come back with a new `>` prompt to wait for new commands. -If R is still waiting for you to enter more data because it isn't -complete yet, the console will show a `+` prompt. It means that you -haven't finished entering a complete command. This is because you have +Si R todavía está esperando que ingrese más datos porque aún no está +completo, la consola mostrará un mensaje `+`. Significa que +no has terminado de ingresar un comando completo. This is because you have not 'closed' a parenthesis or quotation, i.e. you don't have the same number of left-parentheses as right-parentheses, or the same number of -opening and closing quotation marks. When this happens, and you -thought you finished typing your command, click inside the console -window and press `Esc`; this will cancel the incomplete command and -return you to the `>` prompt. - -## How to learn more during and after the course? - -The material we cover during this course will give you an initial -taste of how you can use R to analyse data for your own -research. However, you will need to learn more to do advanced -operations such as cleaning your dataset, using statistical methods, -or creating beautiful graphics[^inthiscoure]. 
The best way to become -proficient and efficient at R, as with any other tool, is to use it to -address your actual research questions. As a beginner, it can feel -daunting to have to write a script from scratch, and given that many -people make their code available online, modifying existing code to -suit your purpose might make it easier for you to get started. - -[^inthiscoure]: We will introduce most of these (except statistics) - here, but will only manage to scratch the surface of the wealth of - what is possible to do with R. +opening and closing quotation marks. Cuando esto suceda, y usted +pensó que había terminado de escribir su comando, haga clic dentro de la ventana de la consola +y presione `Esc`; esto cancelará el comando incompleto y +lo devolverá al mensaje `>`. + +## ¿Cómo aprender más durante y después del curso? + +El material que cubrimos durante este curso le dará una +muestra inicial de cómo puede usar R para analizar datos para su propia +investigación. Sin embargo, necesitarás aprender más para realizar +operaciones avanzadas, como limpiar tu conjunto de datos, usar métodos estadísticos, +o crear hermosos gráficos[^inthiscoure]. La mejor manera de volverse +competente y eficiente en R, como con cualquier otra herramienta, es utilizarlo para +abordar sus preguntas de investigación reales. Como principiante, puede resultar +desalentador tener que escribir un script desde cero y, dado que muchas +personas ponen su código a disposición en línea, modifican el código existente para +que se adapte a su propósito. podría facilitarle el comienzo. + +[^inthiscoure]: Introduciremos la mayoría de estos (excepto las estadísticas) + aquí, pero solo lograremos arañar la superficie de la riqueza de + lo que es posible hacer con R. 
```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} -knitr::include_graphics("fig/kitten-try-things.jpg") +knitr::include_graphics("fig/kitten-try-things.jpg") ``` -## Seeking help +## Buscando ayuda -### Use the built-in RStudio help interface to search for more information on R functions +### Utilice la interfaz de ayuda integrada de RStudio para buscar más información sobre las funciones de R ```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudiohelp.png") ``` -One of the fastest ways to get help, is to use the RStudio help -interface. This panel by default can be found at the lower right hand -panel of RStudio. As seen in the screenshot, by typing the word -"Mean", RStudio tries to also give a number of suggestions that you -might be interested in. The description is then shown in the display -window. +Una de las formas más rápidas de obtener ayuda es utilizar la interfaz de ayuda +de RStudio. Este panel por defecto se puede encontrar en el panel inferior derecho +de RStudio. Como se ve en la captura de pantalla, al escribir la palabra +"Mean", RStudio intenta dar también una serie de sugerencias que podrían +interesarle. La descripción se muestra luego en la ventana de +visualización. 
-### I know the name of the function I want to use, but I'm not sure how to use it +### Sé el nombre de la función que quiero usar, pero no estoy seguro de cómo usarla -If you need help with a specific function, let's say `barplot()`, you -can type: +Si necesita ayuda con una función específica, digamos `barplot()`, +puede escribir: ```{r, eval=FALSE, purl=TRUE} -?barplot +?barplot ``` -If you just need to remind yourself of the names of the arguments, you can use: +Si sólo necesita recordar los nombres de los argumentos, puede utilizar: ```{r, eval=FALSE, purl=TRUE} -args(lm) +args(lm) ``` -### I want to use a function that does X, there must be a function for it but I don't know which one... +### Quiero usar una función que haga X, debe haber una función para ello pero no sé cuál... -If you are looking for a function to do a particular task, you can use the -`help.search()` function, which is called by the double question mark `??`. -However, this only looks through the installed packages for help pages with a -match to your search request +Si está buscando una función para realizar una tarea en particular, puede usar la función +`help.search()`, que se llama mediante el doble signo de interrogación `??`. +Sin embargo, esto solo busca en los paquetes instalados páginas de ayuda que coincidan +con su solicitud de búsqueda. ```{r, eval=FALSE, purl=TRUE} ??kruskal ``` -If you can't find what you are looking for, you can use -the [rdocumentation.org](https://www.rdocumentation.org) website that searches -through the help files across all packages available. +Si no puede encontrar lo que busca, puede utilizar +el sitio web [rdocumentation.org](https://www.rdocumentation.org) que busca +a través de los archivos de ayuda en todos los paquetes disponibles. -Finally, a generic Google or internet search "R \<task>" will often either send -you to the appropriate package documentation or a helpful forum where someone -else has already asked your question. 
+Finalmente, una búsqueda genérica en Google o en Internet "R \<task>" a menudo lo enviará +a la documentación del paquete correspondiente o a un foro útil donde alguien +más ya haya hecho su pregunta. -### I am stuck... I get an error message that I don't understand +### Estoy atascado... Me sale un mensaje de error que no entiendo -Start by googling the error message. However, this doesn't always work very well -because often, package developers rely on the error catching provided by R. You -end up with general error messages that might not be very helpful to diagnose a -problem (e.g. "subscript out of bounds"). If the message is very generic, you -might also include the name of the function or package you're using in your -query. +Comience buscando en Google el mensaje de error. Sin embargo, esto no siempre funciona muy bien +porque a menudo, los desarrolladores de paquetes confían en la detección de errores proporcionada por R. Usted +termina con mensajes de error generales que pueden no ser muy útiles para diagnosticar un +problema (por ejemplo, "subíndice fuera de límites"). Si el mensaje es muy genérico, +también podrías incluir el nombre de la función o paquete que estás usando en tu consulta +. -However, you should check Stack Overflow. Search using the `[r]` tag. Most -questions have already been answered, but the challenge is to use the right -words in the search to find the -answers: +Sin embargo, deberías comprobar Stack Overflow. Busque usando la etiqueta `[r]`. La mayoría de las +preguntas ya han sido respondidas, pero el desafío es usar las +palabras correctas en la búsqueda para encontrar las +respuestas: [http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) @@ -496,173 +496,173 @@ The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language. 
-The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical -but it is full of useful information. - -### Asking for help - -The key to receiving help from someone is for them to rapidly grasp -your problem. You should make it as easy as possible to pinpoint where -the issue might be. - -Try to use the correct words to describe your problem. For instance, a -package is not the same thing as a library. Most people will -understand what you meant, but others have really strong feelings -about the difference in meaning. The key point is that it can make -things confusing for people trying to help you. Be as precise as -possible when describing your problem. - -If possible, try to reduce what doesn't work to a simple _reproducible -example_. If you can reproduce the problem using a very small data -frame instead of your 50000 rows and 10000 columns one, provide the -small one with the description of your problem. When appropriate, try -to generalise what you are doing so even people who are not in your -field can understand the question. For instance instead of using a -subset of your real dataset, create a small (3 columns, 5 rows) -generic one. For more information on how to write a reproducible -example see this article by Hadley +Las [Preguntas frecuentes de R](https://cran.r-project.org/doc/FAQ/R-FAQ.html) son densas y técnicas +pero están llenas de información útil. + +### Pidiendo ayuda + +La clave para recibir ayuda de alguien es que comprenda rápidamente +tu problema. Deberías hacer que sea lo más fácil posible identificar dónde +podría estar el problema. + +Intente utilizar las palabras correctas para describir su problema. Por ejemplo, un paquete +no es lo mismo que una biblioteca. La mayoría de las personas +entenderán lo que quisiste decir, pero otros tienen sentimientos muy fuertes +sobre la diferencia de significado. El punto clave es que puede hacer que +las cosas sean confusas para las personas que intentan ayudarte. 
Sea lo más preciso +posible al describir su problema. + +Si es posible, intente reducir lo que no funciona a un simple _ejemplo +reproducible_. Si puede reproducir el problema usando un marco de datos muy pequeño +en lugar del de 50000 filas y 10000 columnas, proporcione el +pequeño con la descripción de su problema. Cuando sea apropiado, intente +generalizar lo que está haciendo, de modo que incluso las personas que no están en su campo +puedan entender la pregunta. Por ejemplo, en lugar de utilizar un subconjunto +de su conjunto de datos real, cree uno pequeño (3 columnas, 5 filas) +genérico. Para obtener más información sobre cómo escribir un ejemplo +reproducible, consulte este artículo de Hadley Wickham. -To share an object with someone else, if it's relatively small, you -can use the function `dput()`. It will output R code that can be used -to recreate the exact same object as the one in memory: +Para compartir un objeto con otra persona, si es relativamente pequeño, +puede usar la función `dput()`. Generará código R que se puede usar +para recrear exactamente el mismo objeto que el que está en la memoria: ```{r, results="show", purl=TRUE} -## iris is an example data frame that comes with R and head() is a -## function that returns the first part of the data frame +## iris es un marco de datos de ejemplo que viene con R y head() es una +## función que devuelve la primera parte del marco de datos dput(head(iris)) ``` If the object is larger, provide either the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your -issue). 
Alternativamente, en particular si su pregunta no está relacionada +con un marco de datos, puede guardar cualquier objeto R en un archivo[^export]: ```{r, eval=FALSE, purl=FALSE} -saveRDS(iris, file="/tmp/iris.rds") +saveRDS(iris, file="/tmp/iris.rds") ``` -The content of this file is however not human readable and cannot be -posted directly on Stack Overflow. Instead, it can be sent to someone -by email who can read it with the `readRDS()` command (here it is -assumed that the downloaded file is in a `Downloads` folder in the -user's home directory): +Sin embargo, el contenido de este archivo no es legible por humanos y no se puede +publicar directamente en Stack Overflow. En su lugar, se puede enviar a alguien +por correo electrónico que pueda leerlo con el comando `readRDS()` (aquí +se supone que el archivo descargado está en una carpeta `Downloads` en el +directorio de inicio del usuario): ```{r, eval=FALSE, purl=FALSE} -some_data <- readRDS(file="~/Downloads/iris.rds") +some_data <- readRDS(file="~/Downloads/iris.rds") ``` -Last, but certainly not least, **always include the output of `sessionInfo()`** -as it provides critical information about your platform, the versions of R and -the packages that you are using, and other information that can be very helpful -to understand your problem. +Por último, pero no menos importante, **siempre incluya la salida de `sessionInfo()`** +ya que proporciona información crítica sobre su plataforma, las versiones de R y +los paquetes que está usando y otra información que puede ser muy útil +para comprender su problema. ```{r, results="show", purl=TRUE} -sessionInfo() +sessionInfo() ``` -### Where to ask for help? - -- The person sitting next to you during the course. Don't hesitate to - talk to your neighbour during the workshop, compare your answers, - and ask for help. -- Your friendly colleagues: if you know someone with more experience - than you, they might be able and willing to help you. 
-- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if - your question hasn't been answered before and is well crafted, - chances are you will get an answer in less than 5 min. Remember to - follow their guidelines on how to ask a good - question. -- The R-help mailing - list: it is read by a - lot of people (including most of the R core team), a lot of people - post to it, but the tone can be pretty dry, and it is not always - very welcoming to new users. If your question is valid, you are - likely to get an answer very fast but don't expect that it will come - with smiley faces. Also, here more than anywhere else, be sure to - use correct vocabulary (otherwise you might get an answer pointing - to the misuse of your words rather than answering your - question). You will also have more success if your question is about - a base function rather than a specific package. -- If your question is about a specific package, see if there is a - mailing list for it. Usually it's included in the DESCRIPTION file - of the package that can be accessed using - `packageDescription("name-of-package")`. You may also want to try to - email the author of the package directly, or open an issue on the - code repository (e.g., GitHub). -- There are also some topic-specific mailing lists (GIS, - phylogenetics, etc...), the complete list is - [here](https://www.r-project.org/mail.html). - -### More resources - -- The [Posting Guide](https://www.r-project.org/posting-guide.html) for - the R mailing lists. - -- How to ask for R - help - useful guidelines. - -- This blog post by Jon +### ¿Dónde pedir ayuda? + +- La persona sentada a tu lado durante el curso. No dudes en + hablar con tu vecino durante el taller, comparar tus respuestas, + y pedir ayuda. +- Tus colegas amigables: si conoces a alguien con más experiencia + que tú, es posible que pueda y esté dispuesto a ayudarte. 
+- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): si + tu pregunta no ha sido respondida antes y está bien redactada, + es probable que obtengas una respuesta en menos de 5 min. Recuerda + seguir sus pautas sobre cómo hacer una buena + pregunta. +- La lista de correo de R-help: + es leída por + mucha gente (incluida la mayoría del equipo central de R), mucha gente + publica en ella, pero el tono puede ser bastante seco y no siempre + es muy acogedor para los nuevos usuarios. Si su pregunta es válida, es + probable que obtenga una respuesta muy rápido, pero no espere que llegue + con caras sonrientes. Además, aquí más que en cualquier otro lugar, asegúrese de + usar el vocabulario correcto (de lo contrario, podría obtener una respuesta que señale + el mal uso de sus palabras en lugar de responder su + pregunta). También tendrá más éxito si su pregunta es sobre + una función de R base en lugar de un paquete específico. +- Si su pregunta es sobre un paquete específico, vea si hay una lista de correo + para él. Por lo general, se incluye en el archivo DESCRIPTION + del paquete, al que se puede acceder usando + `packageDescription("nombre-del-paquete")`. También puede intentar enviar + un correo electrónico directamente al autor del paquete o abrir una incidencia en el repositorio de código + (por ejemplo, GitHub). +- También hay algunas listas de correo de temas específicos (SIG, + filogenética, etc.); la lista completa está + [aquí](https://www.r-project.org/mail.html). + +### Más recursos + +- La [Guía de publicación](https://www.r-project.org/posting-guide.html) para + las listas de correo de R. + +- Cómo solicitar ayuda de R + + pautas útiles. + +- Esta publicación de blog de Jon Skeet - has quite comprehensive advice on how to ask programming questions. + tiene consejos bastante completos sobre cómo hacer preguntas sobre programación.
-- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package - is very helpful to create reproducible examples when asking for - help. The rOpenSci community call "How to ask questions so they get +- El paquete [reprex](https://cran.rstudio.com/web/packages/reprex/) + es muy útil para crear ejemplos reproducibles cuando se solicita + ayuda. The rOpenSci community call "How to ask questions so they get answered" (Github link and video recording) includes a presentation of the reprex package and of its philosophy. -## R packages +## Paquetes de R -### Loading packages +### Carga de paquetes -As we have seen above, R packages play a fundamental role in R. The -make use of a package's functionality, assuming it is installed, we -first need to load it to be able to use it. This is done with the -`library()` function. Below, we load `ggplot2`. +Como hemos visto anteriormente, los paquetes de R juegan un papel fundamental en R. Para +hacer uso de la funcionalidad de un paquete, suponiendo que esté instalado, +primero debemos cargarlo. Esto se hace con la función +`library()`. A continuación, cargamos `ggplot2`. ```{r loadp, eval=FALSE, purl=TRUE} -library("ggplot2") +library("ggplot2") ``` -### Installing packages +### Instalación de paquetes -The default package repository is The _Comprehensive R Archive -Network_ (CRAN), and any package that is available on CRAN can be -installed with the `install.packages()` function. Below, for example, -we install the `dplyr` package that we will learn about later. +El repositorio de paquetes predeterminado es la _Comprehensive R Archive +Network_ (CRAN), y cualquier paquete que esté disponible en CRAN se puede +instalar con la función `install.packages()`. A continuación, por ejemplo, +instalamos el paquete `dplyr`, que veremos más adelante.
```{r craninstall, eval=FALSE, purl=TRUE} -install.packages("dplyr") +install.packages("dplyr") ``` -This command will install the `dplyr` package as well as all its -dependencies, i.e. all the packages that it relies on to function. +Este comando instalará el paquete `dplyr` así como todas sus +dependencias, es decir, todos los paquetes de los que depende para funcionar. -Another major R package repository is maintained by Bioconductor. [Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, -namely `BiocManager`, that can be installed from CRAN with +Bioconductor mantiene otro importante repositorio de paquetes de R. Los [paquetes de Bioconductor](https://bioconductor.org/packages/release/BiocViews.html#___Software) se administran e instalan mediante un paquete dedicado, +concretamente `BiocManager`, que se puede instalar desde CRAN con ```{r, eval=FALSE, purl=TRUE} -install.packages("BiocManager") +install.packages("BiocManager") ``` -Individual packages such as `SummarizedExperiment` (we will use it -later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be -installed with `BiocManager::install`. +Los paquetes individuales como `SummarizedExperiment` (lo usaremos +más adelante), `DESeq2` (para análisis de RNA-Seq) y cualquier otro de Bioconductor o CRAN se pueden instalar +con `BiocManager::install`. ```{r, eval=FALSE, purl=TRUE} -BiocManager::install("SummarizedExperiment") +BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` -By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer.
While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. +De forma predeterminada, `BiocManager::install()` también verificará todos los paquetes instalados y comprobará si hay versiones más nuevas disponibles. Si las hay, se las mostrará, le preguntará si desea `Update all/some/none? [a/s/n]:` y luego esperará su respuesta. Si bien debe esforzarse por tener las versiones de paquetes más actualizadas, en la práctica recomendamos actualizar los paquetes solo en una nueva sesión de R antes de cargar cualquier paquete. :::::::::::::::::::::::::::::::::::::::: keypoints -- Start using R and RStudio +- Comience a usar R y RStudio -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From 3e4b9b4fddf48dcd448f14157be1681b25c20423 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:29:59 +0900 Subject: [PATCH 170/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 980 +++++++++++----------- 1 file changed, 490 insertions(+), 490 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index 49ba99f09..267cbe47a 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Introduction to R +title: Introducción a R teaching: 60 exercises: 60 --- @@ -10,374 +10,374 @@ exercises: 60 ::::::::::::::::::::::::::::::::::::::: objetivos -- Define the following terms as they relate to R: object, assign, call, function, arguments, options. -- Assign values to objects in R. -- Learn how to _name_ objects -- Use comments to inform script. -- Solve simple arithmetic operations in R. -- Call functions and use arguments to change their default options.
-- Inspect the content of vectors and manipulate their content. -- Subset and extract values from vectors. -- Analyze vectors with missing data. +- Definir los siguientes términos en relación con R: objeto, asignación, llamada, función, argumentos, opciones. +- Asignar valores a objetos en R. +- Aprender a _nombrar_ objetos. +- Utilizar comentarios para documentar los scripts. +- Resolver operaciones aritméticas simples en R. +- Llamar a funciones y usar argumentos para cambiar sus opciones predeterminadas. +- Inspeccionar el contenido de los vectores y manipular su contenido. +- Extraer subconjuntos y valores de vectores. +- Analizar vectores con datos faltantes. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: questions -- First commands in R +- Primeros comandos en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Este episodio se basa en la lección _Data Analysis and +> Visualisation in R for Ecologists_ de Data Carpentry. -## Creating objects in R +## Creando objetos en R -You can get output from R simply by typing math in the console: +Puede obtener resultados de R simplemente escribiendo operaciones matemáticas en la consola: ```{r, purl=TRUE} 3 + 5 -12 / 7 +12/7 ``` -However, to do useful and interesting things, we need to assign _values_ to -_objects_. To create an object, we need to give it a name followed by the -assignment operator `<-`, and the value we want to give it: +Sin embargo, para hacer cosas útiles e interesantes, necesitamos asignar _valores_ a +_objetos_.
Para crear un objeto, debemos darle un nombre seguido del operador de asignación +`<-` y el valor que queremos darle: ```{r, purl=TRUE} -weight_kg <- 55 +weight_kg <- 55 ``` -`<-` is the assignment operator. It assigns values on the right to -objects on the left. So, after executing `x <- 3`, the value of `x` is -`3`. The arrow can be read as 3 **goes into** `x`. For historical -reasons, you can also use `=` for assignments, but not in every -context. Because of the -[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) -in syntax, it is good practice to always use `<-` for assignments. +`<-` es el operador de asignación. Asigna los valores de la derecha a +los objetos de la izquierda. Entonces, después de ejecutar `x <- 3`, el valor de `x` es +`3`. La flecha se puede leer como 3 **entra en** `x`. Por razones +históricas, también puede usar `=` para asignaciones, pero no en todos los +contextos. Debido a las +[ligeras diferencias](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +en la sintaxis, es una buena práctica utilizar siempre `<-` para las asignaciones. In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> at the same time as the <kbd>\-</kbd> key) will write `<-` in a single keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. -### Naming variables - -Objects can be given any name such as `x`, `current_temperature`, or -`subject_id`. You want your object names to be explicit and not too -long. They cannot start with a number (`2x` is not valid, but `x2` -is). R is case sensitive (e.g., `weight_kg` is different from -`Weight_kg`). There are some names that cannot be used because they -are the names of fundamental functions in R (e.g., `if`, `else`, -`for`, see -[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) -for a complete list).
In general, even if it's allowed, it's best to -not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, -`weights`). If in doubt, check the help to see if the name is already -in use. It's also best to avoid dots (`.`) within an object name as in +### Nombrar variables + +A los objetos se les puede dar cualquier nombre, como `x`, `current_temperature` o +`subject_id`. Conviene que los nombres de sus objetos sean explícitos y no demasiado +largos. No pueden comenzar con un número (`2x` no es válido, pero `x2` +sí lo es). R distingue entre mayúsculas y minúsculas (por ejemplo, `weight_kg` es diferente de +`Weight_kg`). Hay algunos nombres que no se pueden usar porque +son los nombres de funciones fundamentales en R (por ejemplo, `if`, `else`, +`for`; consulte +[aquí](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +para obtener una lista completa). En general, incluso si está permitido, es mejor +no usar otros nombres de funciones (por ejemplo, `c`, `T`, `mean`, `data`, `df`, +`weights`). En caso de duda, consulte la ayuda para ver si el nombre ya está +en uso. También es mejor evitar los puntos (`.`) dentro del nombre de un objeto como en `my.dataset`. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it's best to avoid -them. It is also recommended to use nouns for object names, and verbs -for function names. It's important to be consistent in the styling of -your code (where you put spaces, how you name objects, etc.). Using a -consistent coding style makes your code clearer to read for your -future self and your collaborators. In R, some popular style guides -are [Google's](https://google.github.io/styleguide/Rguide.xml), the -[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +them. También se recomienda utilizar sustantivos para nombres de objetos y verbos +para nombres de funciones.
Es importante ser coherente en el estilo de +su código (dónde coloca los espacios, cómo nombra los objetos, etc.). El uso de un estilo de codificación +consistente hace que su código sea más claro de leer para su +futuro yo y sus colaboradores. En R, algunas guías de estilo populares +son las de [Google](https://google.github.io/styleguide/Rguide.xml), las +[de tidyverse](https://style.tidyverse.org/) y la Bioconductor style -guide. The -tidyverse's is very comprehensive and may seem overwhelming at -first. You can install the -[**`lintr`**](https://github.com/jimhester/lintr) package to -automatically check for issues in the styling of your code. +guide. La de +tidyverse es muy completa y puede parecer abrumadora al principio. +Puede instalar el paquete +[**`lintr`**](https://github.com/jimhester/lintr) para +verificar automáticamente si hay problemas en el estilo de su código. -> **Objects vs. variables**: What are known as `objects` in `R` are -> known as `variables` in many other programming languages. Depending -> on the context, `object` and `variable` can have drastically -> different meanings. However, in this lesson, the two words are used -> synonymously. For more information -> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) +> **Objetos versus variables**: Lo que se conoce como `objetos` en `R` +> se conoce como `variables` en muchos otros lenguajes de programación. Dependiendo +> del contexto, `objeto` y `variable` pueden tener significados drásticamente +> diferentes. Sin embargo, en esta lección, las dos palabras se usan +> como sinónimos. Para obtener más información +> [ver aquí.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) -When assigning a value to an object, R does not print anything. You -can force R to print the value by using parentheses or by typing the -object name: +Al asignar un valor a un objeto, R no imprime nada.
Usted +puede forzar a R a imprimir el valor usando paréntesis o escribiendo el nombre del +objeto: ```{r, purl=TRUE} -weight_kg <- 55 # doesn't print anything -(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` -weight_kg # and so does typing the name of the object +weight_kg <- 55 # no imprime nada +(weight_kg <- 55) # pero poner paréntesis alrededor de la llamada imprime el valor de `weight_kg` +weight_kg # y también lo hace escribir el nombre del objeto ``` -Now that R has `weight_kg` in memory, we can do arithmetic with it. For -instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): +Ahora que R tiene `weight_kg` en la memoria, podemos hacer aritmética con él. Por +ejemplo, es posible que deseemos convertir este peso en libras (el peso en libras es 2.2 veces el peso en kg): ```{r, purl=TRUE} -2.2 * weight_kg +2.2 * weight_kg ``` -We can also change an object's value by assigning it a new one: +También podemos cambiar el valor de un objeto asignándole uno nuevo: ```{r, purl=TRUE} -weight_kg <- 57.5 -2.2 * weight_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` -This means that assigning a value to one object does not change the values of -other objects For example, let's store the animal's weight in pounds in a new -object, `weight_lb`: +Esto significa que asignar un valor a un objeto no cambia los valores de +otros objetos. Por ejemplo, almacenemos el peso del animal en libras en un nuevo +objeto, `weight_lb`: ```{r, purl=TRUE} -weight_lb <- 2.2 * weight_kg +weight_lb <- 2.2 * weight_kg ``` -and then change `weight_kg` to 100. +y luego cambiemos `weight_kg` a 100. ```{r} -weight_kg <- 100 +weight_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -What do you think is the current content of the object `weight_lb`? -126\.5 or 220? +¿Cuál crees que es el contenido actual del objeto `weight_lb`?
+¿126\.5 o 220? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Comments +## Comentarios -The comment character in R is `#`, anything to the right of a `#` in a -script will be ignored by R. It is useful to leave notes, and -explanations in your scripts. +El carácter de comentario en R es `#`; cualquier cosa a la derecha de `#` en un script +será ignorada por R. Es útil dejar notas y +explicaciones en sus scripts. -RStudio makes it easy to comment or uncomment a paragraph: after -selecting the lines you want to comment, press at the same time on -your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If +RStudio hace que sea fácil comentar o descomentar un párrafo: después de +seleccionar las líneas que desea comentar, presione al mismo tiempo en +su teclado <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -What are the values after each statement in the following? +¿Cuáles son los valores después de cada instrucción en el siguiente código? ```{r, purl=TRUE} -mass <- 47.5 # mass? -age <- 122 # age? -mass <- mass * 2.0 # mass? -age <- age - 20 # age? -mass_index <- mass/age # mass_index? +mass <- 47.5 # ¿mass? +age <- 122 # ¿age? +mass <- mass * 2.0 # ¿mass? +age <- age - 20 # ¿age? +mass_index <- mass/age # ¿mass_index? ``` -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Functions and their arguments +## Funciones y sus argumentos -Functions are "canned scripts" that automate more complicated sets of commands -including operations assignments, etc.
Many functions are predefined, or can be -made available by importing R _packages_ (more on that later). A function -usually gets one or more inputs called _arguments_. Functions often (but not -always) return a _value_. A typical example would be the function `sqrt()`. The -input (the argument) must be a number, and the return value (in fact, the -output) is the square root of that number. Executing a function ('running it') -is called _calling_ the function. An example of a function call is: +Las funciones son "scripts predefinidos" que automatizan conjuntos de comandos más complicados, +incluyendo operaciones, asignaciones, etc. Muchas funciones están predefinidas o pueden +estar disponibles importando _paquetes_ de R (más sobre esto más adelante). Una función +generalmente recibe una o más entradas llamadas _argumentos_. Las funciones a menudo (pero no +siempre) devuelven un _valor_. Un ejemplo típico sería la función `sqrt()`. La entrada +(el argumento) debe ser un número y el valor de retorno (de hecho, la +salida) es la raíz cuadrada de ese número. Ejecutar una función ('ejecutarla') +se llama _llamar_ a la función. Un ejemplo de llamada a función es: ```{r, eval=FALSE, purl=FALSE} -b <- sqrt(a) +b <- sqrt(a) ``` -Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function -calculates the square root, and returns the value which is then assigned to -the object `b`. This function is very simple, because it takes just one argument. +Aquí, el valor de `a` se le da a la función `sqrt()`, la función `sqrt()` +calcula la raíz cuadrada y devuelve el valor, que luego se asigna +al objeto `b`. Esta función es muy sencilla, porque sólo necesita un argumento. The return 'value' of a function need not be numerical (like that of `sqrt()`), and it also does not need to be a single item: it can be a set of things, or -even a dataset.
Lo veremos cuando leamos archivos de datos en R. -Arguments can be anything, not only numbers or filenames, but also other -objects. Exactly what each argument means differs per function, and must be -looked up in the documentation (see below). Some functions take arguments which -may either be specified by the user, or, if left out, take on a _default_ value: -these are called _options_. Options are typically used to alter the way the -function operates, such as whether it ignores 'bad values', or what symbol to -use in a plot. However, if you want something specific, you can specify a value -of your choice which will be used instead of the default. +Los argumentos pueden ser cualquier cosa, no sólo números o nombres de archivos, sino también otros +objetos. Exactamente lo que significa cada argumento difiere según la función y se debe buscar +en la documentación (ver más abajo). Algunas funciones toman argumentos que +pueden ser especificados por el usuario o, si se omiten, tomar un valor _predeterminado_: +se denominan _opciones_. Las opciones se utilizan normalmente para alterar la forma en que opera la +función, por ejemplo, si ignora los 'valores incorrectos' o qué símbolo usar +en un gráfico. Sin embargo, si desea algo específico, puede especificar un valor +de su elección que se utilizará en lugar del valor predeterminado. -Let's try a function that can take multiple arguments: `round()`. +Probemos una función que pueda tomar múltiples argumentos: `round()`. ```{r, results="show", purl=TRUE} -round(3.14159) +round(3.14159) ``` -Here, we've called `round()` with just one argument, `3.14159`, and it has -returned the value `3`. That's because the default is to round to the nearest -whole number. If we want more digits we can see how to do that by getting -information about the `round` function. We can use `args(round)` or look at the -help for this function using `?round`. +Aquí, hemos llamado a `round()` con un solo argumento, `3.14159`, y +devolvió el valor `3`.
Esto se debe a que el valor predeterminado es redondear al número entero +más cercano. Si queremos más dígitos, podemos ver cómo hacerlo obteniendo +información sobre la función `round`. Podemos usar `args(round)` o mirar la ayuda +para esta función usando `?round`. ```{r, results="show", purl=TRUE} -args(round) +args(round) ``` ```{r, eval=FALSE, purl=TRUE} -?round +?round ``` -We see that if we want a different number of digits, we can -type `digits=2` or however many we want. +Vemos que si queremos una cantidad diferente de dígitos, podemos +escribir `digits=2` o la cantidad que queramos. ```{r, results="show", purl=TRUE} -round(3.14159, digits = 2) +round(3.14159, digits = 2) ``` -If you provide the arguments in the exact same order as they are defined you -don't have to name them: +Si proporciona los argumentos exactamente en el mismo orden en que están definidos, +no tiene que nombrarlos: ```{r, results="show", purl=TRUE} -round(3.14159, 2) +round(3.14159, 2) ``` -And if you do name the arguments, you can switch their order: +Y si nombra los argumentos, puede cambiar su orden: ```{r, results="show", purl=TRUE} -round(digits = 2, x = 3.14159) +round(digits = 2, x = 3.14159) ``` -It's good practice to put the non-optional arguments (like the number you're -rounding) first in your function call, and to specify the names of all optional -arguments. If you don't, someone reading your code might have to look up the -definition of a function with unfamiliar arguments to understand what you're -doing. By specifying the name of the arguments you are also safeguarding -against possible future changes in the function interface, which may -potentially add new arguments in between the existing ones. +Es una buena práctica poner los argumentos no opcionales (como el número que está +redondeando) primero en su llamada de función y especificar los nombres de todos los argumentos +opcionales. 
Si no lo hace, es posible que alguien que lea su código tenga que buscar la +definición de una función con argumentos desconocidos para comprender lo que está +haciendo. Al especificar el nombre de los argumentos, también se protege +contra posibles cambios futuros en la interfaz de la función, que pueden +potencialmente agregar nuevos argumentos entre los existentes. -## Vectors and data types +## Vectores y tipos de datos -A vector is the most common and basic data type in R, and is pretty much -the workhorse of R. A vector is composed by a series of values, such as -numbers or characters. We can assign a series of values to a vector using -the `c()` function. For example we can create a vector of animal weights and assign -it to a new object `weight_g`: +Un vector es el tipo de datos más común y básico en R, y es prácticamente +el caballo de batalla de R. Un vector está compuesto por una serie de valores, como +números o caracteres. Podemos asignar una serie de valores a un vector usando +la función `c()`. Por ejemplo, podemos crear un vector de pesos de animales y asignarlo +a un nuevo objeto `weight_g`: ```{r, purl=TRUE} -weight_g <- c(50, 60, 65, 82) -weight_g +weight_g <- c(50, 60, 65, 82) +weight_g ``` -A vector can also contain characters: +Un vector también puede contener caracteres: ```{r, purl=TRUE} -molecules <- c("dna", "rna", "protein") -molecules +molecules <- c("dna", "rna", "protein") +molecules ``` -The quotes around "dna", "rna", etc. are essential here. Without the -quotes R will assume there are objects called `dna`, `rna` and -`protein`. As these objects don't exist in R's memory, there will be -an error message. +Las comillas alrededor de "dna", "rna", etc. son esenciales aquí. Sin las +comillas, R asumirá que hay objetos llamados `dna`, `rna` y +`protein`. Como estos objetos no existen en la memoria de R, aparecerá +un mensaje de error. -There are many functions that allow you to inspect the content of a -vector.
`length()` tells you how many elements are in a particular vector: +Hay muchas funciones que le permiten inspeccionar el contenido de un +vector. `length()` le dice cuántos elementos hay en un vector en particular: ```{r, purl=TRUE} -length(weight_g) -length(molecules) +length(weight_g) +length(molecules) ``` -An important feature of a vector, is that all of the elements are the -same type of data. The function `class()` indicates the class (the -type of element) of an object: +Una característica importante de un vector es que todos los elementos son +del mismo tipo de datos. La función `class()` indica la clase (el tipo de +elemento) de un objeto: ```{r, purl=TRUE} -class(weight_g) -class(molecules) +class(weight_g) +class(molecules) ``` -The function `str()` provides an overview of the structure of an -object and its elements. It is a useful function when working with -large and complex objects: +La función `str()` proporciona una descripción general de la estructura de un objeto +y sus elementos. Es una función útil cuando se trabaja con +objetos grandes y complejos: ```{r, purl=TRUE} -str(weight_g) -str(molecules) +str(weight_g) +str(molecules) ``` -You can use the `c()` function to add other elements to your vector: +Puede usar la función `c()` para agregar otros elementos a su vector: ```{r} -weight_g <- c(weight_g, 90) # add to the end of the vector -weight_g <- c(30, weight_g) # add to the beginning of the vector -weight_g +weight_g <- c(weight_g, 90) # agregar al final del vector +weight_g <- c(30, weight_g) # agregar al principio del vector +weight_g ``` -In the first line, we take the original vector `weight_g`, add the -value `90` to the end of it, and save the result back into -`weight_g`. Then we add the value `30` to the beginning, again saving -the result back into `weight_g`. +En la primera línea, tomamos el vector original `weight_g`, agregamos el valor +`90` al final y guardamos el resultado nuevamente en +`weight_g`.
Luego agregamos el valor `30` al principio, guardando +el resultado nuevamente en `weight_g`. -We can do this over and over again to grow a vector, or assemble a -dataset. As we program, this may be useful to add results that we are -collecting or calculating. +Podemos hacer esto una y otra vez para hacer crecer un vector o ensamblar un conjunto de +datos. Mientras programamos, esto puede ser útil para agregar resultados que estamos +recopilando o calculando. -An **atomic vector** is the simplest R **data type** and is a linear -vector of a single type. Above, we saw 2 of the 6 main **atomic -vector** types that R uses: `"character"` and `"numeric"` (or -`"double"`). These are the basic building blocks that all R objects -are built from. The other 4 **atomic vector** types are: +Un **vector atómico** es el **tipo de datos** más simple de R y es un vector +lineal de un solo tipo. Arriba, vimos 2 de los 6 principales tipos de **vectores +atómicos** que usa R: `"character"` y `"numeric"` (o +`"double"`). Estos son los componentes básicos a partir de los cuales se construyen todos los objetos +de R. 
Los otros 4 tipos de **vectores atómicos** son: -- `"logical"` for `TRUE` and `FALSE` (the boolean data type) -- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R - that it's an integer) -- `"complex"` to represent complex numbers with real and imaginary - parts (e.g., `1 + 4i`) and that's all we're going to say about them -- `"raw"` for bitstreams that we won't discuss further +- `"logical"` para `TRUE` y `FALSE` (el tipo de datos booleano) +- `"integer"` para números enteros (por ejemplo, `2L`, la `L` indica a R + que es un número entero) +- `"complex"` para representar números complejos con partes + reales e imaginarias (por ejemplo, `1 + 4i`) y eso es todo lo que vamos a decir sobre ellos +- `"raw"` para flujos de bits que no discutiremos más -You can check the type of your vector using the `typeof()` function -and inputting your vector as the argument. +Puede verificar el tipo de su vector usando la función `typeof()` +y pasando su vector como argumento. -Vectors are one of the many **data structures** that R uses. Other -important ones are lists (`list`), matrices (`matrix`), data frames -(`data.frame`), factors (`factor`) and arrays (`array`). +Los vectores son una de las muchas **estructuras de datos** que utiliza R. Otras +importantes son listas (`list`), matrices (`matrix`), marcos de datos +(`data.frame`), factores (`factor`) y arreglos (`array`). -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -We've seen that atomic vectors can be of type character, numeric (or -double), integer, and logical. But what happens if we try to mix -these types in a single vector? +Hemos visto que los vectores atómicos pueden ser de tipo carácter, numérico (o +doble), entero y lógico. Pero ¿qué pasa si intentamos mezclar +estos tipos en un solo vector?
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -R implicitly converts them to all be the same type +R los convierte implícitamente para que todos sean del mismo tipo -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -What will happen in each of these examples? (hint: use `class()` to -check the data type of your objects and type in their names to see what happens): +¿Qué pasará en cada uno de estos ejemplos? (pista: use `class()` para +verificar el tipo de datos de sus objetos y escriba sus nombres para ver qué sucede): ```{r, eval=TRUE} num_char <- c(1, 2, 3, "a") -num_logical <- c(1, 2, 3, TRUE, FALSE) -char_logical <- c("a", "b", "c", TRUE) -tricky <- c(1, 2, 3, "4") +num_logical <- c(1, 2, 3, TRUE, FALSE) +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} class(num_char) @@ -390,151 +390,151 @@ class(tricky) tricky ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Why do you think it happens? +¿Por qué crees que sucede? -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -Vectors can be of only one data type. R tries to convert (coerce) -the content of this vector to find a _common denominator_ that -doesn't lose any information. +Los vectores pueden ser de un solo tipo de datos. 
R intenta convertir (coercionar) +el contenido de este vector para encontrar un _denominador común_ que +no pierda ninguna información. -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -How many values in `combined_logical` are `"TRUE"` (as a character) -in the following example: +¿Cuántos valores en `combined_logical` son `"TRUE"` (como carácter) +en el siguiente ejemplo? ```{r, eval=TRUE} -num_logical <- c(1, 2, 3, TRUE) -char_logical <- c("a", "b", "c", TRUE) -combined_logical <- c(num_logical, char_logical) +num_logical <- c(1, 2, 3, TRUE) +char_logical <- c("a", "b", "c", TRUE) +combined_logical <- c(num_logical, char_logical) ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -Only one. There is no memory of past data types, and the coercion -happens the first time the vector is evaluated. Therefore, the `TRUE` -in `num_logical` gets converted into a `1` before it gets converted -into `"1"` in `combined_logical`. +Solo uno. No hay memoria de tipos de datos pasados y la coerción +ocurre la primera vez que se evalúa el vector. Por lo tanto, el `TRUE` +en `num_logical` se convierte en un `1` antes de convertirse +en `"1"` en `combined_logical`. ```{r} -combined_logical +combined_logical ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -In R, we call converting objects from one class into another class -_coercion_. These conversions happen according to a hierarchy, -whereby some types get preferentially coerced into other types. 
Can -you draw a diagram that represents the hierarchy of how these data -types are coerced? +En R, a la conversión de objetos de una clase a otra la llamamos +_coerción_. Estas conversiones ocurren según una jerarquía +por la cual algunos tipos se convierten preferentemente en otros tipos. ¿Puedes +dibujar un diagrama que represente la jerarquía de coerción de estos tipos de +datos? -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -logical → numeric → character ← logical +logical → numeric → character ← logical -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} -## We've seen that atomic vectors can be of type character, numeric, integer, and -## logical. But what happens if we try to mix these types in a single +## Hemos visto que los vectores atómicos pueden ser de tipo character, numeric, integer y +## logical. Pero, ¿qué pasa si intentamos mezclar estos tipos en un solo ## vector? -## What will happen in each of these examples? (hint: use `class()` to -## check the data type of your object) +## ¿Qué pasará en cada uno de estos ejemplos? (pista: use `class()` para +## verificar el tipo de datos de su objeto) num_char <- c(1, 2, 3, "a") -num_logical <- c(1, 2, 3, TRUE) +num_logical <- c(1, 2, 3, TRUE) -char_logical <- c("a", "b", "c", TRUE) +char_logical <- c("a", "b", "c", TRUE) -tricky <- c(1, 2, 3, "4") +tricky <- c(1, 2, 3, "4") -## Why do you think it happens? +## ¿Por qué crees que sucede? -## You've probably noticed that objects of different types get -## converted into a single, shared type within a vector. In R, we call -## converting objects from one class into another class -## _coercion_. These conversions happen according to a hierarchy, -## whereby some types get preferentially coerced into other types. 
Can -## you draw a diagram that represents the hierarchy of how these data -## types are coerced? +## Probablemente hayas notado que objetos de diferentes tipos se +## convertidos en un tipo único y compartido dentro de un vector. En R, llamamos +## convertir objetos de una clase a otra clase +## _coerción_. Estas conversiones ocurren según una jerarquía, +## mediante la cual algunos tipos son preferentemente forzados a convertirse en otros tipos. ¿Puedes +## dibujar un diagrama que represente la jerarquía de cómo se coaccionan estos tipos de datos +##? ``` -## Subsetting vectors +## Subconjunto de vectores -If we want to extract one or several values from a vector, we must -provide one or several indices in square brackets. For instance: +Si queremos extraer uno o varios valores de un vector, debemos +proporcionar uno o varios índices entre corchetes. Por ejemplo: ```{r, results="show", purl=TRUE} -molecules <- c("dna", "rna", "peptide", "protein") -molecules[2] -molecules[c(3, 2)] +moléculas <- c("adn", "rna", "péptido", "proteína") +moléculas[2] +moléculas[c(3, 2)] ``` -We can also repeat the indices to create an object with more elements -than the original one: +También podemos repetir los índices para crear un objeto con más elementos +que el original: ```{r, results="show", purl=TRUE} -more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules <- moléculas[c(1, 2, 3, 2, 1, 4)] more_molecules ``` -R indices start at 1. Programming languages like Fortran, MATLAB, -Julia, and R start counting at 1, because that's what human beings -typically do. Languages in the C family (including C++, Java, Perl, -and Python) count from 0 because that's simpler for computers to do. +Los índices R comienzan en 1. Los lenguajes de programación como Fortran, MATLAB, +Julia y R comienzan a contar en 1, porque eso es lo que normalmente hacen los seres humanos +. 
Los lenguajes de la familia C (incluidos C++, Java, Perl +y Python) cuentan desde 0 porque es más sencillo para las computadoras. -Finally, it is also possible to get all the elements of a vector -except some specified elements using negative indices: +Finalmente, también es posible obtener todos los elementos de un vector +excepto algunos elementos especificados, usando índices negativos: ```{r} -molecules ## all molecules -molecules[-1] ## all but the first one -molecules[-c(1, 3)] ## all but 1st/3rd ones -molecules[c(-1, -3)] ## all but 1st/3rd ones +moléculas ## todas las moléculas +moléculas[-1] ## todas menos la primera +moléculas[-c(1, 3)] ## todas menos la primera/tercera +moléculas[c(-1, -3)] ## todas menos la primera/tercera ``` -## Conditional subsetting +## Subconjunto condicional -Another common way of subsetting is by using a logical vector. `TRUE` will -select the element with the same index, while `FALSE` will not: +Otra forma común de extraer subconjuntos es mediante el uso de un vector lógico. `TRUE` +seleccionará el elemento con el mismo índice, mientras que `FALSE` no: ```{r, purl=TRUE} -weight_g <- c(21, 34, 39, 54, 55) -weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +peso_g <- c(21, 34, 39, 54, 55) +peso_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] ``` -Typically, these logical vectors are not typed by hand, but are the -output of other functions or logical tests. For instance, if you -wanted to select only the values above 50: +Normalmente, estos vectores lógicos no se escriben a mano, sino que son la salida +de otras funciones o pruebas lógicas. 
Por ejemplo, si +quisieras seleccionar solo los valores superiores a 50: ```{r, purl=TRUE} ## will return logicals with TRUE for the indices that meet @@ -544,24 +544,24 @@ weight_g > 50 weight_g[weight_g > 50] ``` -You can combine multiple tests using `&` (both conditions are true, -AND) or `|` (at least one of the conditions is true, OR): +Puede combinar varias pruebas usando `&` (ambas condiciones son verdaderas, +AND) o `|` (al menos una de las condiciones es verdadera, O): ```{r, results="show", purl=TRUE} -weight_g[weight_g < 30 | weight_g > 50] -weight_g[weight_g >= 30 & weight_g == 21] +peso_g[peso_g < 30 | peso_g > 50] +peso_g[peso_g >= 30 & peso_g == 21] ``` -Here, `<` stands for "less than", `>` for "greater than", `>=` for -"greater than or equal to", and `==` for "equal to". The double equal +Aquí, `<` significa "menor que", `>` para "mayor que", `>=` para +"mayor o igual que" y `==` para "igual a". The double equal sign `==` is a test for numerical equality between the left and right hand sides, and should not be confused with the single `=` sign, which performs variable assignment (similar to `<-`). -A common task is to search for certain strings in a vector. One could -use the "or" operator `|` to test for equality to multiple values, but -this can quickly become tedious. The function `%in%` allows you to -test if any of the elements of a search vector are found: +Una tarea común es buscar determinadas cadenas en un vector. Uno podría +usar el operador "o" `|` para probar la igualdad de múltiples valores, pero +esto puede volverse tedioso rápidamente. 
La función `%in%` le permite +probar si se encuentra alguno de los elementos de un vector de búsqueda: ```{r, purl=TRUE} molecules <- c("dna", "rna", "protein", "peptide") @@ -570,60 +570,60 @@ molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Can you figure out why `"four" > "five"` returns `TRUE`? +¿Puedes entender por qué `"four" > "five"` devuelve `TRUE`? -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r} -"four" > "five" +"four" > "five" ``` -When using `>` or `<` on strings, R compares their alphabetical order. -Here `"four"` comes after `"five"`, and therefore is _greater than_ -it. +Cuando se usa `>` o `<` en cadenas, R compara su orden alfabético. +Aquí `"four"` viene después de `"five"` y, por lo tanto, es _mayor que_ +este. -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Names +## Nombres -It is possible to name each element of a vector. The code chunk below -shows an initial vector without any names, how names are set, and -retrieved. +Es posible nombrar cada elemento de un vector. El fragmento de código siguiente +muestra un vector inicial sin ningún nombre, cómo se asignan los nombres y +cómo se recuperan. ```{r} x <- c(1, 5, 3, 5, 10) -names(x) ## no names -names(x) <- c("A", "B", "C", "D", "E") -names(x) ## now we have names +names(x) ## sin nombres +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## ahora tenemos nombres ``` -When a vector has names, it is possible to access elements by their -name, in addition to their index. +Cuando un vector tiene nombres, es posible acceder a los elementos por su +nombre, además de por su índice. 
```{r} x[c(1, 3)] x[c("A", "C")] ``` -## Missing data +## Datos perdidos -As R was designed to analyze datasets, it includes the concept of -missing data (which is uncommon in other programming -languages). Missing data are represented in vectors as `NA`. +Como R fue diseñado para analizar conjuntos de datos, incluye el concepto de +datos faltantes (lo cual es poco común en otros lenguajes de programación +). Los datos faltantes se representan en vectores como "NA". -When doing operations on numbers, most functions will return `NA` if -the data you are working with include missing values. This feature -makes it harder to overlook the cases where you are dealing with -missing data. You can add the argument `na.rm = TRUE` to calculate -the result while ignoring the missing values. +Al realizar operaciones con números, la mayoría de las funciones devolverán `NA` si +los datos con los que está trabajando incluyen valores faltantes. Esta característica +hace que sea más difícil pasar por alto los casos en los que se trata de +datos faltantes. Puede agregar el argumento `na.rm = TRUE` para calcular +el resultado ignorando los valores faltantes. ```{r} heights <- c(2, 4, 4, NA, 6) @@ -633,292 +633,292 @@ mean(heights, na.rm = TRUE) max(heights, na.rm = TRUE) ``` -If your data include missing values, you may want to become familiar -with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See -below for examples. +Si sus datos incluyen valores faltantes, es posible que desee familiarizarse +con las funciones `is.na()`, `na.omit()` y `complete.cases()`. Consulte +a continuación para ver ejemplos. ```{r} -## Extract those elements which are not missing values. +## Extrae aquellos elementos a los que no les faltan valores. heights[!is.na(heights)] -## Returns the object with incomplete cases removed. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). +## Devuelve el objeto sin casos incompletos. 
+## El objeto devuelto es un vector atómico de tipo `"numeric"` +## (o `"double"`). na.omit(heights) -## Extract those elements which are complete cases. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). -heights[complete.cases(heights)] +## Extrae aquellos elementos que sean casos completos. +## El objeto devuelto es un vector atómico de tipo `"numeric"` +## (o `"double"`). +heights[complete.cases(heights)] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -1. Using this vector of heights in inches, create a new vector with the NAs removed. +1. Usando este vector de alturas en pulgadas, cree un nuevo vector sin los NA. ```{r} -heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +alturas <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) ``` -2. Use the function `median()` to calculate the median of the `heights` vector. -3. Use R to figure out how many people in the set are taller than 67 inches. +2. Utilice la función `median()` para calcular la mediana del vector `alturas`. +3. Utilice R para calcular cuántas personas en el grupo miden más de 67 pulgadas. 
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} -heights_no_na <- heights[!is.na(heights)] -## or -heights_no_na <- na.omit(heights) +alturas_no_na <- alturas[!is.na(alturas)] +## o +alturas_no_na <- na.omit(alturas) ``` ```{r, purl=TRUE} -median(heights, na.rm = TRUE) +median(alturas, na.rm = TRUE) ``` ```{r, purl=TRUE} -heights_above_67 <- heights_no_na[heights_no_na > 67] -length(heights_above_67) +alturas_arriba_67 <- alturas_no_na[alturas_no_na > 67] +length(alturas_arriba_67) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Generating vectors {#sec:genvec} +## Generando vectores {#sec:genvec} ```{r, echo=FALSE} -set.seed(1) +set.seed(1) ``` -### Constructors +### Constructores -There exists some functions to generate vectors of different type. To -generate a vector of numerics, one can use the `numeric()` -constructor, providing the length of the output vector as -parameter. The values will be initialised with 0. +Existen algunas funciones para generar vectores de diferente tipo. Para +generar un vector de números, se puede usar el constructor `numeric()`, +proporcionando la longitud del vector de salida como +parámetro. Los valores se inicializarán con 0. ```{r, purl=TRUE} -numeric(3) -numeric(10) +numeric(3) +numeric(10) ``` -Note that if we ask for a vector of numerics of length 0, we obtain -exactly that: +Tenga en cuenta que si pedimos un vector de números de longitud 0, obtenemos +exactamente eso: ```{r, purl=TRUE} -numeric(0) +numeric(0) ``` -There are similar constructors for characters and logicals, named -`character()` and `logical()` respectively. +Hay constructores similares para caracteres y lógicos, llamados +`character()` y `logical()` respectivamente. 
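The constructors described above can be sketched as follows. This is only an illustration: the object names `n3` and `n0` are hypothetical and do not appear in the lesson. The point is that the single argument fixes the length of the result, while the constructor itself fixes its type.

```r
## Illustrative sketch; `n3` and `n0` are hypothetical names.
n3 <- numeric(3)        # numeric vector of length 3, every element is 0
n0 <- numeric(0)        # a valid numeric vector that has no elements

length(n3)              # 3
length(n0)              # 0
typeof(n3)              # "double"

## The other constructors take the same length argument.
typeof(character(2))    # "character"
typeof(logical(2))      # "logical"
```

An empty vector such as `n0` is still a numeric vector, so it can be grown later with `c()`.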
-::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -What are the defaults for character and logical vectors? +¿Cuáles son los valores predeterminados para los vectores lógicos y de caracteres? -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} -character(2) ## the empty character -logical(2) ## FALSE +character(2) ## el carácter vacío +logical(2) ## FALSE ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Replicate elements +### Replicar elementos -The `rep` function allow to repeat a value a certain number of -times. If we want to initiate a vector of numerics of length 5 with -the value -1, for example, we could do the following: +La función `rep` permite repetir un valor un cierto número de +veces. Si queremos inicializar un vector numérico de longitud 5 con +el valor -1, por ejemplo, podríamos hacer lo siguiente: ```{r, purl=TRUE} -rep(-1, 5) +rep(-1, 5) ``` -Similarly, to generate a vector populated with missing values, which -is often a good way to start, without setting assumptions on the data -to be collected: +De manera similar, para generar un vector poblado con valores faltantes, lo cual +suele ser una buena forma de comenzar, sin establecer suposiciones sobre los datos +que se recopilarán: ```{r, purl=TRUE} -rep(NA, 5) +rep(NA, 5) ``` -`rep` can take vectors of any length as input (above, we used vectors -of length 1) and any type. +`rep` puede tomar vectores de cualquier longitud como entrada (arriba, usamos vectores +de longitud 1) y de cualquier tipo. 
Por ejemplo, si queremos repetir los valores +1, 2 y 3 cinco veces, haríamos lo siguiente: ```{r, purl=TRUE} -rep(c(1, 2, 3), 5) +rep(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -What if we wanted to repeat the values 1, 2 and 3 five times, but -obtain five 1s, five 2s and five 3s in that order? There are two -possibilities - see `?rep` or `?sort` for help. +¿Qué pasaría si quisiéramos repetir los valores 1, 2 y 3 cinco veces, pero +obteniendo cinco 1, cinco 2 y cinco 3 en ese orden? Hay dos +posibilidades; consulte `?rep` o `?sort` para obtener ayuda. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} -rep(c(1, 2, 3), each = 5) +rep(c(1, 2, 3), each = 5) sort(rep(c(1, 2, 3), 5)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Sequence generation +### Generación de secuencias -Another very useful function is `seq`, to generate a sequence of -numbers. For example, to generate a sequence of integers from 1 to 20 -by steps of 2, one would use: +Otra función muy útil es `seq`, para generar una secuencia de +números. 
Por ejemplo, para generar una secuencia de números enteros del 1 al 20 +en pasos de 2, se usaría: ```{r, purl=TRUE} -seq(from = 1, to = 20, by = 2) +seq(from = 1, to = 20, by = 2) ``` -The default value of `by` is 1 and, given that the generation of a -sequence of one value to another with steps of 1 is frequently used, -there's a shortcut: +El valor predeterminado de `by` es 1 y, dado que con frecuencia se genera una secuencia +de un valor a otro con pasos de 1, +hay un atajo: ```{r, purl=TRUE} -seq(1, 5, 1) -seq(1, 5) ## default by +seq(1, 5, 1) +seq(1, 5) ## valor predeterminado de by 1:5 ``` -To generate a sequence of numbers from 1 to 20 of final length of 3, -one would use: +Para generar una secuencia de números del 1 al 20 con una longitud final de 3, +se usaría: ```{r, purl=TRUE} -seq(from = 1, to = 20, length.out = 3) +seq(from = 1, to = 20, length.out = 3) ``` -### Random samples and permutations +### Muestras aleatorias y permutaciones -A last group of useful functions are those that generate random -data. The first one, `sample`, generates a random permutation of -another vector. For example, to draw a random order to 10 students -oral exam, I first assign each student a number from 1 to ten (for -instance based on the alphabetic order of their name) and then: +Un último grupo de funciones útiles son aquellas que generan datos +aleatorios. La primera, `sample`, genera una permutación aleatoria de +otro vector. Por ejemplo, para sortear un orden aleatorio para el examen oral de 10 +estudiantes, primero le asigno a cada estudiante un número del 1 al 10 (por ejemplo, +según el orden alfabético de su nombre) y luego: ```{r, purl=TRUE} -sample(1:10) +sample(1:10) ``` -Without further arguments, `sample` will return a permutation of all -elements of the vector. If I want a random sample of a certain size, I -would set this value as the second argument. 
Below, I sample 5 random -letters from the alphabet contained in the pre-defined `letters` vector: +Sin más argumentos, `sample` devolverá una permutación de todos los +elementos del vector. Si quiero una muestra aleatoria de un cierto tamaño, +establecería este valor como segundo argumento. A continuación, tomo una muestra de 5 letras +aleatorias del alfabeto contenido en el vector predefinido `letters`: ```{r, purl=TRUE} -sample(letters, 5) +sample(letters, 5) ``` -If I wanted an output larger than the input vector, or being able to -draw some elements multiple times, I would need to set the `replace` -argument to `TRUE`: +Si quisiera una salida más grande que el vector de entrada, o poder +extraer algunos elementos varias veces, necesitaría establecer el argumento `replace` +en `TRUE`: ```{r, purl=TRUE} -sample(1:5, 10, replace = TRUE) +sample(1:5, 10, replace = TRUE) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -When trying the functions above out, you will have realised that the -samples are indeed random and that one doesn't get the same -permutation twice. To be able to reproduce these random draws, one can -set the random number generation seed manually with `set.seed()` -before drawing the random sample. +Al probar las funciones anteriores, te habrás dado cuenta de que las muestras +son realmente aleatorias y que no se obtiene la misma permutación +dos veces. Para poder reproducir estos sorteos aleatorios, se puede +configurar la semilla de generación de números aleatorios manualmente con `set.seed()` +antes de extraer la muestra aleatoria. -Test this feature with your neighbour. First draw two random -permutations of `1:10` independently and observe that you get -different results. +Prueba esta característica con tu vecino. Primero extrae dos permutaciones aleatorias +de `1:10` de forma independiente y observa que obtienes +resultados diferentes. 
-Now set the seed with, for example, `set.seed(123)` and repeat the -random draw. Observe that you now get the same random draws. +Ahora establece la semilla con, por ejemplo, `set.seed(123)` y repite el sorteo +aleatorio. Observa que ahora obtienes los mismos sorteos aleatorios. -Repeat by setting a different seed. +Repite estableciendo una semilla diferente. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -Different permutations +Diferentes permutaciones ```{r, purl=TRUE} -sample(1:10) -sample(1:10) +sample(1:10) +sample(1:10) ``` -Same permutations with seed 123 +Mismas permutaciones con la semilla 123 ```{r, purl=TRUE} set.seed(123) -sample(1:10) +sample(1:10) set.seed(123) -sample(1:10) +sample(1:10) ``` -A different seed +Una semilla diferente ```{r, purl=TRUE} set.seed(1) -sample(1:10) +sample(1:10) set.seed(1) -sample(1:10) +sample(1:10) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Drawing samples from a normal distribution +### Extraer muestras de una distribución normal -The last function we are going to see is `rnorm`, that draws a random -sample from a normal distribution. Two normal distributions of means 0 -and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and -_N(100, 5)_, are shown below. +La última función que vamos a ver es `rnorm`, que extrae una muestra aleatoria +de una distribución normal. A continuación se muestran dos distribuciones normales de medias 0 +y 100 y desviaciones estándar 1 y 5, denotadas _N(0, 1)_ y +_N(100, 5)_. 
```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} par(mfrow = c(1, 2)) -plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") -plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") ``` -The three arguments, `n`, `mean` and `sd`, define the size of the -sample, and the parameters of the normal distribution, i.e the mean -and its standard deviation. The defaults of the latter are 0 and 1. +Los tres argumentos, `n`, `mean` y `sd`, definen el tamaño de la muestra +y los parámetros de la distribución normal, es decir, la media +y su desviación estándar. Los valores predeterminados de estos dos últimos son 0 y 1. ```{r, purl=TRUE} -rnorm(5) -rnorm(5, 2, 2) -rnorm(5, 100, 5) +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) ``` -Now that we have learned how to write scripts, and the basics of R's -data structures, we are ready to start working with larger data, and -learn about data frames. +Ahora que hemos aprendido cómo escribir scripts y los conceptos básicos de las estructuras de datos +de R, estamos listos para comenzar a trabajar con datos más grandes y +aprender sobre marcos de datos. 
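To recap the `rnorm` arguments described above, here is a minimal sketch. The seed value `42` and the object name `draws` are arbitrary choices made for this illustration, not part of the lesson. It shows that the sample statistics only approximate the requested `mean` and `sd`, and that fixing the seed makes a draw repeatable.

```r
## Illustrative sketch; the seed 42 and the name `draws` are arbitrary.
set.seed(42)
draws <- rnorm(1000, mean = 100, sd = 5)

length(draws)   # 1000, the sample size given by n
mean(draws)     # close to 100, but not exactly 100
sd(draws)       # close to 5, but not exactly 5

## Named and positional arguments are equivalent, and resetting the
## seed reproduces exactly the same sample.
set.seed(42)
identical(draws, rnorm(1000, 100, 5))
```

Larger values of `n` bring the sample mean and standard deviation closer to the parameters passed to `rnorm`.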
-:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: keypoints -- How to interact with R +- Cómo interactuar con R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From fd233222b4f22f4958200b45385ffcc442a0bced Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:08 +0900 Subject: [PATCH 171/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 894 +++++++++---------- 1 file changed, 447 insertions(+), 447 deletions(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index 65bef1be3..d60804211 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Starting with data +title: Comenzando con datos teaching: 30 exercises: 30 --- @@ -10,244 +10,244 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objetivos -- Describe what a `data.frame` is. -- Load external data from a .csv file into a data frame. -- Summarize the contents of a data frame. -- Describe what a factor is. -- Convert between strings and factors. -- Reorder and rename factors. -- Format dates. -- Export and save data. +- Describir qué es un `data.frame`. +- Cargar datos externos desde un archivo .csv en un marco de datos. +- Resumir el contenido de un marco de datos. +- Describir qué es un factor. +- Convertir entre cadenas y factores. +- Reordenar y cambiar el nombre de los factores. +- Dar formato a las fechas. +- Exportar y guardar datos. :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- First data analysis in R - -:::::::::::::::::::::::::::::::::::::::::::::::::: - -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. 
- -## Presentation of the gene expression data - -We are going to use part of the data published by Blackmore , _The -effect of upper-respiratory infection on transcriptomic changes in the -CNS_. The goal of the study was to determine the effect of an -upper-respiratory infection on changes in RNA transcription occurring -in the cerebellum and spinal cord post infection. Gender matched eight -week old C57BL/6 mice were inoculated with saline or with Influenza A by -intranasal route and transcriptomic changes in the cerebellum and -spinal cord tissues were evaluated by RNA-seq at days 0 -(non-infected), 4 and 8. - -The dataset is stored as a comma-separated values (CSV) file. Each row -holds information for a single RNA expression measurement, and the first eleven -columns represent: - -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | -| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | -| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | - -We are going to use the R function `download.file()` to download the -CSV file that contains the gene expression data, and we will use -`read.csv()` to load into memory the content of the CSV file as an -object of class `data.frame`. Inside the `download.file` command, the -first entry is a character string with the source URL. 
This source URL -downloads a CSV file from a GitHub repository. The text after the -comma (`"data/rnaseq.csv"`) is the destination of the file on your -local machine. You'll need to have a folder on your machine called -`"data"` where you'll download the file. So this command downloads the -remote file, names it `"rnaseq.csv"` and adds it to a preexisting -folder named `"data"`. +- Primer análisis de datos en R + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: + +> Este episodio se basa en la lección _Data Analysis and +> Visualisation in R for Ecologists_ de Data Carpentries. + +## Presentación de los datos de expresión génica + +Vamos a utilizar parte de los datos publicados por Blackmore , _The +effect of upper-respiratory infection on transcriptomic changes in the +CNS_. El objetivo del estudio fue determinar el efecto de una +infección de las vías respiratorias superiores sobre los cambios en la transcripción del ARN que ocurren +en el cerebelo y la médula espinal después de la infección. Se inocularon ratones C57BL/6 de ocho +semanas de edad, emparejados por sexo, con solución salina o con Influenza A por +vía intranasal y se evaluaron los cambios transcriptómicos en los tejidos del cerebelo y +de la médula espinal mediante RNA-seq en los días 0 +(no infectados), 4 y 8. + +El conjunto de datos se almacena como un archivo de valores separados por comas (CSV). Cada fila +contiene información para una única medición de expresión de ARN, y las primeras once columnas +representan: + +| Columna | Descripción | +| --------- | ---------------------------------------------------------------------------------------------------------------------------- | +| gene | El nombre del gen que se midió. | +| sample | El nombre de la muestra en la que se midió la expresión génica. | +| expression | El valor de la expresión génica. | +| organism | El organismo/especie: aquí todos los datos provienen de ratones.
| +| age | La edad del ratón (aquí todos los ratones tenían 8 semanas) | +| sex | El sexo del ratón | +| infection | El estado de infección del ratón, es decir, infectado con Influenza A o no infectado. | +| strain | La cepa de Influenza A. | +| time | La duración de la infección (en días). | +| tissue | El tejido que se utilizó para el experimento de expresión génica, es decir, el cerebelo o la médula espinal. | +| mouse | El identificador único del ratón. | + +Usaremos la función de R `download.file()` para descargar el +archivo CSV que contiene los datos de expresión génica, y usaremos +`read.csv()` para cargar en memoria el contenido del archivo CSV como un +objeto de clase `data.frame`. Dentro del comando `download.file`, la primera entrada +es una cadena de caracteres con la URL de origen. Esta URL de origen +descarga un archivo CSV desde un repositorio de GitHub. El texto después de la coma +(`"data/rnaseq.csv"`) es el destino del archivo en su máquina +local. Necesitará tener una carpeta en su máquina llamada +`"data"` donde descargará el archivo. Entonces, este comando descarga el archivo +remoto, lo llama `"rnaseq.csv"` y lo agrega a una carpeta +preexistente llamada `"data"`. ```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") + destfile = "data/rnaseq.csv") ``` -You are now ready to load the data: +Ahora está listo para cargar los datos: ```{r, eval=TRUE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -This statement doesn't produce any output because, as you might -recall, assignments don't display anything. If we want to check that -our data has been loaded, we can see the contents of the data frame by -typing its name: +Esta instrucción no produce ningún resultado porque, como +recordarás, las asignaciones no muestran nada.
Si queremos comprobar que +nuestros datos han sido cargados, podemos ver el contenido del marco de datos +escribiendo su nombre: ```{r, eval=FALSE} -rna +rna ``` -Wow... that was a lot of output. At least it means the data loaded -properly. Let's check the top (the first 6 lines) of this data frame -using the function `head()`: +Vaya... eso fue mucha salida. Al menos significa que los datos se cargaron +correctamente. Revisemos la parte superior (las primeras 6 líneas) de este marco de datos +usando la función `head()`: ```{r, purl=TRUE} head(rna) -## Try also -## View(rna) -``` - -**Note** - -`read.csv()` assumes that fields are delineated by commas, however, in -several countries, the comma is used as a decimal separator and the -semicolon (;) is used as a field delineator. If you want to read in -this type of files in R, you can use the `read.csv2()` function. It -behaves exactly like `read.csv()` but uses different parameters for -the decimal and the field separators. If you are working with another -format, they can be both specified by the user. Check out the help for -`read.csv()` by typing `?read.csv` to learn more. There is also the -`read.delim()` function for reading tab separated data files. It is important to -note that all of these functions are actually wrapper functions for -the main `read.table()` function with different arguments. As such, -the data above could have also been loaded by using `read.table()` -with the separation argument as `,`. The code is as follows: +## Prueba también +## View(rna) +``` + +**Nota** + +`read.csv()` asume que los campos están delimitados por comas; sin embargo, en +varios países, la coma se usa como separador decimal y el +punto y coma (;) se usa como delimitador de campo. Si desea leer +este tipo de archivos en R, puede usar la función `read.csv2()`. +Se comporta exactamente como `read.csv()` pero usa diferentes parámetros para +los separadores decimales y de campo.
Si está trabajando con otro +formato, el usuario puede especificar ambos. Consulte la ayuda para +`read.csv()` escribiendo `?read.csv` para obtener más información. También existe la función +`read.delim()` para leer archivos de datos separados por tabulaciones. Es importante +tener en cuenta que todas estas funciones son en realidad funciones contenedoras para +la función principal `read.table()` con diferentes argumentos. Como tal, +los datos anteriores también podrían haberse cargado usando `read.table()` +con el argumento de separación como `,`. El código es el siguiente: ```{r, eval=TRUE, purl=TRUE} -rna <- read.table(file = "data/rnaseq.csv", +rna <- read.table(file = "data/rnaseq.csv", sep = ",", - header = TRUE) + header = TRUE) ``` -The header argument has to be set to TRUE to be able to read the -headers as by default `read.table()` has the header argument set to -FALSE. +El argumento `header` debe establecerse en `TRUE` para poder leer los encabezados +ya que, de forma predeterminada, `read.table()` tiene el argumento `header` establecido en +`FALSE`. -## What are data frames? +## ¿Qué son los marcos de datos? -Data frames are the _de facto_ data structure for most tabular data, -and what we use for statistics and plotting. +Los marcos de datos son la estructura de datos _de facto_ para la mayoría de los datos tabulares, +y lo que usamos para estadísticas y gráficos. -A data frame can be created by hand, but most commonly they are -generated by the functions `read.csv()` or `read.table()`; in other -words, when importing spreadsheets from your hard drive (or the web). +Un marco de datos se puede crear a mano, pero lo más común es que +se genere mediante las funciones `read.csv()` o `read.table()`; en otras +palabras, al importar hojas de cálculo desde su disco duro (o la web). -A data frame is the representation of data in the format of a table -where the columns are vectors that all have the same length.
Because -columns are vectors, each column must contain a single type of data -(e.g., characters, integers, factors). For example, here is a figure -depicting a data frame comprising a numeric, a character, and a -logical vector. +Un marco de datos es la representación de datos en el formato de una tabla +donde las columnas son vectores que tienen la misma longitud. Debido a que las columnas +son vectores, cada columna debe contener un único tipo de datos +(por ejemplo, caracteres, números enteros, factores). Por ejemplo, aquí hay una figura +que representa un marco de datos que comprende un vector numérico, uno de caracteres y uno +lógico. ![](./fig/data-frame.svg) -We can see this when inspecting the <b>str</b>ucture of a data frame -with the function `str()`: +Podemos ver esto al inspeccionar la e<b>str</b>uctura de un marco de datos +con la función `str()`: ```{r} -str(rna) +str(rna) ``` -## Inspecting `data.frame` Objects +## Inspeccionando objetos `data.frame` -We already saw how the functions `head()` and `str()` can be useful to -check the content and the structure of a data frame. Here is a -non-exhaustive list of functions to get a sense of the -content/structure of the data. Let's try them out! +Ya vimos cómo las funciones `head()` y `str()` pueden ser útiles para +comprobar el contenido y la estructura de un marco de datos. Aquí hay una +lista no exhaustiva de funciones para tener una idea del +contenido/estructura de los datos. ¡Probémoslas! -**Size**: +**Tamaño**: -- `dim(rna)` - returns a vector with the number of rows as the first - element, and the number of columns as the second element (the - **dim**ensions of the object). -- `nrow(rna)` - returns the number of rows. -- `ncol(rna)` - returns the number of columns. +- `dim(rna)` - devuelve un vector con el número de filas como el primer elemento + y el número de columnas como el segundo elemento (las + **dim**ensiones del objeto). +- `nrow(rna)` - devuelve el número de filas.
+- `ncol(rna)` - devuelve el número de columnas. -**Content**: +**Contenido**: -- `head(rna)` - shows the first 6 rows. -- `tail(rna)` - shows the last 6 rows. +- `head(rna)` - muestra las primeras 6 filas. +- `tail(rna)` - muestra las últimas 6 filas. -**Names**: +**Nombres**: -- `names(rna)` - returns the column names (synonym of `colnames()` for - `data.frame` objects). -- `rownames(rna)` - returns the row names. +- `names(rna)` - devuelve los nombres de las columnas (sinónimo de `colnames()` para + objetos `data.frame`). +- `rownames(rna)` - devuelve los nombres de las filas. -**Summary**: +**Resumen**: -- `str(rna)` - structure of the object and information about the - class, length and content of each column. -- `summary(rna)` - summary statistics for each column. +- `str(rna)` - estructura del objeto e información sobre la + clase, longitud y contenido de cada columna. +- `summary(rna)` - estadísticas de resumen para cada columna. -Note: most of these functions are "generic", they can be used on other types of -objects besides `data.frame`. +Nota: la mayoría de estas funciones son "genéricas"; se pueden usar en otros tipos de objetos +además de `data.frame`. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Based on the output of `str(rna)`, can you answer the following -questions? +Según el resultado de `str(rna)`, ¿puedes responder las siguientes +preguntas? -- What is the class of the object `rna`? -- How many rows and how many columns are in this object? +- ¿Cuál es la clase del objeto `rna`? +- ¿Cuántas filas y cuántas columnas hay en este objeto?
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -- class: data frame -- how many rows: 66465, how many columns: 11 +- clase: marco de datos +- cuántas filas: 66465, cuántas columnas: 11 -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Indexing and subsetting data frames +## Indexación y subconjuntos de marcos de datos -Our `rna` data frame has rows and columns (it has 2 dimensions); if we -want to extract some specific data from it, we need to specify the -"coordinates" we want. Row numbers come first, followed by -column numbers. However, note that different ways of specifying these -coordinates lead to results with different classes. +Nuestro marco de datos `rna` tiene filas y columnas (tiene 2 dimensiones); si +queremos extraer algunos datos específicos de él, debemos especificar las +"coordenadas" que queremos. Los números de fila van primero, seguidos de +los números de columna. Sin embargo, tenga en cuenta que diferentes formas de especificar estas coordenadas +conducen a resultados con diferentes clases.
```{r, eval=FALSE, purl=TRUE} -# first element in the first column of the data frame (as a vector) +# primer elemento en la primera columna del marco de datos (como vector) rna[1, 1] -# first element in the 6th column (as a vector) -rna[1, 6] -# first column of the data frame (as a vector) +# primer elemento en la sexta columna (como vector) +rna[1, 6] +# primera columna del marco de datos (como un vector) rna[, 1] -# first column of the data frame (as a data.frame) +# primera columna del marco de datos (como un data.frame) rna[1] -# first three elements in the 7th column (as a vector) +# primeros tres elementos en la séptima columna (como un vector) rna[1:3, 7] -# the 3rd row of the data frame (as a data.frame) +# la tercera fila del marco de datos (como un data.frame) rna[3, ] -# equivalent to head_rna <- head(rna) +# equivalente a head_rna <- head(rna) head_rna <- rna[1:6, ] -head_rna +head_rna ``` -`:` is a special function that creates numeric vectors of integers in -increasing or decreasing order, test `1:10` and `10:1` for -instance. See section @ref(sec:genvec) for details. +`:` es una función especial que crea vectores numéricos de números enteros en +orden creciente o decreciente; pruebe `1:10` y `10:1`, por +ejemplo. Consulte la sección @ref(sec:genvec) para obtener más detalles.
-You can also exclude certain indices of a data frame using the "`-`" sign: +También puedes excluir ciertos índices de un marco de datos usando el signo "`-`": ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## The whole data frame, except the first column -rna[-c(7:66465), ] ## Equivalent to head(rna) +rna[, -1] ## Todo el marco de datos, excepto la primera columna +rna[-c(7:66465), ] ## Equivalente a head(rna) ``` -Data frames can be subsetted by calling indices (as shown previously), -but also by calling their column names directly: +Los marcos de datos se pueden dividir en subconjuntos mediante índices (como se mostró anteriormente), +pero también llamando directamente a sus nombres de columnas: ```{r, eval=FALSE, purl=TRUE} rna["gene"] # Result is a data.frame @@ -256,119 +256,119 @@ rna[["gene"]] # Result is a vector rna$gene # Result is a vector ``` -In RStudio, you can use the autocompletion feature to get the full and -correct names of the columns. +En RStudio, puede utilizar la función de autocompletar para obtener los nombres completos y +correctos de las columnas. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -1. Create a `data.frame` (`rna_200`) containing only the data in - row 200 of the `rna` dataset. +1. Cree un `data.frame` (`rna_200`) que contenga solo los datos de la + fila 200 del conjunto de datos `rna`. -2. Notice how `nrow()` gave you the number of rows in a `data.frame`? +2. ¿Observó cómo `nrow()` le dio el número de filas en un `data.frame`? -- Use that number to pull out just that last row in the initial - `rna` data frame. +- Use ese número para extraer solo la última fila en el marco de datos inicial + `rna`. -- Compare that with what you see as the last row using `tail()` to - make sure it's meeting expectations. +- Compare eso con lo que ve como la última fila usando `tail()` para + asegurarse de que cumpla con las expectativas.
-- Pull out that last row using `nrow()` instead of the row number. +- Extraiga la última fila usando `nrow()` en lugar del número de fila. -- Create a new data frame (`rna_last`) from that last row. +- Cree un nuevo marco de datos (`rna_last`) a partir de esa última fila. -3. Use `nrow()` to extract the row that is in the middle of the - `rna` dataframe. Store the content of this row in an object - named `rna_middle`. +3. Utilice `nrow()` para extraer la fila que está en el medio del marco de datos + `rna`. Almacene el contenido de esta fila en un objeto + llamado `rna_middle`. -4. Combine `nrow()` with the `-` notation above to reproduce the - behavior of `head(rna)`, keeping just the first through 6th - rows of the rna dataset. +4. Combine `nrow()` con la notación `-` anterior para reproducir el comportamiento + de `head(rna)`, manteniendo solo las filas primera a sexta + del conjunto de datos `rna`. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} ## 1. rna_200 <- rna[200, ] ## 2. -## Saving `n_rows` to improve readability and reduce duplication -n_rows <- nrow(rna) +## Guardando `n_rows` para mejorar la legibilidad y reducir la duplicación +n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. rna_middle <- rna[n_rows / 2, ] -## 4. +## 4. rna_head <- rna[-(7:n_rows), ] ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Factors +## Factores -Factors represent **categorical data**. They are stored as integers -associated with labels and they can be ordered or unordered. While -factors look (and often behave) like character vectors, they are -actually treated as integer vectors by R. So you need to be very -careful when treating them as strings. +Los factores representan **datos categóricos**. Se almacenan como números enteros +asociados con etiquetas y pueden estar ordenados o desordenados.
Si bien los factores +parecen (y a menudo se comportan) como vectores de caracteres, en realidad R los trata +como vectores enteros. Por lo tanto, debe tener mucho cuidado +al tratarlos como cadenas. -Once created, factors can only contain a pre-defined set of values, -known as _levels_. By default, R always sorts levels in alphabetical -order. For instance, if you have a factor with 2 levels: +Una vez creados, los factores solo pueden contener un conjunto predefinido de valores, +conocidos como _niveles_. De forma predeterminada, R siempre ordena los niveles en orden +alfabético. Por ejemplo, si tienes un factor con 2 niveles: ```{r, purl=TRUE} -sex <- factor(c("male", "female", "female", "male", "female")) +sex <- factor(c("male", "female", "female", "male", "female")) ``` R will assign `1` to the level `"female"` and `2` to the level `"male"` (because `f` comes before `m`, even though the first element -in this vector is `"male"`). You can see this by using the function -`levels()` and you can find the number of levels using `nlevels()`: +in this vector is `"male"`). Puedes ver esto usando la función +`levels()` y puedes encontrar el número de niveles usando `nlevels()`: ```{r, purl=TRUE} -levels(sex) -nlevels(sex) +levels(sex) +nlevels(sex) ``` -Sometimes, the order of the factors does not matter, other times you -might want to specify the order because it is meaningful (e.g., "low", -"medium", "high"), it improves your visualization, or it is required -by a particular type of analysis. Here, one way to reorder our levels -in the `sex` vector would be: +A veces, el orden de los factores no importa; otras veces +es posible que desee especificar el orden porque es significativo (por ejemplo, "bajo", +"medio", "alto"), mejora su visualización, o es requerido +por un tipo particular de análisis.
Aquí, una forma de reordenar nuestros niveles +en el vector `sex` sería: ```{r, purl=TRUE} -sex ## current order -sex <- factor(sex, levels = c("male", "female")) -sex ## after re-ordering +sex ## orden actual +sex <- factor(sex, levels = c("male", "female")) +sex ## después de reordenar ``` -In R's memory, these factors are represented by integers (1, 2, 3), -but are more informative than integers because factors are self -describing: `"female"`, `"male"` is more descriptive than `1`, -`2`. Which one is "male"? You wouldn't be able to tell just from the -integer data. Factors, on the other hand, have this information built-in. -It is particularly helpful when there are many levels (like the -gene biotype in our example dataset). +En la memoria de R, estos factores están representados por números enteros (1, 2, 3), +pero son más informativos que los números enteros porque los factores son +autodescriptivos: `"female"`, `"male"` es más descriptivo que `1`, +`2`. ¿Cuál es "male"? No podrías saberlo solo por los datos +enteros. Los factores, por otro lado, tienen esta información incorporada. +Es particularmente útil cuando hay muchos niveles (como el biotipo del gen +en nuestro conjunto de datos de ejemplo). -When your data is stored as a factor, you can use the `plot()` -function to get a quick glance at the number of observations -represented by each factor level. Let's look at the number of males -and females in our data. +Cuando sus datos se almacenan como un factor, puede usar la función `plot()` +para obtener un vistazo rápido al número de observaciones +representadas por cada nivel de factor. Veamos la cantidad de machos +y hembras en nuestros datos.
```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -plot(sex) +plot(sex) ``` -### Converting to character +### Conversión a caracteres -If you need to convert a factor to a character vector, you use +Si necesita convertir un factor en un vector de caracteres, utilice `as.character(x)`. ```{r, purl=TRUE} -as.character(sex) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -409,45 +409,45 @@ as.character(sex) <!-- vector `year_fct` inside the square brackets --> -### Renaming factors +### Renombrar factores -If we want to rename these factor, it is sufficient to change its -levels: +Si queremos cambiar el nombre de estos factores, basta con cambiar sus +niveles: ```{r, purl=TRUE} -levels(sex) -levels(sex) <- c("M", "F") -sex -plot(sex) +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) ``` -:::::::::::::::::::::::::::::::::::::: challenge +:::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -- Rename "F" and "M" to "Female" and "Male" respectively. +- Cambie el nombre de "F" y "M" a "Female" y "Male" respectivamente. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, eval=TRUE, purl=TRUE} -levels(sex) -levels(sex) <- c("Male", "Female") +levels(sex) +levels(sex) <- c("Male", "Female") ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -We have seen how data frames are created when using `read.csv()`, but -they can also be created by hand with the `data.frame()` function. -There are a few mistakes in this hand-crafted `data.frame`. Can you -spot and fix them? Don't hesitate to experiment!
+Hemos visto cómo se crean los marcos de datos cuando se usa `read.csv()`, pero +también se pueden crear a mano con la función `data.frame()`. +Hay algunos errores en este `data.frame` hecho a mano. ¿Puedes +detectarlos y solucionarlos? ¡No dudes en experimentar! ```{r, eval=FALSE} animal_data <- data.frame( @@ -456,318 +456,318 @@ animal_data <- data.frame( weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -- missing quotations around the names of the animals -- missing one entry in the "feel" column (probably for one of the furry animals) -- missing one comma in the weight column +- faltan comillas alrededor de los nombres de los animales +- falta una entrada en la columna "feel" (probablemente para uno de los animales peludos) +- falta una coma en la columna `weight` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Can you predict the class for each of the columns in the following -example? +¿Puedes predecir la clase de cada una de las columnas en el siguiente +ejemplo? -Check your guesses using `str(country_climate)`: +Comprueba tus conjeturas usando `str(country_climate)`: -- Are they what you expected? Why? Why not? +- ¿Son lo que esperabas? ¿Por qué? ¿Por qué no? -- Try again by adding `stringsAsFactors = TRUE` after the last - variable when creating the data frame. What is happening now? - `stringsAsFactors` can also be set when reading text-based - spreadsheets into R using `read.csv()`. +- Intente nuevamente agregando `stringsAsFactors = TRUE` después de la última variable + al crear el marco de datos. ¿Qué está pasando ahora? + `stringsAsFactors` también se puede configurar al leer + hojas de cálculo basadas en texto en R usando `read.csv()`.
```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), - northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), - has_kangaroo = c(FALSE, FALSE, FALSE, 1) - ) + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) ``` -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), - northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), - has_kangaroo = c(FALSE, FALSE, FALSE, 1) + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) -str(country_climate) +str(country_climate) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: + +La conversión automática de tipos de datos es a veces una bendición, a veces una +molestia. Tenga en cuenta que existe, aprenda las reglas y verifique que los datos +que importe en R sean del tipo correcto dentro de su marco de datos. De lo contrario, úsela +a su favor para detectar errores que podrían haberse introducido durante la entrada de datos +(una letra en una columna que solo debe contener números, por ejemplo). -The automatic conversion of data type is sometimes a blessing, sometimes an -annoyance.
Be aware that it exists, learn the rules, and double check that data -you import in R are of the correct type within your data frame. If not, use it -to your advantage to detect mistakes that might have been introduced during data -entry (a letter in a column that should only contain numbers for instance). +Obtenga más información en este tutorial de RStudio -Learn more in this RStudio -tutorial -## Matrices +## Matrices -Before proceeding, now that we have learnt about data frames, let's -recap package installation and learn about a new data type, namely the -`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and -columns. But the major difference is that all cells in a `matrix` must -be of the same type: `numeric`, `character`, `logical`, ... In that -respect, matrices are closer to a `vector` than a `data.frame`. +Antes de continuar, ahora que hemos aprendido sobre los marcos de datos, recapitulemos +la instalación de paquetes y aprendamos sobre un nuevo tipo de datos, a saber, la +`matrix`. Al igual que un `data.frame`, una matriz tiene dos dimensiones, filas y +columnas. Pero la principal diferencia es que todas las celdas de una `matrix` deben +ser del mismo tipo: `numeric`, `character`, `logical`, ... En ese +aspecto, las matrices están más cerca de un `vector` que de un `data.frame`. -The default constructor for a matrix is `matrix`. It takes a vector of -values to populate the matrix and the number of row and/or -columns[^ncol]. The values are sorted along the columns, as illustrated -below. +El constructor predeterminado para una matriz es `matrix`. Toma un vector de +valores para poblar la matriz y el número de filas y/o +columnas[^ncol]. Los valores se ordenan a lo largo de las columnas, como se ilustra +a continuación.
```{r mat1, purl=TRUE} -m <- matrix(1:9, ncol = 3, nrow = 3) +m <- matrix(1:9, ncol = 3, nrow = 3) m ``` -[^ncol]: Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. +[^ncol]: Basta con el número de filas o el de columnas, ya que el otro se puede deducir de la longitud de los valores. Pruebe qué sucede si los valores y el número de filas/columnas no cuadran. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Using the function `installed.packages()`, create a `character` matrix -containing the information about all packages currently installed on -your computer. Explore it. +Usando la función `installed.packages()`, cree una matriz de tipo `character` +que contenga la información sobre todos los paquetes actualmente instalados en +su computadora. Explórela. -::::::::::::::: solution +::::::::::::::: solution -## Solution: +## Solución: ```{r pkg_sln, eval=FALSE, purl=TRUE} -## create the matrix -ip <- installed.packages() +## crea la matriz +ip <- installed.packages() head(ip) -## try also View(ip) -## number of package +## prueba también View(ip) +## número de paquetes nrow(ip) -## names of all installed packages +## nombres de todos los paquetes instalados rownames(ip) -## type of information we have about each package -colnames(ip) +## tipo de información que tenemos sobre cada paquete +colnames(ip) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -It is often useful to create large random data matrices as test -data. The exercise below asks you to create such a matrix with random -data drawn from a normal distribution of mean 0 and standard deviation -1, which can be done with the `rnorm()` function.
+A menudo resulta útil crear grandes matrices de datos aleatorios como datos de +prueba. El siguiente ejercicio le pide que cree dicha matriz con datos aleatorios +extraídos de una distribución normal de media 0 y desviación estándar +1, lo cual se puede hacer con la función `rnorm()`. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Desafío: -Construct a matrix of dimension 1000 by 3 of normally distributed data -(mean 0, standard deviation 1) +Construya una matriz de dimensión 1000 por 3 de datos distribuidos normalmente +(media 0, desviación estándar 1) -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r rnormmat_sln, purl=TRUE} set.seed(123) -m <- matrix(rnorm(3000), ncol = 3) +m <- matrix(rnorm(3000), ncol = 3) dim(m) -head(m) +head(m) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: + +## Formato de fechas + +Uno de los problemas más comunes que tienen los usuarios de R nuevos (¡y experimentados!) +es convertir la información de fecha y hora en una variable que sea +apropiada y utilizable durante los análisis. + +### Nota sobre fechas en programas de hojas de cálculo -## Formatting Dates - -One of the most common issues that new (and experienced!) R users have -is converting date and time information into a variable that is -appropriate and usable during analyses. - -### Note on dates in spreadsheet programs - -Dates in spreadsheets are generally stored in a single column. While -this seems the most natural way to record dates, it actually is not -best practice. A spreadsheet application will display the dates in a -seemingly correct way (to a human observer) but how it actually -handles and stores the dates may be problematic.
It is often much -safer to store dates with YEAR, MONTH and DAY in separate columns or -as YEAR and DAY-OF-YEAR in separate columns. - -Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, -Gnumeric, ... have different (and often incompatible) ways of encoding -dates (even for the same program between versions and operating -systems). Additionally, Excel can turn things that aren't dates into -dates -(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, -OCT4. So if you're avoiding the date format overall, it's easier to -identify these issues. - -The Dates as -data -section of the Data Carpentry lesson provides additional insights -about pitfalls of dates with spreadsheets. - -We are going to use the `ymd()` function from the package -**`lubridate`** (which belongs to the **`tidyverse`**; learn more -[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed -as part of the **`tidyverse`** installation. When you load the -**`tidyverse`** (`library(tidyverse)`), the core packages (the -packages used in most data analyses) get loaded. **`lubridate`** -however does not belong to the core tidyverse, so you have to load it -explicitly with `library(lubridate)`. - -Start by loading the required package: +Las fechas en las hojas de cálculo generalmente se almacenan en una sola columna. Si bien +esta parece la forma más natural de registrar fechas, en realidad no es +la mejor práctica. Una aplicación de hoja de cálculo mostrará las fechas de una manera +aparentemente correcta (para un observador humano), pero la forma en que realmente +maneja y almacena las fechas puede ser problemática. A menudo es mucho más seguro +almacenar fechas con AÑO, MES y DÍA en columnas separadas o +como AÑO y DÍA DEL AÑO en columnas separadas. + +Programas de hojas de cálculo como LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric,... 
tienen formas diferentes (y a menudo incompatibles) de codificar +fechas (incluso para el mismo programa entre versiones y sistemas operativos +). Además, Excel puede [convertir cosas que no son fechas en +fechas](https://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not -aprendido/) +(@Zeeberg:2004), por ejemplo nombres o identificadores como MAR1, DEC1, +OCT4. Entonces, si evitas el formato de fecha en general, es más fácil +identificar estos problemas. + +La sección Fechas como +datos +de la lección Carpintería de datos proporciona Ideas adicionales +sobre los peligros de las fechas con hojas de cálculo. + +Vamos a utilizar la función `ymd()` del paquete +**`lubridate`** (que pertenece al **`tidyverse`**; aprende más +[aquí] (https://www.tidyverse.org/)). . **`lubridate`** se instala +como parte de la instalación de **`tidyverse`**. Cuando cargas +**`tidyverse`** (`library(tidyverse)`), los paquetes principales (los +paquetes utilizados en la mayoría de los análisis de datos) se cargan. **`lubridate`** +sin embargo no pertenece al tidyverse principal, por lo que debes cargarlo +explícitamente con `library(lubridate)`. + +Comience cargando el paquete requerido: ```{r loadlibridate, message=FALSE, purl=TRUE} -library("lubridate") +biblioteca("lubricar") ``` -`ymd()` takes a vector representing year, month, and day, and converts -it to a `Date` vector. `Date` is a class of data recognized by R as -being a date and can be manipulated as such. The argument that the -function requires is flexible, but, as a best practice, is a character -vector formatted as "YYYY-MM-DD". +`ymd()` toma un vector que representa año, mes y día, y lo convierte +en un vector `Date`. `Date` es una clase de datos reconocida por R como +siendo una fecha y puede manipularse como tal. El argumento que requiere la función +es flexible, pero, como práctica recomendada, es un vector de caracteres +con el formato "AAAA-MM-DD". 
-Let's create a date object and inspect the structure: +Creemos un objeto de fecha e inspeccionemos la estructura: ```{r, purl=TRUE} -my_date <- ymd("2015-01-01") -str(my_date) +mi_fecha <- ymd("2015-01-01") +str(mi_fecha) ``` -Now let's paste the year, month, and day separately - we get the same result: +Ahora peguemos el año, el mes y el día por separado; obtenemos el mismo resultado: ```{r, purl=TRUE} -# sep indicates the character to use to separate each component +# sep indica el carácter a utilizar para separar cada componente my_date <- ymd(paste("2015", "1", "1", sep = "-")) -str(my_date) +str(my_date ) ``` -Let's now familiarise ourselves with a typical date manipulation -pipeline. The small data below has stored dates in different `year`, -`month` and `day` columns. +Familiaricémonos ahora con una canalización típica de manipulación de fechas +. Los pequeños datos a continuación han almacenado fechas en diferentes columnas "año", +"mes" y "día". ```{r, purl=TRUE} -x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), - day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x <- data.frame(año = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + mes = c(2, 3, 3, 10, 1 , 8, 3, 4, 5, 5), + día = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + valor = c (4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` -Now we apply this function to the `x` dataset. We first create a -character vector from the `year`, `month`, and `day` columns of `x` -using `paste()`: +Ahora aplicamos esta función al conjunto de datos `x`. 
Primero creamos un vector de caracteres +a partir de las columnas `año`, `mes` y `día` de `x` +usando `paste()`: ```{r, purl=TRUE} -paste(x$year, x$month, x$day, sep = "-") +pegar(x$year, x$month, x$day, sep = "-") ``` -This character vector can be used as the argument for `ymd()`: +Este vector de caracteres se puede utilizar como argumento para `ymd()`: ```{r, purl=TRUE} -ymd(paste(x$year, x$month, x$day, sep = "-")) +ymd(pegar(x$year, x$month, x$day, sep = "-")) ``` -The resulting `Date` vector can be added to `x` as a new column called `date`: +El vector "Fecha" resultante se puede agregar a "x" como una nueva columna llamada "fecha": ```{r, purl=TRUE} -x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) -str(x) # notice the new column, with 'date' as the class +x$date <- ymd(pegar(x$year, x$month, x$day, sep = "-")) +str(x) # observe la nueva columna, con 'fecha' como clase ``` -Let's make sure everything worked correctly. One way to inspect the -new column is to use `summary()`: +Asegurémonos de que todo funcionó correctamente. Una forma de inspeccionar la +nueva columna es usar `summary()`: ```{r, purl=TRUE} -summary(x$date) +resumen(x$date) ``` -Note that `ymd()` expects to have the year, month and day, in that -order. If you have for instance day, month and year, you would need +Tenga en cuenta que `ymd()` espera tener el año, mes y día, en ese orden +. Si tiene, por ejemplo, día, mes y año, necesitará `dmy()`. ```{r, purl=TRUE} -dmy(paste(x$day, x$month, x$year, sep = "-")) +dmy(pegar(x$day, x$month, x$year, sep = "-")) ``` -`lubdridate` has many functions to address all date variations. +`lubdridate` tiene muchas funciones para abordar todas las variaciones de fechas. 
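
For reference, the date-building steps in this hunk can be sketched as below. R only recognises the original English identifiers, so the function names (`paste()`, `ymd()`, `dmy()`) and the column names (`year`, `month`, `day`) are left untranslated. This is a minimal sketch that assumes the **`lubridate`** package is installed; the three-row data frame `x_demo` and the object `same_dates` are hypothetical stand-ins, not part of the lesson.

```r
library("lubridate")

# Hypothetical toy data with the date parts stored in separate columns
x_demo <- data.frame(year  = c(1996, 1992, 1987),
                     month = c(2, 3, 3),
                     day   = c(24, 8, 1))

# Build "year-month-day" strings, then parse them into a Date vector
x_demo$date <- ymd(paste(x_demo$year, x_demo$month, x_demo$day, sep = "-"))
class(x_demo$date)

# The same dates written in day-month-year order need dmy() instead of ymd()
same_dates <- dmy(paste(x_demo$day, x_demo$month, x_demo$year, sep = "-"))
all(x_demo$date == same_dates)
```

Choosing the parser by the order of the components (`ymd()`, `dmy()`, and so on) keeps the conversion explicit, which is the point the surrounding text makes about ordering.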
-## Summary of R objects +## Resumen de objetos R -So far, we have seen several types of R object varying in the number -of dimensions and whether they could store a single or multiple data -types: +Hasta ahora, hemos visto varios tipos de objetos R que varían en el número +de dimensiones y si pueden almacenar uno o varios tipos de datos +: -- **`vector`**: one dimension (they have a length), single type of data. -- **`matrix`**: two dimensions, single type of data. -- **`data.frame`**: two dimensions, one type per column. +- **`vector`**: una dimensión (tienen una longitud), un solo tipo de datos. +- **`matriz`**: dos dimensiones, un solo tipo de datos. +- **`data.frame`**: dos dimensiones, un tipo por columna. -## Lists +## Liza -A data type that we haven't seen yet, but that is useful to know, and -follows from the summary that we have just seen are lists: +Un tipo de datos que aún no hemos visto, pero que es útil conocer, y +se desprende del resumen que acabamos de ver son listas: -- **`list`**: one dimension, every item can be of a different data - type. +- **`lista`**: una dimensión, cada elemento puede ser de un tipo de datos + diferente. -Below, let's create a list containing a vector of numbers, characters, -a matrix, a dataframe and another list: +A continuación, creemos una lista que contiene un vector de números, caracteres, +una matriz, un marco de datos y otra lista: ```{r list0, purl=TRUE} -l <- list(1:10, ## numeric - letters, ## character - installed.packages(), ## a matrix - cars, ## a data.frame - list(1, 2, 3)) ## a list -length(l) +l <- lista (1:10, ## numérico + letras, ## carácter + paquetes.instalados(), ## una matriz + autos, ## un marco.de.datos + lista(1, 2, 3)) ## una lista +longitud(l) str(l) ``` -List subsetting is done using `[]` to subset a new sub-list or `[[]]` -to extract a single element of that list (using indices or names, if -the list is named). 
+El subconjunto de listas se realiza usando `[]` para crear subconjuntos de una nueva sublista o `[[]]` +para extraer un solo elemento de esa lista (usando índices o nombres, si +la lista es nombrado). ```{r, purl=TRUE} -l[[1]] ## first element -l[1:2] ## a list of length 2 -l[1] ## a list of length 1 +l[[1]] ## primer elemento +l[1:2] ## una lista de longitud 2 +l[1] ## una lista de longitud 1 ``` -## Exporting and saving tabular data {#sec:exportandsave} +## Exportar y guardar datos tabulares {#sec:exportandsave} -We have seen how to read a text-based spreadsheet into R using the -`read.table` family of functions. To export a `data.frame` to a -text-based spreadsheet, we can use the `write.table` set of functions -(`write.csv`, `write.delim`, ...). They all take the variable to be -exported and the file to be exported to. For example, to export the -`rna` data to the `my_rna.csv` file in the `data_output` -directory, we would execute: +Hemos visto cómo leer una hoja de cálculo basada en texto en R usando la familia de funciones +`read.table`. Para exportar un `data.frame` a una +hoja de cálculo basada en texto, podemos usar el conjunto de funciones `write.table` +(`write.csv`, `write.delim`, ...). Todos toman la variable que se exportará +y el archivo al que se exportará. Por ejemplo, para exportar los datos +`rna` al archivo `my_rna.csv` en el directorio `data_output` +, ejecutaríamos: ```{r, eval=FALSE, purl=TRUE} -write.csv(rna, file = "data_output/my_rna.csv") +write.csv(rna, archivo = "data_output/my_rna.csv") ``` This new csv file can now be shared with other collaborators who @@ -777,8 +777,8 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. 
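
The export step described above can be sketched as a small round trip in base R. The argument name is `file` (R does not accept a translated name), and character fields are quoted on output by default, which is why a value containing a comma survives. The data frame `df_demo`, its values, and the temporary file are hypothetical and used only for illustration.

```r
# Hypothetical toy data; one field deliberately contains a comma
df_demo <- data.frame(id = c("a", "b"),
                      note = c("plain text", "text, with a comma"),
                      stringsAsFactors = FALSE)

# Write to a temporary file rather than a real output directory
out_file <- tempfile(fileext = ".csv")
write.csv(df_demo, file = out_file, row.names = FALSE)

# Read the file back; the quoted field keeps its embedded comma
df_back <- read.csv(out_file, stringsAsFactors = FALSE)
df_back$note
```

Setting `row.names = FALSE` is a choice made for this sketch so that the file read back has the same columns as the original; the lesson's own call does not set it.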
-:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: puntos clave -- Tabular data in R +- Datos tabulares en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From 16954ec821bbdbfa8a04361d4a9d9d5cd40c2093 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:17 +0900 Subject: [PATCH 172/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 992 ++++++++++++++++---------------- 1 file changed, 496 insertions(+), 496 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index b2360ec26..fd1d5da6e 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Manipulating and analysing data with dplyr +title: Manipular y analizar datos con dplyr teaching: 75 exercises: 75 --- @@ -10,83 +10,83 @@ exercises: 75 ::::::::::::::::::::::::::::::::::::::: objetivos -- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. -- Describe several of their functions that are extremely useful to - manipulate data. -- Describe the concept of a wide and a long table format, and see - how to reshape a data frame from one format to the other one. -- Demonstrate how to join tables. +- Describe el propósito de los paquetes **`dplyr`** y **`tidyr`**. +- Describe varias de sus funciones que son extremadamente útiles para + manipular datos. +- Describa el concepto de formato de tabla ancho y largo, y vea + cómo remodelar un marco de datos de un formato a otro. +- Demuestre cómo unir tablas. 
-:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: preguntas -- Data analysis in R using the tidyverse meta-package +- Análisis de datos en R utilizando el metapaquete tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/ datos/rnaseq.csv", + destfile = "datos/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Este episodio se basa en la lección _Análisis de datos y +> Visualización en R para ecologistas_ de Data Carpentries. -## Data manipulation using **`dplyr`** and **`tidyr`** +## Manipulación de datos usando **`dplyr`** y **`tidyr`** -Bracket subsetting is handy, but it can be cumbersome and difficult to -read, especially for complicated operations. +El subconjunto de corchetes es útil, pero puede resultar engorroso y difícil de +leer, especialmente para operaciones complicadas. -Some packages can greatly facilitate our task when we manipulate data. -Packages in R are basically sets of additional functions that let you -do more stuff. The functions we've been using so far, like `str()` or -`data.frame()`, come built into R; Loading packages can give you access to other -specific functions. Before you use a package for the first time you need to install -it on your machine, and then you should import it in every subsequent -R session when you need it. +Algunos paquetes pueden facilitarnos mucho la tarea a la hora de manipular datos. 
+Los paquetes en R son básicamente conjuntos de funciones adicionales que te permiten +hacer más cosas. Las funciones que hemos estado usando hasta ahora, como `str()` o +`data.frame()`, vienen integradas en R; Cargar paquetes puede darle acceso a otras +funciones específicas. Antes de usar un paquete por primera vez, necesita instalarlo +en su máquina, y luego debe importarlo en cada sesión posterior de +R cuando lo necesite. -- The package **`dplyr`** provides powerful tools for data manipulation tasks. - It is built to work directly with data frames, with many manipulation tasks - optimised. +- El paquete **`dplyr`** proporciona potentes herramientas para tareas de manipulación de datos. + Está diseñado para trabajar directamente con marcos de datos, con muchas tareas de manipulación + optimizadas. -- As we will see latter on, sometimes we want a data frame to be reshaped to be able - to do some specific analyses or for visualisation. The package **`tidyr`** addresses - this common problem of reshaping data and provides tools for manipulating - data in a tidy way. +- Como veremos más adelante, a veces queremos remodelar un marco de datos para poder + hacer algunos análisis específicos o visualizarlo. El paquete **`tidyr`** aborda + este problema común de remodelar datos y proporciona herramientas para manipular + datos de forma ordenada. -To learn more about **`dplyr`** and **`tidyr`** after the workshop, -you may want to check out this handy data transformation with - -and this one about +Para obtener más información sobre **`dplyr`** y **`tidyr`** después del taller, +quizás quieras consultar esta práctica transformación de datos con +\*\* +y este uno sobre . -- The **`tidyverse`** package is an "umbrella-package" that installs - several useful packages for data analysis which work well together, - such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. - These packages help us to work and interact with the data. 
- They allow us to do many things with your data, such as subsetting, transforming, - visualising, etc. +- El paquete **`tidyverse`** es un "paquete general" que instala + varios paquetes útiles para el análisis de datos que funcionan bien juntos, + como **`tidyr`**, \* \*`dplyr`\*\*, **`ggplot2`**, **`tibble`**, etc. + Estos paquetes nos ayudan a trabajar e interactuar con los datos. + Nos permiten hacer muchas cosas con sus datos, como subconjuntos, transformaciones, + visualización, etc. -If you did the set up, you should have already installed the tidyverse package. -Check to see if you have it by trying to load in from the library: +Si realizó la configuración, ya debería haber instalado el paquete tidyverse. +Comprueba si lo tienes intentando cargarlo desde la biblioteca: ```{r, message=FALSE, purl=TRUE} -## load the tidyverse packages, incl. dplyr -library("tidyverse") +## cargar los paquetes tidyverse, incl. dplyr +biblioteca("tidyverse") ``` -If you got an error message `there is no package called ‘tidyverse’` then you have not -installed the package yet for this version of R. To install the **`tidyverse`** package type: +Si recibió un mensaje de error `no hay ningún paquete llamado 'tidyverse'` entonces +aún no ha instalado el paquete para esta versión de R. Para instalar el tipo de paquete **`tidyverse`**: ```{r, eval=FALSE, purl=TRUE} -BiocManager::install("tidyverse") +BiocManager::instalar("tidyverse") ``` -If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! +Si tuvo que instalar el paquete **`tidyverse`**, ¡no olvide cargarlo en esta sesión de R usando el comando `library()` arriba! 
-## Loading data with tidyverse +## Cargando datos con tidyverse Instead of `read.csv()`, we will read in our data using the `read_csv()` function (notice the `_` instead of the `.`), from the tidyverse package @@ -95,64 +95,64 @@ function (notice the `_` instead of the `.`), from the tidyverse package ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## view the data +## ver los datos rna ``` -Notice that the class of the data is now referred to as a "tibble". +Observe que la clase de datos ahora se denomina "tibble". -Tibbles tweak some of the behaviors of the data frame objects we introduced in the -previously. The data structure is very similar to a data frame. For our purposes -the only differences are that: +Tibbles modifica algunos de los comportamientos de los objetos del marco de datos que presentamos anteriormente en +. La estructura de datos es muy similar a un marco de datos. Para nuestros propósitos +las únicas diferencias son las siguientes: -1. It displays the data type of each column under its name. - Note that \<`dbl`\> is a data type defined to hold numeric values with - decimal points. +1. Muestra el tipo de datos de cada columna debajo de su nombre. + Tenga en cuenta que \<`dbl`\> es un tipo de datos definido para contener valores numéricos con + puntos decimales. -2. It only prints the first few rows of data and only as many columns as fit on - one screen. +2. Solo imprime las primeras filas de datos y solo tantas columnas como quepan en + una pantalla. 
-We are now going to learn some of the most common **`dplyr`** functions: +Ahora vamos a aprender algunas de las funciones **`dplyr`** más comunes: -- `select()`: subset columns -- `filter()`: subset rows on conditions -- `mutate()`: create new columns by using information from other columns -- `group_by()` and `summarise()`: create summary statistics on grouped data -- `arrange()`: sort results -- `count()`: count discrete values +- `select()`: subconjunto de columnas +- `filter()`: subconjunto de filas en condiciones +- `mutate()`: crea nuevas columnas usando información de otras columnas +- `group_by()` y `summarise()`: crean estadísticas resumidas sobre datos agrupados +- `arrange()`: ordenar resultados +- `count()`: cuenta valores discretos -## Selecting columns and filtering rows +## Seleccionar columnas y filtrar filas -To select columns of a data frame, use `select()`. The first argument -to this function is the data frame (`rna`), and the subsequent -arguments are the columns to keep. +Para seleccionar columnas de un marco de datos, use `select()`. El primer argumento +de esta función es el marco de datos (`rna`), y los argumentos +siguientes son las columnas que se deben conservar. ```{r, purl=TRUE} -select(rna, gene, sample, tissue, expression) +seleccionar (arn, gen, muestra, tejido, expresión) ``` -To select all columns _except_ certain ones, put a "-" in front of -the variable to exclude it. +Para seleccionar todas las columnas _excepto_ algunas, coloque un "-" delante de +la variable para excluirla. ```{r, purl=TRUE} -select(rna, -tissue, -organism) +seleccionar (arn, -tejido, -organismo) ``` -This will select all the variables in `rna` except `tissue` -and `organism`. +Esto seleccionará todas las variables en `rna` excepto `tejido` +y `organismo`. 
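
As a runnable illustration of the two `select()` calls in this hunk, here is a minimal sketch that assumes the **`dplyr`** package is installed. Function and column names stay in English because they must match the names in the data. The tibble `rna_demo` and its values are hypothetical stand-ins for the lesson's `rna` table.

```r
library("dplyr")

# Hypothetical stand-in for the lesson's `rna` table
rna_demo <- tibble(gene       = c("g1", "g2"),
                   sample     = c("s1", "s2"),
                   tissue     = c("t1", "t1"),
                   organism   = c("o1", "o1"),
                   expression = c(10, 20))

# Keep only the named columns
kept <- select(rna_demo, gene, sample, tissue, expression)
colnames(kept)

# Drop columns by prefixing their names with "-"
dropped <- select(rna_demo, -tissue, -organism)
colnames(dropped)
```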
-To choose rows based on a specific criteria, use `filter()`: +Para elegir filas según un criterio específico, utilice `filtro()`: ```{r, purl=TRUE} -filter(rna, sex == "Male") -filter(rna, sex == "Male" & infection == "NonInfected") +filter(rna, sexo == "Masculino") +filter(rna, sex == "Masculino" & infección == "No infectado") ``` -Now let's imagine we are interested in the human homologs of the mouse -genes analysed in this dataset. This information can be found in the -last column of the `rna` tibble, named -`hsapiens_homolog_associated_gene_name`. To visualise it easily, we -will create a new table containing just the 2 columns `gene` and +Ahora imaginemos que estamos interesados en los homólogos humanos de los genes +de ratón analizados en este conjunto de datos. Esta información se puede encontrar en la +última columna del tibble `rna`, denominada +`hsapiens_homolog_associated_gene_name`. Para visualizarlo fácilmente, +crearemos una nueva tabla que contenga solo las 2 columnas `gene` y `hsapiens_homolog_associated_gene_name`. ```{r} @@ -160,345 +160,345 @@ genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) genes ``` -Some mouse genes have no human homologs. These can be retrieved using -`filter()` and the `is.na()` function, that determines whether -something is an `NA`. +Algunos genes de ratón no tienen homólogos humanos. Estos se pueden recuperar usando +`filter()` y la función `is.na()`, que determina si +algo es un `NA`. ```{r, purl=TRUE} -filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +filtro (genes, is.na (hsapiens_homolog_associated_gene_name)) ``` -If we want to keep only mouse genes that have a human homolog, we can -insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an +Si queremos conservar sólo genes de ratón que tienen un homólogo humano, podemos +insertar un "!" 
símbolo que niega el resultado, por lo que estamos pidiendo +cada fila donde hsapiens\_homolog\_associated\_gene\_name _no es_ un `NA`. ```{r, purl=TRUE} -filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +filtro(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` -## Pipes +## Tubería -What if you want to select and filter at the same time? There are three -ways to do this: use intermediate steps, nested functions, or pipes. +¿Qué pasa si quieres seleccionar y filtrar al mismo tiempo? Hay tres +formas de hacer esto: usar pasos intermedios, funciones anidadas o canalizaciones. -With intermediate steps, you create a temporary data frame and use -that as input to the next function, like this: +Con pasos intermedios, crea un marco de datos temporal y lo usa +como entrada para la siguiente función, como esta: ```{r, purl=TRUE} -rna2 <- filter(rna, sex == "Male") -rna3 <- select(rna2, gene, sample, tissue, expression) +rna2 <- filter(rna, sexo == "Masculino") +rna3 <- select(rna2, gen, muestra, tejido, expresión) rna3 ``` -This is readable, but can clutter up your workspace with lots of -intermediate objects that you have to name individually. With multiple -steps, that can be hard to keep track of. +Esto es legible, pero puede saturar tu espacio de trabajo con muchos +objetos intermedios que debes nombrar individualmente. Con múltiples +pasos, puede ser difícil seguirles la pista. -You can also nest functions (i.e. one function inside of another), -like this: +También puedes anidar funciones (es decir, una función dentro de otra), +así: ```{r, purl=TRUE} -rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 <- select(filtro(rna, sexo == "Masculino"), gen, muestra, tejido, expresión) rna3 ``` -This is handy, but can be difficult to read if too many functions are nested, as -R evaluates the expression from the inside out (in this case, filtering, then selecting). 
+Esto es útil, pero puede ser difícil de leer si hay demasiadas funciones anidadas, ya que +R evalúa la expresión de adentro hacia afuera (en este caso, filtra y luego selecciona). -The last option, _pipes_, are a recent addition to R. Pipes let you take -the output of one function and send it directly to the next, which is useful -when you need to do many things to the same dataset. +La última opción, _pipes_, es una adición reciente a R. Pipes te permite tomar +la salida de una función y enviarla directamente a la siguiente, lo cual es útil +cuando necesitas hacer muchas cosas al mismo conjunto de datos. -Pipes in R look like `%>%` (made available via the **`magrittr`** -package) or `|>` (through base R). If you use RStudio, you can type +Las tuberías en R se parecen a `%>%` (disponible a través del paquete **`magrittr`** +) o `|>` (a través de la base R). If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac. -In the above code, we use the pipe to send the `rna` dataset first -through `filter()` to keep rows where `sex` is Male, then through -`select()` to keep only the `gene`, `sample`, `tissue`, and -`expression`columns. +En el código anterior, usamos la tubería para enviar el conjunto de datos `rna` primero +a través de `filter()` para mantener las filas donde `sex` es Masculino, luego a través de +`select()` para mantener solo las columnas `gen`, `muestra`, `tejido` y +`expresión`. -The pipe `%>%` takes the object on its left and passes it directly as -the first argument to the function on its right, we don't need to -explicitly include the data frame as an argument to the `filter()` and -`select()` functions any more. 
+La tubería `%>%` toma el objeto a su izquierda y lo pasa directamente como +el primer argumento de la función a su derecha, no necesitamos +incluir explícitamente el marco de datos como un argumento para las funciones `filter()` y +`select()`. ```{r, purl=TRUE} rna %>% - filter(sex == "Male") %>% - select(gene, sample, tissue, expression) + filtro(sexo == "Masculino") %>% + seleccionar(gen, muestra, tejido, expresión) ``` -Some may find it helpful to read the pipe like the word "then". For instance, -in the above example, we took the data frame `rna`, _then_ we `filter`ed -for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, -`tissue`, and `expression`. +A algunos les puede resultar útil leer la tubería como la palabra "entonces". Por ejemplo, +en el ejemplo anterior, tomamos el marco de datos `rna`, _luego_ `filtramos` +para filas con `sexo == "Masculino"`, _luego_ `seleccionamos` las columnas `gen`, `muestra`, +`tejido` y `expresión`. -The **`dplyr`** functions by themselves are somewhat simple, but by -combining them into linear workflows with the pipe, we can accomplish -more complex manipulations of data frames. +Las funciones **`dplyr`** por sí mismas son algo simples, pero al +combinarlas en flujos de trabajo lineales con la tubería, podemos lograr +manipulaciones más complejas de marcos de datos. 
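
The pipe workflow described above can be sketched with both pipe operators. This assumes the **`dplyr`** package is installed (it makes `%>%` available) and, for the native `|>` pipe, R 4.1.0 or later. The tibble `rna_demo` and the objects `males` and `males_native` are hypothetical names used only for this illustration.

```r
library("dplyr")

# Hypothetical stand-in for the lesson's `rna` table
rna_demo <- tibble(gene       = c("g1", "g2", "g3"),
                   sample     = c("s1", "s2", "s3"),
                   sex        = c("Male", "Female", "Male"),
                   expression = c(10, 20, 30))

# magrittr pipe: the left-hand object becomes the first argument on the right
males <- rna_demo %>%
  filter(sex == "Male") %>%
  select(gene, sample, expression)

# Base R native pipe, written the same way for this simple chain
males_native <- rna_demo |>
  filter(sex == "Male") |>
  select(gene, sample, expression)

males
```

For a chain like this one, where each step takes the data as its first argument, the two operators are interchangeable.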
-If we want to create a new object with this smaller version of the data, we -can assign it a new name: +Si queremos crear un nuevo objeto con esta versión más pequeña de los datos, +podemos asignarle un nuevo nombre: ```{r, purl=TRUE} rna3 <- rna %>% - filter(sex == "Male") %>% - select(gene, sample, tissue, expression) + filter(sexo == "Masculino") %>% + select(gen, muestra, tejido, expresión) rna3 ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge: +## Desafío: -Using pipes, subset the `rna` data to keep observations in female mice at time 0, -where the gene has an expression higher than 50000, and retain only the columns -`gene`, `sample`, `time`, `expression` and `age`. +Usando tuberías, subconjunto de datos de `rna` para mantener las observaciones en ratones hembra en el momento 0, +donde el gen tiene una expresión superior a 50000, y retenga solo las columnas +`gene`, `sample `, `tiempo`, `expresión` y `edad`. -::::::::::::::: solution +::::::::::::::: solución -## Solution +## Solución ```{r} rna %>% - filter(expression > 50000, - sex == "Female", - time == 0 ) %>% - select(gene, sample, time, expression, age) + filtro(expresión > 50000, + sexo == "Mujer", + tiempo == 0 ) %>% + seleccionar(gen, muestra , tiempo, expresión, edad) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Mutate +## Mudar -Frequently you'll want to create new columns based on the values of existing -columns, for example to do unit conversions, or to find the ratio of values in two -columns. For this we'll use `mutate()`. +Con frecuencia querrás crear nuevas columnas basadas en los valores de las columnas +existentes, por ejemplo, para hacer conversiones de unidades o para encontrar la proporción de valores en dos columnas +. Para esto usaremos `mutate()`. 
-To create a new column of time in hours: +Para crear una nueva columna de tiempo en horas: ```{r, purl=TRUE} rna %>% - mutate(time_hours = time * 24) %>% - select(time, time_hours) + mutar(tiempo_horas = tiempo * 24) %>% + seleccionar(tiempo, tiempo_horas) ``` -You can also create a second new column based on the first new column within the same call of `mutate()`: +También puede crear una segunda columna nueva basada en la primera columna nueva dentro de la misma llamada de `mutate()`: ```{r, purl=TRUE} rna %>% - mutate(time_hours = time * 24, - time_mn = time_hours * 60) %>% - select(time, time_hours, time_mn) + mutar(tiempo_horas = tiempo * 24, + tiempo_mn = tiempo_horas * 60) %>% + seleccionar(tiempo, tiempo_horas, tiempo_mn) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -Create a new data frame from the `rna` data that meets the following -criteria: contains only the `gene`, `chromosome_name`, -`phenotype_description`, `sample`, and `expression` columns. The expression -values should be log-transformed. This data frame must -only contain genes located on sex chromosomes, associated with a -phenotype\_description, and with a log expression higher than 5. +Cree un nuevo marco de datos a partir de los datos `rna` que cumpla con los siguientes +criterios: contenga solo el `gen`, `chromosome_name`, +`phenotype_description`, `sample` y `expression`. columnas. Los valores de expresión +deben transformarse logarítmicamente. Este marco de datos +solo debe contener genes ubicados en los cromosomas sexuales, asociados con un +fenotipo\_descripción y con una expresión logarítmica superior a 5. -**Hint**: think about how the commands should be ordered to produce -this data frame! +**Sugerencia**: piense en cómo se deben ordenar los comandos para producir +este marco de datos. 
::::::::::::::: solution -## Solution +## Solución ```{r, eval=TRUE, purl=TRUE} rna %>% mutate(expression = log(expression)) %>% select(gene, chromosome_name, phenotype_description, sample, expression) %>% filter(chromosome_name == "X" | chromosome_name == "Y") %>% filter(!is.na(phenotype_description)) %>% filter(expression > 5) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Split-apply-combine data analysis +## Análisis de datos con el paradigma dividir-aplicar-combinar -Many data analysis tasks can be approached using the -_split-apply-combine_ paradigm: split the data into groups, apply some -analysis to each group, and then combine the results. **`dplyr`** -makes this very easy through the use of the `group_by()` function. +Muchas tareas de análisis de datos se pueden abordar utilizando el paradigma +_split-apply-combine_: divida los datos en grupos, aplique algún análisis +a cada grupo y luego combine los resultados. **`dplyr`** +hace que esto sea muy fácil mediante el uso de la función `group_by()`. ```{r} rna %>% group_by(gene) ``` -The `group_by()` function doesn't perform any data processing, it -groups the data into subsets: in the example above, our initial -`tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$gene))` groups based on the `gene` variable.
+La función `group_by()` no realiza ningún procesamiento de datos, +solo agrupa los datos en subconjuntos: en el ejemplo anterior, nuestro +`tibble` inicial de `r nrow(rna)` observaciones se divide en +`r length(unique(rna$gene))` grupos según la variable `gene`. -We could similarly decide to group the tibble by the samples: +De manera similar, podríamos decidir agrupar el tibble por muestras: ```{r} rna %>% group_by(sample) ``` -Here our initial `tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$sample))` groups based on the `sample` variable. +Aquí nuestro `tibble` inicial de `r nrow(rna)` observaciones se divide en +`r length(unique(rna$sample))` grupos basados en la variable `sample`. -Once the data has been grouped, subsequent operations will be -applied on each group independently. +Una vez agrupados los datos, las operaciones posteriores se aplicarán +en cada grupo de forma independiente. -### The `summarise()` function +### La función `summarise()` -`group_by()` is often used together with `summarise()`, which -collapses each group into a single-row summary of that group. +`group_by()` se usa a menudo junto con `summarise()`, que +colapsa cada grupo en un resumen de una sola fila de ese grupo. -`group_by()` takes as arguments the column names that contain the -**categorical** variables for which you want to calculate the summary -statistics. So to compute the mean `expression` by gene: +`group_by()` toma como argumentos los nombres de las columnas que contienen las variables +**categóricas** para las que desea calcular las estadísticas +de resumen.
Entonces, para calcular la `expression` media por gen: ```{r} rna %>% group_by(gene) %>% summarise(mean_expression = mean(expression)) ``` -We could also want to calculate the mean expression levels of all genes in each sample: +También podríamos querer calcular los niveles medios de expresión de todos los genes en cada muestra: ```{r} rna %>% group_by(sample) %>% summarise(mean_expression = mean(expression)) ``` -But we can can also group by multiple columns: +Pero también podemos agrupar por varias columnas: ```{r} rna %>% group_by(gene, infection, time) %>% summarise(mean_expression = mean(expression)) ``` -Once the data is grouped, you can also summarise multiple variables at the same -time (and not necessarily on the same variable). For instance, we could add a -column indicating the median `expression` by gene and by condition: +Una vez agrupados los datos, también puede resumir varias variables al mismo tiempo +(y no necesariamente en la misma variable). Por ejemplo, podríamos agregar una columna +que indique la `expression` mediana por gen y por condición: ```{r, purl=TRUE} rna %>% group_by(gene, infection, time) %>% summarise(mean_expression = mean(expression), median_expression = median(expression)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Calculate the mean expression level of gene "Dok3" by timepoints. +Calcule el nivel de expresión medio del gen "Dok3" por puntos de tiempo.
::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} rna %>% filter(gene == "Dok3") %>% group_by(time) %>% summarise(mean = mean(expression)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Counting +### Conteo -When working with data, we often want to know the number of observations found -for each factor or combination of factors. For this task, **`dplyr`** provides -`count()`. For example, if we wanted to count the number of rows of data for -each infected and non-infected samples, we would do: +Cuando trabajamos con datos, a menudo queremos saber el número de observaciones encontradas +para cada factor o combinación de factores. Para esta tarea, **`dplyr`** proporciona +`count()`. Por ejemplo, si quisiéramos contar el número de filas de datos para +cada muestra infectada y no infectada, haríamos: ```{r, purl=TRUE} rna %>% count(infection) ``` -The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: +La función `count()` es una abreviatura de algo que ya hemos visto: agrupar por una variable y resumirla contando el número de observaciones en ese grupo. En otras palabras, `rna %>% count(infection)` es equivalente a: ```{r, purl=TRUE} rna %>% group_by(infection) %>% summarise(n = n()) ``` -The previous example shows the use of `count()` to count the number of rows/observations -for _one_ factor (i.e., `infection`).
-If we wanted to count a _combination of factors_, such as `infection` and `time`, -we would specify the first and the second factor as the arguments of `count()`: +El ejemplo anterior muestra el uso de `count()` para contar el número de filas/observaciones +para _un_ factor (es decir, `infection`). +Si quisiéramos contar una _combinación de factores_, como `infection` y `time`, +especificaríamos el primer y el segundo factor como argumentos de `count()`: ```{r, purl=TRUE} rna %>% count(infection, time) ``` -which is equivalent to this: +que es equivalente a esto: ```{r, purl=TRUE} rna %>% group_by(infection, time) %>% summarise(n = n()) ``` -It is sometimes useful to sort the result to facilitate the comparisons. -We can use `arrange()` to sort the table. -For instance, we might want to arrange the table above by time: +A veces resulta útil ordenar el resultado para facilitar las comparaciones. +Podemos usar `arrange()` para ordenar la tabla. +Por ejemplo, es posible que deseemos ordenar la tabla anterior por tiempo: ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(time) ``` -or by counts: +o por conteos: ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(n) ``` -To sort in descending order, we need to add the `desc()` function: +Para ordenar en orden descendente, necesitamos agregar la función `desc()`: ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(desc(n)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -1. How many genes were analysed in each sample? -2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample.
Which sample has the highest sequencing depth? -3. Pick one sample and evaluate the number of genes by biotype. -4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. +1. ¿Cuántos genes se analizaron en cada muestra? +2. Utilice `group_by()` y `summarise()` para evaluar la profundidad de secuenciación (la suma de todos los recuentos) en cada muestra. ¿Qué muestra tiene la mayor profundidad de secuenciación? +3. Elija una muestra y evalúe la cantidad de genes por biotipo. +4. Identifique los genes asociados con la descripción del fenotipo "abnormal DNA methylation" y calcule su expresión media (en log) en el tiempo 0, el tiempo 4 y el tiempo 8. ::::::::::::::: solution -## Solution +## Solución ```{r} ## 1. @@ -522,80 +522,80 @@ rna %>% arrange() ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Reshaping data +## Remodelar datos -In the `rna` tibble, the rows contain expression values (the unit) that are -associated with a combination of 2 other variables: `gene` and `sample`. +En el tibble `rna`, las filas contienen valores de expresión (la unidad) que están +asociados con una combinación de otras 2 variables: `gene` y `sample`. -All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). -The variables that don't change with genes or with samples will have the same value in all the rows. +Todas las demás columnas corresponden a variables que describen +la muestra (organism, age, sex, ...) o el gen (gene\_biotype, ENTREZ\_ID, product, ...). +Las variables que no cambian con genes o con muestras tendrán el mismo valor en todas las filas.
```{r} rna %>% arrange(gene) ``` -This structure is called a `long-format`, as one column contains all the values, -and other column(s) list(s) the context of the value. +Esta estructura se denomina "formato largo", ya que una columna contiene todos los valores, +y otras columnas enumeran el contexto del valor. -In certain cases, the `long-format` is not really "human-readable", and another format, -a `wide-format` is preferred, as a more compact way of representing the data. -This is typically the case with gene expression values that scientists are used to -look as matrices, were rows represent genes and columns represent samples. +En ciertos casos, el "formato largo" no es realmente "legible para humanos", y se prefiere otro formato, +un "formato ancho", como una forma más compacta de representar los datos. +Este suele ser el caso de los valores de expresión genética que los científicos están acostumbrados a considerar +como matrices, donde las filas representan genes y las columnas representan muestras. -In this format, it would therefore become straightforward -to explore the relationship between the gene expression levels within, and -between, the samples. +En este formato, por lo tanto, sería sencillo +explorar la relación entre los niveles de expresión genética dentro y +entre las muestras. ```{r, echo=FALSE} rna %>% select(gene, sample, expression) %>% pivot_wider(names_from = sample, values_from = expression) ``` -To convert the gene expression values from `rna` into a wide-format, -we need to create a new table where the values of the `sample` column would -become the names of column variables. +Para convertir los valores de expresión genética de `rna` a un formato ancho, +necesitamos crear una nueva tabla donde los valores de la columna `sample` +se conviertan en los nombres de las variables de columna.
The key point here is that we are still following a tidy data structure, but we have **reshaped** the data according to the observations of interest: expression levels per gene instead of recording them per gene and per sample. -The opposite transformation would be to transform column names into -values of a new variable. +La transformación opuesta sería transformar los nombres de las columnas en +valores de una nueva variable. -We can do both these of transformations with two `tidyr` functions, -`pivot_longer()` and `pivot_wider()` (see -[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for -details). +Podemos hacer ambas transformaciones con dos funciones de `tidyr`, +`pivot_longer()` y `pivot_wider()` (consulte +[aquí](https://tidyr.tidyverse.org/dev/articles/pivot.html) para más +detalles). -### Pivoting the data into a wider format +### Pivotar los datos a un formato más ancho -Let's select the first 3 columns of `rna` and use `pivot_wider()` -to transform the data into a wide-format. +Seleccionemos las primeras 3 columnas de `rna` y usemos `pivot_wider()` +para transformar los datos a un formato ancho. ```{r, purl=TRUE} rna_exp <- rna %>% select(gene, sample, expression) rna_exp ``` -`pivot_wider` takes three main arguments: +`pivot_wider` toma tres argumentos principales: -1. the data to be transformed; -2. the `names_from` : the column whose values will become new column - names; -3. the `values_from`: the column whose values will fill the new - columns. +1. los datos a transformar; +2. `names_from`: la columna cuyos valores se convertirán en los nuevos nombres + de columna; +3. `values_from`: la columna cuyos valores llenarán las nuevas + columnas.
-\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="Pivote ancho de los datos `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") ```` @@ -607,11 +607,11 @@ rna_wide <- rna_exp %>% rna_wide ```` -Note that by default, the `pivot_wider()` function will add `NA` for missing values. +Tenga en cuenta que, de forma predeterminada, la función `pivot_wider()` agregará `NA` para los valores faltantes. -Let's imagine that for some reason, we had some missing expression values for some -genes in certain samples. In the following fictive example, the gene Cyp2d22 has only -one expression value, in GSM2545338 sample. +Imaginemos que, por alguna razón, nos faltan algunos valores de expresión para algunos genes +en ciertas muestras. En el siguiente ejemplo ficticio, el gen Cyp2d22 tiene solo +un valor de expresión, en la muestra GSM2545338. ```{r, purl=TRUE} rna_with_missing_values <- rna %>% @@ -623,39 +623,39 @@ rna_with_missing_values <- rna %>% rna_with_missing_values ``` -By default, the `pivot_wider()` function will add `NA` for missing -values. This can be parameterised with the `values_fill` argument of -the `pivot_wider()` function. +De forma predeterminada, la función `pivot_wider()` agregará `NA` para los valores +faltantes. Esto se puede parametrizar con el argumento `values_fill` de +la función `pivot_wider()`. ```{r, purl=TRUE} rna_with_missing_values %>% pivot_wider(names_from = sample, values_from = expression) rna_with_missing_values %>% pivot_wider(names_from = sample, values_from = expression, values_fill = 0) ``` -### Pivoting data into a longer format +### Pivotar datos a un formato más largo -In the opposite situation we are using the column names and turning them into -a pair of new variables.
One variable represents the column names as -values, and the other variable contains the values previously -associated with the column names. +En la situación opuesta, usamos los nombres de las columnas y los convertimos en +un par de nuevas variables. Una variable representa los nombres de las columnas como valores +y la otra variable contiene los valores previamente +asociados con los nombres de las columnas. -`pivot_longer()` takes four main arguments: +`pivot_longer()` toma cuatro argumentos principales: -1. the data to be transformed; -2. the `names_to`: the new column name we wish to create and populate with the - current column names; -3. the `values_to`: the new column name we wish to create and populate with - current values; -4. the names of the columns to be used to populate the `names_to` and - `values_to` variables (or to drop). +1. los datos a transformar; +2. `names_to`: el nombre de la nueva columna que deseamos crear y completar con los + nombres de las columnas actuales; +3. `values_to`: el nombre de la nueva columna que deseamos crear y completar con + los valores actuales; +4. los nombres de las columnas que se utilizarán para completar las variables `names_to` y + `values_to` (o para eliminar). -\`\`\`{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="Pivote largo de los datos `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") ```` @@ -675,28 +675,28 @@ rna_long <- rna_wide %>% rna_long ```` -We could also have used a specification for what columns to -include. This can be useful if you have a large number of identifying -columns, and it's easier to specify what to gather than what to leave -alone. Here the `starts_with()` function can help to retrieve sample -names without having to list them all! -Another possibility would be to use the `:` operator! +También podríamos haber usado una especificación sobre qué columnas +incluir.
Esto puede ser útil si tiene una gran cantidad de columnas de identificación +y es más fácil especificar qué recopilar que qué dejar +sin tocar. Aquí, la función `starts_with()` puede ayudar a recuperar +los nombres de las muestras sin tener que enumerarlos todos. +¡Otra posibilidad sería utilizar el operador `:`! ```{r} rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", cols = starts_with("GSM")) rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", GSM2545336:GSM2545380) ``` -Note that if we had missing values in the wide-format, the `NA` would be -included in the new long format. +Tenga en cuenta que si nos faltaran valores en el formato ancho, `NA` estaría +incluido en el nuevo formato largo. -Remember our previous fictive tibble containing missing values: +Recuerde nuestro tibble ficticio anterior que contiene valores faltantes: ```{r} rna_with_missing_values @@ -712,113 +712,113 @@ wide_with_NA %>% -gene) ``` -Pivoting to wider and longer formats can be a useful way to balance out a dataset -so every replicate has the same composition. +Pivotar a formatos más anchos y más largos puede ser una forma útil de equilibrar un conjunto de datos +para que cada réplica tenga la misma composición. ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## Pregunta -Starting from the rna table, use the `pivot_wider()` function to create -a wide-format table giving the gene expression levels in each mouse. -Then use the `pivot_longer()` function to restore a long-format table. +A partir de la tabla rna, utilice la función `pivot_wider()` para crear +una tabla de formato ancho que proporcione los niveles de expresión genética en cada ratón.
+Luego use la función `pivot_longer()` para restaurar una tabla de formato largo. ::::::::::::::: solution -## Solution +## Solución ```{r, answer=TRUE, purl=TRUE} rna1 <- rna %>% select(gene, mouse, expression) %>% pivot_wider(names_from = mouse, values_from = expression) rna1 rna1 %>% pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## Pregunta -Subset genes located on X and Y chromosomes from the `rna` data frame and -spread the data frame with `sex` as columns, `chromosome_name` as -rows, and the mean expression of genes located in each chromosome as the values, -as in the following tibble: +Extraiga los genes ubicados en los cromosomas X e Y del marco de datos `rna` y +reorganice el marco de datos con `sex` como columnas, `chromosome_name` como +filas y la expresión media de los genes ubicados en cada cromosoma como valores, +como en el siguiente tibble: ```{r, echo=FALSE, message=FALSE} knitr::include_graphics("fig/Exercise_pivot_W.png") ``` -You will need to summarise before reshaping! +¡Necesitará resumir antes de remodelar! ::::::::::::::: solution -## Solution +## Solución -Let's first calculate the mean expression level of X and Y linked genes from -male and female samples... +Primero calculemos el nivel de expresión medio de los genes ligados a X e Y de +muestras masculinas y femeninas...
```{r} rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% summarise(mean = mean(expression)) ``` -And pivot the table to wide format +Y pivote la tabla a formato ancho ```{r, answer=TRUE, purl=TRUE} rna_1 <- rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% summarise(mean = mean(expression)) %>% pivot_wider(names_from = sex, values_from = mean) rna_1 ``` -Now take that data frame and transform it with `pivot_longer()` so -each row is a unique `chromosome_name` by `gender` combination. +Ahora tome ese marco de datos y transfórmelo con `pivot_longer()` para que +cada fila sea una combinación única de `chromosome_name` y `gender`. ```{r, answer=TRUE, purl=TRUE} rna_1 %>% pivot_longer(names_to = "gender", values_to = "mean", -chromosome_name) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## Pregunta -Use the `rna` dataset to create an expression matrix where each row -represents the mean expression levels of genes and columns represent -the different timepoints. +Utilice el conjunto de datos `rna` para crear una matriz de expresión donde cada fila +represente los niveles de expresión medios de genes y las columnas representen +los diferentes puntos de tiempo.
::::::::::::::: solution -## Solution +## Solución -Let's first calculate the mean expression by gene and by time +Primero calculemos la expresión media por gen y por tiempo. ```{r} rna %>% group_by(gene, time) %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot\_wider() function +antes de usar la función pivot\_wider() ```{r} rna_time <- rna %>% @@ -829,9 +829,9 @@ rna_time <- rna %>% rna_time ``` -Notice that this generates a tibble with some column names starting by a number. -If we wanted to select the column corresponding to the timepoints, -we could not use the column names directly... What happens when we select the column 4? +Observe que esto genera un tibble con algunos nombres de columnas que comienzan con un número. +Si quisiéramos seleccionar la columna correspondiente a los puntos de tiempo, +no podríamos usar los nombres de las columnas directamente... ¿Qué pasa cuando seleccionamos la columna 4?
```{r} rna %>% @@ -842,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\\`" +Para seleccionar el punto de tiempo 4, tendríamos que encerrar el nombre de la columna entre comillas invertidas "\\`" ```{r} rna %>% @@ -853,8 +853,8 @@ rna %>% select(gene, `4`) ``` -Another possibility would be to rename the column, -choosing a name that doesn't start by a number : +Otra posibilidad sería cambiar el nombre de la columna, +eligiendo un nombre que no comience con un número: ```{r} rna %>% @@ -866,37 +866,37 @@ rna %>% select(gene, time4) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge -## Question +## Pregunta -Use the previous data frame containing mean expression levels per timepoint and create -a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes -between timepoint 8 and timepoint 4. -Convert this table into a long-format table gathering the fold-changes calculated. +Utilice el marco de datos anterior que contiene los niveles de expresión medios por punto de tiempo y cree +una nueva columna que contenga los fold-changes entre el punto de tiempo 8 y el punto de tiempo 0, y los fold-changes +entre el punto de tiempo 8 y el punto de tiempo 4. +Convierta esta tabla en una tabla de formato largo que recopile los fold-changes calculados.
::::::::::::::: solution -## Solution +## Solución -Starting from the rna\_time tibble: +A partir del tibble rna\_time: ```{r} rna_time ``` -Calculate fold-changes: +Calcule los fold-changes: ```{r} rna_time %>% mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot\_longer() function: +Y use la función pivot\_longer(): ```{r} rna_time %>% @@ -906,142 +906,142 @@ rna_time %>% time_8_vs_0:time_8_vs_4) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +## Unir tablas -## Joining tables +En muchas situaciones de la vida real, los datos se distribuyen en varias tablas. +Por lo general, esto ocurre porque se recopilan +diferentes tipos de información de diferentes fuentes. -In many real life situations, data are spread across multiple tables. -Usually this occurs because different types of information are -collected from different sources. +Puede ser deseable que algunos análisis combinen datos de dos o más tablas +en un solo marco de datos basado en una columna que sería común +a todas las tablas. -It may be desirable for some analyses to combine data from two or more -tables into a single data frame based on a column that would be common -to all the tables. +El paquete `dplyr` proporciona un conjunto de funciones de unión para combinar dos marcos de datos +basados en coincidencias dentro de columnas especificadas. Aquí, +proporcionamos una breve introducción a las uniones. Para obtener más información, +consulte el capítulo sobre [uniones de +tablas](https://uclouvain-cbio.github.io/WSBIM1207/sec-join.html). La +Hoja de trucos de transformación de datos -The `dplyr` package provides a set of join functions for combining two -data frames based on matches within specified columns.
Here, we -provide a short introduction to joins. For further reading, please -refer to the chapter about table -joins. The -Data Transformation Cheat -Sheet -also provides a short overview on table joins. +también proporciona una breve descripción general de las uniones de tablas. -We are going to illustrate join using a small table, `rna_mini` that -we will create by subsetting the original `rna` table, keeping only 3 -columns and 10 lines. +Vamos a ilustrar la unión usando una pequeña tabla, `rna_mini`, que +crearemos a partir de un subconjunto de la tabla `rna` original, conservando solo 3 +columnas y 10 filas. ```{r} rna_mini <- rna %>% select(gene, sample, expression) %>% head(10) rna_mini ``` -The second table, `annot1`, contains 2 columns, gene and -gene\_description. You can either -[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) -by clicking on the link and then moving it to the `data/` folder, or -you can use the R code below to download it directly to the folder. +La segunda tabla, `annot1`, contiene 2 columnas, gene y +gene\_description. Puede +[descargar annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +haciendo clic en el enlace y luego moviéndolo a la carpeta `data/`, o +puede usar el código R a continuación para descargarlo directamente a la carpeta. ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", destfile = "data/annot1.csv") annot1 <- read_csv(file = "data/annot1.csv") annot1 ``` -We now want to join these two tables into a single one containing all -variables using the `full_join()` function from the `dplyr` package. The -function will automatically find the common variable to match columns -from the first and second table. In this case, `gene` is the common -variable. Such variables are called keys.
Keys are used to match -observations across different tables. +Ahora queremos unir estas dos tablas en una sola que contenga todas las variables +usando la función `full_join()` del paquete `dplyr`. La función +encontrará automáticamente la variable común que coincida con las columnas +de la primera y segunda tabla. En este caso, `gene` es la variable +común. Estas variables se denominan claves. Las claves se utilizan para hacer coincidir +observaciones en diferentes tablas. ```{r} full_join(rna_mini, annot1) ``` -In real life, gene annotations are sometimes labelled differently. +En la vida real, las anotaciones genéticas a veces se etiquetan de manera diferente. -The `annot2` table is exactly the same than `annot1` except that the -variable containing gene names is labelled differently. Again, either -[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) -yourself and move it to `data/` or use the R code below. +La tabla `annot2` es exactamente igual que `annot1` excepto que la variable +que contiene los nombres de los genes está etiquetada de manera diferente. Nuevamente, puede +[descargar annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +usted mismo y moverlo a `data/`, o usar el código R a continuación. ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", destfile = "data/annot2.csv") annot2 <- read_csv(file = "data/annot2.csv") annot2 ``` -In case none of the variable names match, we can set manually the -variables to use for the matching. These variables can be set using -the `by` argument, as shown below with `rna_mini` and `annot2` tables. +En caso de que ninguno de los nombres de las variables coincida, podemos configurar manualmente las +variables que se utilizarán para la coincidencia.
Estas variables se pueden configurar usando +el argumento `by`, como se muestra a continuación con las tablas `rna_mini` y `annot2`. ```{r} full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) ``` -As can be seen above, the variable name of the first table is retained -in the joined one. +Como se puede ver arriba, el nombre de la variable de la primera tabla se conserva +en la unida. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge: +## Desafío: -Download the `annot3` table by clicking -[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) -and put the table in your data/ repository. Using the `full_join()` -function, join tables `rna_mini` and `annot3`. What has happened for -genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_ ? +Descargue la tabla `annot3` haciendo clic +[aquí](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +y coloque la tabla en su repositorio de datos. Usando la función `full_join()` +, une las tablas `rna_mini` y `annot3`. ¿Qué ha sucedido con los +genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_ y _mt-Tl1_? -::::::::::::::: solution +::::::::::::::: solución -## Solution +## Solución ```{r, message=FALSE} annot3 <- read_csv("data/annot3.csv") full_join(rna_mini, annot3) ``` -Genes _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, -_mt-Rnr2_, and _mt-Tl1_ are only present in `annot3` table. Their respective values for the -variables of the table have been encoded as missing. +Los genes _Klk6_ solo están presentes en `rna_mini`, mientras que los genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_ y _mt-Tl1_ están solo está presente en la tabla `annot3`. Sus valores respectivos para las variables +de la tabla se han codificado como faltantes. 
-::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Exporting data +## Exportar datos -Now that you have learned how to use `dplyr` to extract information from -or summarise your raw data, you may want to export these new data sets to share -them with your collaborators or for archival. +Ahora que ha aprendido a utilizar `dplyr` para extraer información de +o resumir sus datos sin procesar, es posible que desee exportar estos nuevos conjuntos de datos para compartirlos +con sus colaboradores o para archivarlos. -Similar to the `read_csv()` function used for reading CSV files into R, there is -a `write_csv()` function that generates CSV files from data frames. +Similar a la función `read_csv()` utilizada para leer archivos CSV en R, existe +una función `write_csv()` que genera archivos CSV a partir de marcos de datos. -Before using `write_csv()`, we are going to create a new folder, `data_output`, -in our working directory that will store this generated dataset. We don't want -to write generated datasets in the same directory as our raw data. -It's good practice to keep them separate. The `data` folder should only contain -the raw, unaltered data, and should be left alone to make sure we don't delete -or modify it. In contrast, our script will generate the contents of the `data_output` -directory, so even if the files it contains are deleted, we can always -re-generate them. +Antes de usar `write_csv()`, vamos a crear una nueva carpeta, `data_output`, +en nuestro directorio de trabajo que almacenará este conjunto de datos generado. No queremos que +escriba conjuntos de datos generados en el mismo directorio que nuestros datos sin procesar. +Es una buena práctica mantenerlos separados. La carpeta `data` solo debe contener +los datos sin procesar y sin modificar, y debe dejarse en paz para asegurarnos de que no los eliminemos +ni los modifiquemos. 
Por el contrario, nuestro script generará el contenido del directorio `data_output` +, por lo que incluso si los archivos que contiene se eliminan, siempre podemos +volver a generarlos. -Let's use `write_csv()` to save the rna\_wide table that we have created previously. +Usemos `write_csv()` para guardar la tabla rna\_wide que hemos creado anteriormente. ```{r, purl=TRUE, eval=FALSE} -write_csv(rna_wide, file = "data_output/rna_wide.csv") +write_csv(rna_wide, archivo = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: puntos clave -- Tabular data in R using the tidyverse meta-package +- Datos tabulares en R usando el metapaquete tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From a4f112b710216beade136476dbec88c14c9154f7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:26 +0900 Subject: [PATCH 173/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 1007 ++++++++++++----------- 1 file changed, 504 insertions(+), 503 deletions(-) diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd index f0b7de9b3..97849eb0b 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -1,241 +1,241 @@ --- source: Rmd -title: Data visualization +title: Visualización de datos teaching: 60 exercises: 60 --- ```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/ datos/rnaseq.csv", + destfile = "datos/rnaseq.csv") ``` ::::::::::::::::::::::::::::::::::::::: objetivos -- Produce scatter plots, 
boxplots, line plots, etc. using ggplot. -- Set universal plot settings. -- Describe what faceting is and apply faceting in ggplot. -- Modify the aesthetics of an existing ggplot plot (including axis labels and color). -- Build complex and customized plots from data in a data frame. +- Produzca diagramas de dispersión, diagramas de caja, diagramas de líneas, etc. utilizando ggplot. +- Establezca configuraciones de trama universales. +- Describe qué es el facetado y aplícalo en ggplot. +- Modifique la estética de un gráfico ggplot existente (incluidas las etiquetas de los ejes y el color). +- Cree gráficos complejos y personalizados a partir de datos en un marco de datos. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: preguntas -- Visualization in R +- Visualización en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r vis_setup, echo=FALSE} -rna <- read.csv("data/rnaseq.csv") +arn <- read.csv("datos/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Este episodio se basa en la lección _Análisis de datos y +> Visualización en R para ecologistas_ de Data Carpentries. -## Data Visualization +## Visualización de datos -We start by loading the required packages. **`ggplot2`** is included in -the **`tidyverse`** package. +Comenzamos cargando los paquetes requeridos. **`ggplot2`** está incluido en +el paquete **`tidyverse`**. ```{r load-package, message=FALSE, purl=TRUE} -library("tidyverse") +biblioteca("tidyverse") ``` -If not still in the workspace, load the data we saved in the previous -lesson. +Si aún no está en el espacio de trabajo, cargue los datos que guardamos en la lección +anterior. 
```{r load-data, eval=FALSE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +arn <- read.csv("datos/rnaseq.csv") ``` -The Data Visualization Cheat -Sheet -will cover the basics and more advanced features of `ggplot2` and will -help, in addition to serve as a reminder, getting an overview of the -many data representations available in the package. The following video -tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and -[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen -are also very instructive. +La Hoja de trucos de visualización de datos -## Plotting with `ggplot2` +cubrirá los conceptos básicos y las funciones más avanzadas de ` ggplot2` y +ayudará, además de servir como recordatorio, a obtener una descripción general de las +muchas representaciones de datos disponibles en el paquete. Los siguientes videos +tutoriales ([parte 1](https://www.youtube.com/watch?v=h29g21z0a68) y +[2](https://www.youtube.com /watch?v=0m4yywqNPVY)) de Thomas Lin Pedersen +también son muy instructivos. -`ggplot2` is a plotting package that makes it simple to create complex -plots from data in a data frame. It provides a more programmatic -interface for specifying what variables to plot, how they are displayed, -and general visual properties. The theoretical foundation that supports -the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this -approach, we only need minimal changes if the underlying data change or -if we decide to change from a bar plot to a scatterplot. This helps in -creating publication quality plots with minimal amounts of adjustments -and tweaking. +## Trazar con `ggplot2` -There is a book about `ggplot2` (@ggplot2book) that provides a good -overview, but it is outdated. The 3rd edition is in preparation and will -be [freely available online](https://ggplot2-book.org/). The `ggplot2` -webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. 
+`ggplot2` es un paquete de trazado que simplifica la creación de trazados +complejos a partir de datos en un marco de datos. Proporciona una interfaz +más programática para especificar qué variables trazar, cómo se muestran, +y propiedades visuales generales. El fundamento teórico que sustenta +el `ggplot2` es la _Gramática de Gráficos_ (@Wilkinson:2005). Usando este enfoque +, solo necesitamos cambios mínimos si los datos subyacentes cambian o +si decidimos cambiar de un diagrama de barras a un diagrama de dispersión. Esto ayuda a +a crear gráficos con calidad de publicación con una cantidad mínima de ajustes +y ajustes. -`ggplot2` functions like data in the 'long' format, i.e., a column for -every dimension, and a row for every observation. Well-structured data -will save you lots of time when making figures with `ggplot2`. +Hay un libro sobre `ggplot2` (@ggplot2book) que proporciona una buena +descripción general, pero está desactualizado. La tercera edición está en preparación y +estará [disponible gratuitamente en línea](https://ggplot2-book.org/). La página web `ggplot2` +([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) proporciona amplia documentación. -ggplot graphics are built step by step by adding new elements. Adding -layers in this fashion allows for extensive flexibility and -customization of plots. +`ggplot2` funciona como datos en formato 'largo', es decir, una columna para +cada dimensión y una fila para cada observación. Los datos bien estructurados +te ahorrarán mucho tiempo al hacer figuras con `ggplot2`. -> The idea behind the Grammar of Graphics it is that you can build every -> graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] +Los gráficos de ggplot se crean paso a paso agregando nuevos elementos. Agregar +capas de esta manera permite una gran flexibilidad y +personalización de los gráficos. 
-[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). +> La idea detrás de la Gramática de Gráficos es que puedes construir cada gráfico +> a partir de los mismos 3 componentes: (1) un conjunto de datos, (2) un sistema de coordenadas, +> y (3) geoms. — es decir, marcas visuales que representan puntos de datos \[^tres\\_comp\\_ggplot2] -To build a ggplot, we will use the following basic template that can be -used for different types of plots: +[^three_comp_ggplot2]: Fuente: [Hoja de referencia de visualización de datos](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +Para construir un ggplot, usaremos la siguiente plantilla básica que se puede +usar para diferentes tipos de gráficos: ``` -ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +ggplot(datos = <DATA>, mapeo = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() ``` -- use the `ggplot()` function and bind the plot to a specific **data - frame** using the `data` argument +- use la función `ggplot()` y vincule el gráfico a un \*\*marco de datos + \*\* específico usando el argumento `data` ```{r, eval=FALSE} -ggplot(data = rna) +ggplot(datos = arn) ``` -- define a **mapping** (using the aesthetic (`aes`) function), by - selecting the variables to be plotted and specifying how to present - them in the graph, e.g. as x/y positions or characteristics such as - size, shape, color, etc. +- defina un **mapeo** (usando la función estética (`aes`)), seleccionando + las variables que se trazarán y especificando cómo presentarlas + en el gráfico, por ejemplo, como x/ y posiciones o características como + tamaño, forma, color, etc. ```{r, eval=FALSE} -ggplot(data = rna, mapping = aes(x = expression)) +ggplot(datos = rna, mapeo = aes(x = expresión)) ``` -- add '**geoms**' - geometries, or graphical representations of the - data in the plot (points, lines, bars). 
`ggplot2` offers many - different geoms; we will use some common ones today, including: +- agregue '**geoms**': geometrías o representaciones gráficas de los datos + en el gráfico (puntos, líneas, barras). `ggplot2` ofrece muchas + geoms diferentes; Usaremos algunos comunes hoy, que incluyen: ``` - * `geom_point()` for scatter plots, dot plots, etc. - * `geom_histogram()` for histograms - * `geom_boxplot()` for, well, boxplots! - * `geom_line()` for trend lines, time series, etc. + * `geom_point()` para diagramas de dispersión, diagramas de puntos, etc. + * `geom_histogram()` para histogramas + * `geom_boxplot()` para, bueno, diagramas de caja. + * `geom_line()` para líneas de tendencia, series de tiempo, etc. ``` -To add a geom(etry) to the plot use the `+` operator. Let's use -`geom_histogram()` first: +Para agregar una geometría (etry) al gráfico, use el operador `+`. Usemos +`geom_histogram()` primero: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, mapping = aes(x = expression)) + +ggplot(datos = rna, mapeo = aes(x = expresión)) + geom_histogram() ``` -The `+` in the `ggplot2` package is particularly useful because it -allows you to modify existing `ggplot` objects. This means you can -easily set up plot templates and conveniently explore different types of -plots, so the above plot can also be generated with code like this: +El `+` en el paquete `ggplot2` es particularmente útil porque +te permite modificar objetos `ggplot` existentes. 
Esto significa que puedes +configurar fácilmente plantillas de gráficos y explorar cómodamente diferentes tipos de gráficos +, por lo que el gráfico anterior también se puede generar con un código como este: ```{r, eval=FALSE, purl=TRUE} -# Assign plot to a variable +# Asignar gráfico a una variable rna_plot <- ggplot(data = rna, - mapping = aes(x = expression)) + mapeo = aes(x = expresión)) -# Draw the plot -rna_plot + geom_histogram() +# Dibujar el gráfico +rna_plot + geom_histograma() ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -You have probably noticed an automatic message that appears when -drawing the histogram: +Probablemente hayas notado un mensaje automático que aparece cuando +dibuja el histograma: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna, aes(x = expression)) + +ggplot(rna, aes(x = expresión)) + geom_histogram() ``` -Change the arguments `bins` or `binwidth` of `geom_histogram()` to -change the number or width of the bins. +Cambie los argumentos `bins` o `binwidth` de `geom_histogram()` a +cambie el número o ancho de los bins. ::::::::::::::: solution -## Solution +## Solución ```{r, purl=TRUE} -# change bins -ggplot(rna, aes(x = expression)) + +# cambiar contenedores +ggplot(rna, aes(x = expresión)) + geom_histogram(bins = 15) -# change binwidth -ggplot(rna, aes(x = expression)) + +# cambiar ancho de contenedor +ggplot(rna, aes( x = expresión)) + geom_histogram(binwidth = 2000) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -We can observe here that the data are skewed to the right. We can apply -log2 transformation to have a more symmetric distribution. Note that we -add here a small constant value (`+1`) to avoid having `-Inf` values -returned for expression values equal to 0. 
+Podemos observar aquí que los datos están sesgados hacia la derecha. Podemos aplicar la transformación +log2 para tener una distribución más simétrica. Tenga en cuenta que +agregamos aquí un pequeño valor constante (`+1`) para evitar que se devuelvan valores `-Inf` +para valores de expresión iguales a 0. ```{r log-transfo, cache=FALSE, purl=TRUE} rna <- rna %>% - mutate(expression_log = log2(expression + 1)) + mutar(expresión_log = log2(expresión + 1)) ``` -If we now draw the histogram of the log2-transformed expressions, the -distribution is indeed closer to a normal distribution. +Si ahora dibujamos el histograma de las expresiones transformadas log2, la distribución +está más cerca de una distribución normal. ```{r second-ggplot, cache=FALSE, purl=TRUE} -ggplot(rna, aes(x = expression_log)) + geom_histogram() +ggplot(rna, aes(x = expresión_log)) + geom_histogram() ``` -From now on we will work on the log-transformed expression values. +De ahora en adelante trabajaremos en los valores de expresión transformados logarítmicamente. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -Another way to visualize this transformation is to consider the scale -of the observations. For example, it may be worth changing the scale -of the axis to better distribute the observations in the space of the -plot. Changing the scale of the axes is done similarly to -adding/modifying other components (i.e., by incrementally adding -commands). Try making this modification: +Otra forma de visualizar esta transformación es considerar la escala +de las observaciones. Por ejemplo, puede que valga la pena cambiar la escala +del eje para distribuir mejor las observaciones en el espacio del gráfico +. Cambiar la escala de los ejes se realiza de manera similar a +agregar/modificar otros componentes (es decir, agregando incrementalmente comandos +). 
Intenta hacer esta modificación: -- Represent the un-transformed expression on the log10 scale; see - `scale_x_log10()`. Compare it with the previous graph. Why do you - now have warning messages appearing? +- Representa la expresión no transformada en la escala log10; ver + `scale_x_log10()`. Compáralo con el gráfico anterior. ¿Por qué + ahora aparecen mensajes de advertencia? -::::::::::::::: solution +::::::::::::::: solución -## Solution +## Solución ```{r, eval=TRUE, purl=TRUE, echo=TRUE} -ggplot(data = rna,mapping = aes(x = expression))+ +ggplot(datos = rna,mapping = aes(x = expresión))+ geom_histogram() + scale_x_log10() ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -**Notes** +**Notas** -- Anything you put in the `ggplot()` function can be seen by any geom - layers that you add (i.e., these are global plot settings). This - includes the x- and y-axis mapping you set up in `aes()`. -- You can also specify mappings for a given geom independently of the - mappings defined globally in the `ggplot()` function. -- The `+` sign used to add new layers must be placed at the end of the - line containing the _previous_ layer. If, instead, the `+` sign is +- Todo lo que pongas en la función `ggplot()` puede ser visto por cualquier geom + capas que agregues (es decir, estas son configuraciones de trazado globales). Este + incluye el mapeo de los ejes x e y que configuró en `aes()`. +- También puede especificar asignaciones para una geom determinada independientemente de las + asignaciones definidas globalmente en la función `ggplot()`. +- El signo `+` usado para agregar nuevas capas debe colocarse al final de la línea + que contiene la capa _anterior_. If, instead, the `+` sign is added at the beginning of the line containing the new layer, `ggplot2` will not add the new layer and will return an error message. 
@@ -250,235 +250,235 @@ rna_plot + geom_histogram() ``` -## Building your plots iteratively +## Construyendo sus parcelas de forma iterativa -We will now draw a scatter plot with two continuous variables and the -`geom_point()` function. This graph will represent the log2 fold changes -of expression comparing time 8 versus time 0, and time 4 versus time 0. -To this end, we first need to compute the means of the log-transformed -expression values by gene and time, then the log fold changes by -subtracting the mean log expressions between time 8 and time 0 and -between time 4 and time 0. Note that we also include here the gene -biotype that we will use later on to represent the genes. We will save -the fold changes in a new data frame called `rna_fc.` +Ahora dibujaremos un diagrama de dispersión con dos variables continuas y la función +`geom_point()`. Este gráfico representará los cambios log2 +de la expresión que compara el tiempo 8 con el tiempo 0 y el tiempo 4 con el tiempo 0. +Para este fin, primero necesitamos calcular las medias de los valores de expresión +transformados logarítmicamente por gen y tiempo, luego el pliegue logarítmico cambia +restando las expresiones logarítmicas medias entre el tiempo 8 y el tiempo 0. y +entre el tiempo 4 y el tiempo 0. Tenga en cuenta que también incluimos aquí el biotipo del gen +que usaremos más adelante para representar los genes. 
Guardaremos +los cambios de pliegue en un nuevo marco de datos llamado `rna_fc.` ```{r rna_fc, cache=FALSE, purl=TRUE} -rna_fc <- rna %>% select(gene, time, - gene_biotype, expression_log) %>% - group_by(gene, time, gene_biotype) %>% - summarize(mean_exp = mean(expression_log)) %>% - pivot_wider(names_from = time, - values_from = mean_exp) %>% +rna_fc <- rna %>% seleccionar(gen, tiempo, + biotipo_gen, registro_expresión) %>% + group_by(gen, tiempo, biotipo_gen) %>% + resumir(exp_media = mean(expression_log)) %>% + pivot_wider(names_from = tiempo, + valores_from = mean_exp) %>% mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) ``` -We can then build a ggplot with the newly created dataset `rna_fc`. -Building plots with `ggplot2` is typically an iterative process. We -start by defining the dataset we'll use, lay out the axes, and choose a -geom: +Luego podemos construir un ggplot con el conjunto de datos recién creado `rna_fc`. +La construcción de parcelas con `ggplot2` suele ser un proceso iterativo. +comenzamos definiendo el conjunto de datos que usaremos, diseñamos los ejes y elegimos una geom +: ```{r create-ggplot-object, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(datos = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point() ``` -Then, we start modifying this plot to extract more information from it. -For instance, we can add transparency (`alpha`) to avoid overplotting: +Luego, comenzamos a modificar este gráfico para extraer más información del mismo. 
+Por ejemplo, podemos agregar transparencia (`alfa`) para evitar el trazado excesivo: ```{r adding-transparency, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3) ``` -We can also add colors for all the points: +También podemos agregar colores para todos los puntos: ```{r adding-colors, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + - geom_point(alpha = 0.3, color = "blue") +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "azul") ``` -Or to color each gene in the plot differently, you could use a vector as -an input to the argument **color**. `ggplot2` will provide a different -color corresponding to different values in the vector. Here is an -example where we color with `gene_biotype`: +O para colorear cada gen en el gráfico de manera diferente, puede usar un vector como +una entrada para el argumento **color**. `ggplot2` proporcionará un color +diferente correspondiente a diferentes valores en el vector. Aquí hay un +ejemplo donde coloreamos con `gene_biotype`: ```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3, aes(color = gene_biotype)) ``` -We can also specify the colors directly inside the mapping provided in -the `ggplot()` function. This will be seen by any geom layers and the -mapping will be determined by the x- and y-axis set up in `aes()`. +También podemos especificar los colores directamente dentro del mapeo proporcionado en +la función `ggplot()`. Esto será visto por cualquier capa de geom y el mapeo +estará determinado por los ejes x e y configurados en `aes()`. 
```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0, color = gene_biotype)) + geom_point(alpha = 0.3) ``` -Finally, we could also add a diagonal line with the `geom_abline()` -function: +Finalmente, también podríamos agregar una línea diagonal con la función `geom_abline()` +: ```{r adding-diag, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0, color = gene_biotype)) + geom_point(alpha = 0.3) + geom_abline(intercept = 0) ``` -Notice that we can change the geom layer from `geom_point` to -`geom_jitter` and colors will still be determined by `gene_biotype`. +Tenga en cuenta que podemos cambiar la capa geom de `geom_point` a +`geom_jitter` y los colores seguirán estando determinados por `gene_biotype`. ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0, color = gene_biotype)) + geom_jitter(alpha = 0.3) + geom_abline(intercept = 0) ``` ```{r, echo=FALSE, message=FALSE} -library("hexbin") +biblioteca("hexbin") ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -Scatter plots can be useful exploratory tools for small datasets. For -data sets with large numbers of observations, such as the `rna_fc` -data set, overplotting of points can be a limitation of scatter plots. -One strategy for handling such settings is to use hexagonal binning of -observations. The plot space is tessellated into hexagons. Each -hexagon is assigned a color based on the number of observations that -fall within its boundaries. 
+Los diagramas de dispersión pueden ser herramientas exploratorias útiles para conjuntos de datos pequeños. Para +conjuntos de datos con una gran cantidad de observaciones, como el conjunto de datos `rna_fc` +, el trazado excesivo de puntos puede ser una limitación de los diagramas de dispersión. +Una estrategia para manejar tales configuraciones es utilizar agrupación hexagonal de +observaciones. El espacio de la trama está teselado en hexágonos. A cada hexágono +se le asigna un color según el número de observaciones que +caen dentro de sus límites. -- To use hexagonal binning in `ggplot2`, first install the R package - `hexbin` from CRAN and load it. +- Para utilizar la agrupación hexagonal en `ggplot2`, primero instale el paquete R + `hexbin` de CRAN y cárguelo. -- Then use the `geom_hex()` function to produce the hexbin figure. +- Luego use la función `geom_hex()` para producir la figura hexbin. -- What are the relative strengths and weaknesses of a hexagonal bin - plot compared to a scatter plot? Examine the above scatter plot - and compare it with the hexagonal bin plot that you created. +- ¿Cuáles son las fortalezas y debilidades relativas de un diagrama de bin hexagonal + en comparación con un diagrama de dispersión? Examine el diagrama de dispersión anterior + y compárelo con el diagrama de bin hexagonal que creó. 
-::::::::::::::: solution +::::::::::::::: solución -## Solution +## Solución ```{r, eval=FALSE, purl=TRUE} -install.packages("hexbin") +instalar.paquetes("hexbin") ``` ```{r, purl=TRUE} -library("hexbin") +biblioteca("hexbin") -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_hex() + - geom_abline(intercept = 0) + geom_abline(intercepción = 0 ) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -Use what you just learned to create a scatter plot of `expression_log` -over `sample` from the `rna` dataset with the time showing in -different colors. Is this a good way to show this type of data? +Utilice lo que acaba de aprender para crear un diagrama de dispersión de `expression_log` +sobre `sample` del conjunto de datos `rna` con el tiempo mostrado en +colores diferentes. ¿Es esta una buena manera de mostrar este tipo de datos? 
-::::::::::::::: solution +::::::::::::::: solución -## Solution +## Solución ```{r, eval=TRUE, purl=TRUE} -ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_point(aes(color = time)) +ggplot(datos = arn, mapeo = aes(y = expresión_log, x = muestra)) + + geom_point(aes(color = tiempo)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Boxplot +## diagrama de caja -We can use boxplots to visualize the distribution of gene expressions -within each sample: +Podemos usar diagramas de caja para visualizar la distribución de expresiones genéticas +dentro de cada muestra: ```{r boxplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + +ggplot(datos = rna, + mapeo = aes(y = expresión_log, x = muestra)) + geom_boxplot() ``` -By adding points to boxplot, we can have a better idea of the number of -measurements and of their distribution: +Al agregar puntos al diagrama de caja, podemos tener una mejor idea del número de +mediciones y de su distribución: ```{r boxplot-with-points, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") + - geom_boxplot(alpha = 0) +ggplot(datos = rna, + mapeo = aes(y = expresión_log, x = muestra)) + + geom_jitter(alfa = 0.2, color = "tomate") + + geom_boxplot( alfa = 0) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: desafío -## Challenge +## Desafío -Note how the boxplot layer is in front of the jitter layer? What do -you need to change in the code to put the boxplot below the points? +¿Observa cómo la capa del diagrama de caja está delante de la capa de fluctuación? ¿Qué +necesitas cambiar en el código para colocar el diagrama de caja debajo de los puntos? 
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -We should switch the order of these two geoms: +Deberíamos cambiar el orden de estos dos geoms: ```{r boxplot-with-points2, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample)) + geom_boxplot(alpha = 0) + - geom_jitter(alpha = 0.2, color = "tomato") + geom_jitter(alpha = 0.2, color = "tomato") ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -You may notice that the values on the x-axis are still not properly -readable. Let's change the orientation of the labels and adjust them -vertically and horizontally so they don't overlap. You can use a -90-degree angle, or experiment to find the appropriate angle for -diagonally oriented labels: +Puede notar que los valores en el eje x todavía no se pueden leer +correctamente. Cambiemos la orientación de las etiquetas y ajustémoslas +vertical y horizontalmente para que no se superpongan. Puede usar un ángulo de +90 grados, o experimentar para encontrar el ángulo apropiado para +etiquetas orientadas en diagonal: ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") + - geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Add color to the data points on your boxplot according to the duration -of the infection (`time`).
+Agregue color a los puntos de datos en su diagrama de caja de acuerdo con la duración +de la infección (`time`). -_Hint:_ Check the class for `time`. Consider changing the class of -`time` from integer to factor directly in the ggplot mapping. Why does -this change how R makes the graph? +_Pista:_ Verifique la clase de `time`. Considere cambiar la clase de +`time` de entero a factor directamente en el mapeo de ggplot. ¿Por qué +esto cambia la forma en que R hace el gráfico? -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r boxplot-color-time, cache=FALSE, purl=TRUE} # time as integer @@ -498,65 +498,65 @@ ggplot(data = rna, theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Boxplots are useful summaries, but hide the _shape_ of the -distribution. For example, if the distribution is bimodal, we would -not see it in a boxplot. An alternative to the boxplot is the violin -plot, where the shape (of the density of points) is drawn. +Los diagramas de caja son resúmenes útiles, pero ocultan la _forma_ de la +distribución. Por ejemplo, si la distribución es bimodal, +no la veríamos en un diagrama de caja. Una alternativa al diagrama de caja es el diagrama de +violín, donde se dibuja la forma (de la densidad de puntos). -- Replace the box plot with a violin plot; see `geom_violin()`. Fill - in the violins according to the time with the argument `fill`. +- Reemplace el diagrama de caja con un diagrama de violín; vea `geom_violin()`. Rellene + los violines según el tiempo con el argumento `fill`.
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + + mapping = aes(y = expression_log, x = sample)) + geom_violin(aes(fill = as.factor(time))) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -- Modify the violin plot to fill in the violins by `sex`. +- Modifique el diagrama de violín para rellenar los violines según `sex`. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_violin(aes(fill = sex)) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Line plots +## Gráficos de líneas -Let's calculate the mean expression per duration of the infection for -the 10 genes having the highest log fold changes comparing time 8 versus -time 0.
First, we need to select the genes and create a subset of `rna` -called `sub_rna` containing the 10 selected genes, then we need to group -the data and calculate the mean gene expression within each group: +Calculemos la expresión media por duración de la infección para +los 10 genes que tienen los mayores cambios logarítmicos (log fold changes) al comparar el tiempo 8 con el +tiempo 0. Primero, necesitamos seleccionar los genes y crear un subconjunto de `rna` +llamado `sub_rna` que contiene los 10 genes seleccionados; luego necesitamos agrupar +los datos y calcular la expresión génica media dentro de cada grupo: ```{r, purl=TRUE} rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) @@ -573,205 +573,206 @@ mean_exp_by_time <- sub_rna %>% mean_exp_by_time ``` -We can build the line plot with duration of the infection on the x-axis -and the mean expression on the y-axis: +Podemos construir el gráfico de líneas con la duración de la infección en el eje x +y la expresión media en el eje y: ```{r first-time-series, purl=TRUE} -ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + geom_line() ``` -Unfortunately, this does not work because we plotted data for all the -genes together. We need to tell ggplot to draw a line for each gene by -modifying the aesthetic function to include `group = gene`: +Desafortunadamente, esto no funciona porque representamos los datos de todos los genes +juntos.
Necesitamos decirle a ggplot que dibuje una línea para cada gen +modificando la función estética para incluir `group = gene`: ```{r time-series-by-gene, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, group = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + geom_line() ``` -We will be able to distinguish genes in the plot if we add colors (using -`color` also automatically groups the data): +Podremos distinguir los genes en el gráfico si agregamos colores (el uso de +`color` también agrupa automáticamente los datos): ```{r time-series-with-colors, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, color = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() ``` -## Faceting +## Facetado -`ggplot2` has a special technique called _faceting_ that allows the user -to split one plot into multiple (sub) plots based on a factor included -in the dataset. These different subplots inherit the same properties -(axes limits, ticks, ...) to facilitate their direct comparison. We will -use it to make a line plot across time for each gene: +`ggplot2` tiene una técnica especial llamada _facetado_ que permite al usuario +dividir un gráfico en múltiples (sub)gráficos en función de un factor incluido +en el conjunto de datos. Estos subgráficos heredan las mismas propiedades +(límites de ejes, marcas, ...) para facilitar su comparación directa. +Lo usaremos para hacer un gráfico de líneas a lo largo del tiempo para cada gen: ```{r first-facet, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~ gene) +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) ``` -Here both x- and y-axis have the same scale for all the subplots.
You -can change this default behavior by modifying `scales` in order to allow -a free scale for the y-axis: +Aquí, tanto el eje x como el eje y tienen la misma escala para todos los subgráficos. Usted +puede cambiar este comportamiento predeterminado modificando `scales` para permitir +una escala libre para el eje y: ```{r first-facet-scales, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + facet_wrap(~ gene, scales = "free_y") ``` -Now we would like to split the line in each plot by the sex of the mice. -To do that we need to calculate the mean expression in the data frame -grouped by `gene`, `time`, and `sex`: +Ahora nos gustaría dividir la línea en cada gráfico por el sexo de los ratones. +Para hacer eso necesitamos calcular la expresión media en el marco de datos +agrupado por `gene`, `time` y `sex`: ```{r data-facet-by-gene-and-sex, purl=TRUE} mean_exp_by_time_sex <- sub_rna %>% - group_by(gene, time, sex) %>% - summarize(mean_exp = mean(expression_log)) + group_by(gene, time, sex) %>% + summarize(mean_exp = mean(expression_log)) mean_exp_by_time_sex ``` -We can now make the faceted plot by splitting further by sex using -`color` (within a single plot): +Ahora podemos hacer el gráfico facetado dividiéndolo aún más por sexo usando +`color` (dentro de un solo gráfico): ```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + facet_wrap(~ gene, scales = "free_y") ``` -Usually plots with white background look more readable when printed. We -can set the background to white using the function `theme_bw()`.
-Additionally, we can remove the grid: +Por lo general, los gráficos con fondo blanco resultan más legibles cuando se imprimen. +Podemos configurar el fondo en blanco usando la función `theme_bw()`. +Además, podemos eliminar la cuadrícula: ```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + - theme(panel.grid = element_blank()) + theme(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Use what you just learned to create a plot that depicts how the -average expression of each chromosome changes through the duration of -infection. +Utilice lo que acaba de aprender para crear un gráfico que represente cómo cambia la expresión promedio +de cada cromosoma a lo largo de la duración de la +infección. -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución ```{r mean-exp-chromosome-time-series, purl=TRUE} -mean_exp_by_chromosome <- rna %>% +mean_exp_by_chromosome <- rna %>% group_by(chromosome_name, time) %>% - summarize(mean_exp = mean(expression_log)) + summarize(mean_exp = mean(expression_log)) -ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, - y = mean_exp)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + geom_line() + - facet_wrap(~ chromosome_name, scales = "free_y") + facet_wrap(~ chromosome_name, scales = "free_y") ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -The `facet_wrap` geometry extracts plots into an arbitrary number of -dimensions to allow them to cleanly fit on one page.
On the other hand, +La geometría `facet_wrap` extrae gráficos en un número arbitrario de +dimensiones para permitir que quepan limpiamente en una página. On the other hand, the `facet_grid` geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (`rows ~ columns`; a `.` can be used as a placeholder that indicates only one row or column). -Let's modify the previous plot to compare how the mean gene expression -of males and females has changed through time: +Modifiquemos el gráfico anterior para comparar cómo la expresión génica media +de machos y hembras ha cambiado a lo largo del tiempo: ```{r mean-exp-time-facet-sex-rows, purl=TRUE} -# One column, facet by rows -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + +# Una columna, facetas por filas +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + - facet_grid(sex ~ .) + facet_grid(sex ~ .) ``` ```{r mean-exp-time-facet-sex-columns, purl=TRUE} -# One row, facet by column -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + +# Una fila, facetas por columnas +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + - facet_grid(. ~ sex) + facet_grid(. ~ sex) ``` -## `ggplot2` themes +## Temas de `ggplot2` -In addition to `theme_bw()`, which changes the plot background to white, -`ggplot2` comes with several other themes which can be useful to quickly -change the look of your visualization. The complete list of themes is -available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). -`theme_minimal()` and `theme_light()` are popular, and `theme_void()` -can be useful as a starting point to create a new hand-crafted theme.
+Además de `theme_bw()`, que cambia el fondo del gráfico a blanco, +`ggplot2` viene con varios otros temas que pueden ser útiles para +cambiar rápidamente el aspecto de su visualización. La lista completa de temas está +disponible en [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` y `theme_light()` son populares, y `theme_void()` +puede ser útil como punto de partida para crear un nuevo tema hecho a mano. -The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) -package provides a wide variety of options (including an Excel 2003 -theme). The ggplot2 provides a list of +El paquete [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +proporciona una amplia variedad de opciones (incluido un tema de Excel +2003). The ggplot2 provides a list of packages that extend the capabilities of `ggplot2`, including additional themes. -## Customisation +## Personalización -Let's come back to the faceted plot of mean expression by time and gene, -colored by sex. +Volvamos al gráfico facetado de expresión media por tiempo y gen, +coloreado por sexo. -Take a look at the ggplot2, -and think of ways you could improve the plot. +Eche un vistazo a la hoja de trucos de ggplot2 +y piense en formas en que podría mejorar el gráfico.
-Now, we can change names of axes to something more informative than -'time' and 'mean\_exp', and add a title to the figure: +Ahora, podemos cambiar los nombres de los ejes a algo más informativo que +'time' y 'mean\_exp', y agregar un título a la figura: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + labs(title = "Expresión génica media por duración de la infección", + x = "Duración de la infección (en días)", + y = "Expresión génica media") ``` -The axes have more informative names, but their readability can be -improved by increasing the font size: +Los ejes tienen nombres más informativos, pero su legibilidad se puede mejorar +aumentando el tamaño de fuente: ```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + - theme(text = element_text(size = 16)) + labs(title = "Expresión génica media por duración de la infección", + x = "Duración de la infección (en días)", + y = "Expresión génica media") + + theme(text = element_text(size = 16)) ``` -Note that it is also possible to change the fonts of your plots.
If you -are on Windows, you may have to install the . +Tenga en cuenta que también es posible cambiar las fuentes de sus gráficos. Si +está en Windows, es posible que tenga que instalar el paquete +[**`extrafont`**](https://cran.r-project.org/web/packages/extrafont/index.html). -We can further customize the color of x- and y-axis text, the color of -the grid, etc. We can also for example move the legend to the top by -setting `legend.position` to `"top"`. +Podemos personalizar aún más el color del texto de los ejes x e y, el color de +la cuadrícula, etc. También podemos, por ejemplo, mover la leyenda a la parte superior +configurando `legend.position` en `"top"`. ```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -790,10 +791,10 @@ ggplot(data = mean_exp_by_time_sex, legend.position = "top") ``` -If you like the changes you created better than the default theme, you -can save them as an object to be able to easily apply them to other -plots you may create. Here is an example with the histogram we have -previously created. +Si le gustan más los cambios que creó que el tema predeterminado, +puede guardarlos como un objeto para poder aplicarlos fácilmente a otros +gráficos que pueda crear. Aquí hay un ejemplo con el histograma que hemos +creado previamente. ```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", @@ -808,39 +809,40 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -With all of this information in hand, please take another five minutes -to either improve one of the plots generated in this exercise or -create a beautiful graph of your own. Use the RStudio ggplot2 -for inspiration.
Here are some ideas: +Con toda esta información en la mano, tómese otros cinco minutos +para mejorar uno de los gráficos generados en este ejercicio o +para crear un hermoso gráfico propio. Utilice la hoja de trucos +de ggplot2 de RStudio +para inspirarse. Aquí hay algunas ideas: -- See if you can change the thickness of the lines. -- Can you find a way to change the name of the legend? What about - its labels? (hint: look for a ggplot function starting with +- Vea si puede cambiar el grosor de las líneas. +- ¿Puede encontrar una manera de cambiar el nombre de la leyenda? ¿Qué pasa con + sus etiquetas? (pista: busque una función ggplot que comience con `scale_`) -- Try using a different color palette or manually specifying the - colors for the lines (see +- Intente usar una paleta de colores diferente o especifique manualmente los + colores para las líneas (consulte [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). -::::::::::::::: solution +::::::::::::::: solution -## Solution +## Solución -For example, based on this plot: +Por ejemplo, partiendo de este gráfico: ```{r, purl=TRUE} ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + - theme(panel.grid = element_blank()) + theme(panel.grid = element_blank()) ``` -We can customize it the following ways: +Podemos personalizarlo de las siguientes maneras: ```{r, purl=TRUE} # change the thickness of the lines @@ -881,74 +883,74 @@ ggplot(data = mean_exp_by_time_sex, ``` -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Composing plots +## Componer gráficos Faceting is a great tool for splitting one plot into multiple subplots, but sometimes you
may want to produce a single figure that contains multiple independent plots, i.e. plots that are based on different variables or even different data frames. -Let's start by creating the two plots that we want to arrange next to -each other: +Comencemos creando los dos gráficos que queremos colocar uno al lado +del otro: -The first graph counts the number of unique genes per chromosome. We -first need to reorder the levels of `chromosome_name` and filter the -unique genes per chromosome. We also change the scale of the y-axis to a -log10 scale for better readability. +El primer gráfico cuenta el número de genes únicos por cromosoma. +Primero necesitamos reordenar los niveles de `chromosome_name` y filtrar los +genes únicos por cromosoma. También cambiamos la escala del eje y a una escala +log10 para una mejor legibilidad. ```{r sub1, purl=TRUE} -rna$chromosome_name <- factor(rna$chromosome_name, - levels = c(1:19,"X","Y")) - -count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% - distinct() %>% ggplot() + - geom_bar(aes(x = chromosome_name), fill = "seagreen", - position = "dodge", stat = "count") + - labs(y = "log10(n genes)", x = "chromosome") + +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) + +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% + distinct() %>% ggplot() + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "cromosoma") + scale_y_log10() -count_gene_chromosome +count_gene_chromosome ``` -Below, we also remove the legend altogether by setting the -`legend.position` to `"none"`. +A continuación, también eliminamos la leyenda por completo estableciendo +`legend.position` en `"none"`.
```{r sub2, purl=TRUE} exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), color=sex)) + geom_boxplot(alpha = 0) + - labs(y = "Mean gene exp", - x = "time") + theme(legend.position = "none") + labs(y = "Expresión génica media", + x = "tiempo") + theme(legend.position = "none") exp_boxplot_sex ``` -The [**patchwork**](https://github.com/thomasp85/patchwork) package -provides an elegant approach to combining figures using the `+` to -arrange figures (typically side by side). More specifically the `|` -explicitly arranges them side by side and `/` stacks them on top of each -other. +El paquete [**patchwork**](https://github.com/thomasp85/patchwork) +proporciona un enfoque elegante para combinar figuras usando `+` para +organizar figuras (normalmente una al lado de la otra). Más específicamente, `|` +las organiza explícitamente una al lado de la otra y `/` las apila una encima de la +otra. ```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} -install.packages("patchwork") +install.packages("patchwork") ``` ```{r patchworkplot1, purl=TRUE} -library("patchwork") +library("patchwork") count_gene_chromosome + exp_boxplot_sex -## or count_gene_chromosome | exp_boxplot_sex +## o count_gene_chromosome | exp_boxplot_sex ``` ```{r patchwork2, purl=TRUE} count_gene_chromosome / exp_boxplot_sex ``` -We can combine further control the layout of the final composition with -`plot_layout` to create more complex layouts: +Podemos controlar aún más el diseño de la composición final con +`plot_layout` para crear diseños más complejos: ```{r patchwork3, purl=TRUE} count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) @@ -961,7 +963,7 @@ count_gene_chromosome + plot_layout(ncol = 1) ``` -The last plot can also be created using the `|` and `/` composers: +El último gráfico también se puede crear usando los operadores de composición `|` y `/`: ```{r patchwork5, purl=TRUE} count_gene_chromosome / @@ -969,38 +971,38 @@ count_gene_chromosome /
exp_boxplot_sex ``` -Learn more about `patchwork` on its -[webpage](https://patchwork.data-imaginist.com/) or in this -[video](https://www.youtube.com/watch?v=0m4yywqNPVY). +Obtenga más información sobre `patchwork` en su +[página web](https://patchwork.data-imaginist.com/) o en este +[video](https://www.youtube.com/watch?v=0m4yywqNPVY). -Another option is the **`gridExtra`** package that allows to combine -separate ggplots into a single figure using `grid.arrange()`: +Otra opción es el paquete **`gridExtra`**, que permite combinar +ggplots separados en una sola figura usando `grid.arrange()`: ```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} -install.packages("gridExtra") +install.packages("gridExtra") ``` ```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} -library("gridExtra") +library("gridExtra") grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) ``` -In addition to the `ncol` and `nrow` arguments, used to make simple -arrangements, there are tools for constructing more complex -layouts. +Además de los argumentos `ncol` y `nrow`, utilizados para hacer arreglos +simples, existen herramientas para [construir diseños +más complejos](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html). -## Exporting plots +## Exportar gráficos -After creating your plot, you can save it to a file in your favorite -format. The Export tab in the **Plot** pane in RStudio will save your -plots at low resolution, which will not be accepted by many journals and -will not scale well for posters. +Después de crear su gráfico, puede guardarlo en un archivo en su formato +favorito. La pestaña Export del panel **Plot** de RStudio guardará sus +gráficos en baja resolución, lo que no será aceptado por muchas revistas y +no se escalará bien para los carteles.
-Instead, use the `ggsave()` function, which allows you easily change the -dimension and resolution of your plot by adjusting the appropriate -arguments (`width`, `height` and `dpi`). +En su lugar, use la función `ggsave()`, que le permite cambiar fácilmente la dimensión +y la resolución de su gráfico ajustando los argumentos +apropiados (`width`, `height` y `dpi`). -Make sure you have the `fig_output/` folder in your working directory. +Asegúrese de tener la carpeta `fig_output/` en su directorio de trabajo. ```{r ggsave-example, eval=FALSE, purl=TRUE} my_plot <- ggplot(data = mean_exp_by_time_sex, @@ -1027,80 +1029,79 @@ ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, width = 10, dpi = 300) ``` -Note: The parameters `width` and `height` also determine the font size -in the saved plot. +Nota: Los parámetros `width` y `height` también determinan el tamaño de fuente +en el gráfico guardado. ```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} -### Final plotting challenge: -## With all of this information in hand, please take another five -## minutes to either improve one of the plots generated in this -## exercise or create a beautiful graph of your own. Use the RStudio -## ggplot2 cheat sheet for inspiration: -## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +### Desafío final de gráficos: +## Con toda esta información en la mano, tómese otros cinco +## minutos para mejorar uno de los gráficos generados en este +## ejercicio o crear un hermoso gráfico propio. Utilice la hoja de trucos de +## ggplot2 de RStudio para inspirarse: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf ``` -## Other packages for visualisation +## Otros paquetes para visualización -`ggplot2` is a very powerful package that fits very nicely in our _tidy -data_ and _tidy tools_ pipeline. There are other visualization packages -in R that shouldn't be ignored.
+`ggplot2` es un paquete muy poderoso que encaja muy bien en nuestro flujo de trabajo de _datos tidy_ y _herramientas tidy_. Hay otros paquetes de visualización + en R que no deben ignorarse. -### Base graphics +### Gráficos base -The default graphics system that comes with R, often called _base R -graphics_ is simple and fast. It is based on the _painter's or canvas -model_, where different output are directly overlaid on top of each -other (see figure @ref(fig:paintermodel)). This is a fundamental +El sistema de gráficos predeterminado que viene con R, a menudo llamado _gráficos base de +R_, es simple y rápido. Se basa en el _modelo de pintor o de +lienzo_, donde diferentes resultados se superponen directamente uno encima del otro +(consulte la figura @ref(fig:paintermodel)). This is a fundamental difference with `ggplot2` (and with `lattice`, described below), that returns dedicated objects, that are rendered on screen or in a file, and that can even be updated. ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "First layer, produced with plot(1:20)") +plot(1:20, main = "Primera capa, producida con plot(1:20)") -plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +plot(1:20, main = "Una línea roja horizontal, agregada con abline(h = 10)") abline(h = 10, col = "red") -plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +plot(1:20, main = "Un rectángulo, agregado con rect(5, 5, 15, 15)") abline(h = 10, col = "red") -rect(5, 5, 15, 15, lwd = 3) +rect(5, 5, 15, 15, lwd = 3) ``` -Another main difference is that base graphics' plotting function try to -do _the right_ thing based on their input type, i.e. they will adapt -their behaviour based on the class of their input. This is again very -different from what we have in `ggplot2`, that only accepts dataframes -as input, and that requires plots to be constructed bit by bit.
+Otra diferencia principal es que las funciones de trazado de los gráficos base intentan +hacer _lo correcto_ según su tipo de entrada, es decir, adaptan +su comportamiento según la clase de su entrada. De nuevo, esto es muy +diferente de lo que tenemos en `ggplot2`, que solo acepta marcos de datos +como entrada y que requiere que los gráficos se construyan poco a poco. ```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right)."} par(mfrow = c(2, 2)) boxplot(rnorm(100), - main = "Boxplot of rnorm(100)") -boxplot(matrix(rnorm(100), ncol = 10), - main = "Boxplot of matrix(rnorm(100), ncol = 10)") + main = "Diagrama de caja de rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Diagrama de caja de matrix(rnorm(100), ncol = 10)") hist(rnorm(100)) -hist(matrix(rnorm(100), ncol = 10)) +hist(matrix(rnorm(100), ncol = 10)) ``` -The out-of-the-box approach in base graphics can be very efficient for -simple, standard figures, that can be produced very quickly with a -single line of code and a single function such as `plot`, or `hist`, or -`boxplot`, ... The defaults are however not always the most appealing -and tuning of figures, especially when they become more complex (for -example to produce facets), can become lengthy and cumbersome. +El enfoque listo para usar en gráficos base puede ser muy eficiente para +figuras simples y estándar, que se pueden producir muy rápidamente con una sola línea de código +y una sola función, como `plot`, o `hist`, o +`boxplot`, ... Sin embargo, los valores predeterminados no siempre son los más atractivos +y el ajuste de las figuras, especialmente cuando se vuelven más complejas (por ejemplo, +para producir facetas), puede volverse largo y engorroso. -### The lattice package +### El paquete lattice -The **`lattice`** package is similar to `ggplot2` in that is uses -dataframes as input, returns graphical objects and supports faceting.
-`lattice` however isn't based on the grammar of graphics and has a more -convoluted interface. +El paquete **`lattice`** es similar a `ggplot2` en el sentido de que utiliza +marcos de datos como entrada, devuelve objetos gráficos y admite facetado. +Sin embargo, `lattice` no se basa en la gramática de los gráficos y tiene una interfaz más +complicada. -A good reference for the `lattice` package is @latticebook. +Una buena referencia para el paquete `lattice` es @latticebook. -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: keypoints -- Visualization in R +- Visualización en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From c12165241d45f52961132c1ecbd1738565023e0b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:34 +0900 Subject: [PATCH 174/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 370 +++++++++++++-------------- 1 file changed, 184 insertions(+), 186 deletions(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index fce5527bf..f190dc41a 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Next steps +title: Próximos pasos teaching: 45 exercises: 45 --- @@ -10,86 +10,86 @@ exercises: 45 ::::::::::::::::::::::::::::::::::::::: objetivos -- Introduce the Bioconductor project. -- Introduce the notion of data containers. -- Give an overview of the `SummarizedExperiment`, extensively used in - omics analyses. +- Presentar el proyecto Bioconductor. +- Introducir la noción de contenedores de datos. +- Ofrecer una descripción general de `SummarizedExperiment`, ampliamente utilizado en + análisis ómicos. :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a `SummarizedExperiment`?
-- What is Bioconductor? +- ¿Qué es un `SummarizedExperiment`? +- ¿Qué es Bioconductor? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Next steps +## Próximos pasos ```{r, echo=FALSE, message=FALSE} -library("tidyverse") +library("tidyverse") ``` -Data in bioinformatics is often complex. To deal with this, -developers define specialised data containers (termed classes) that -match the properties of the data they need to handle. +Los datos en bioinformática suelen ser complejos. Para solucionar esto, los desarrolladores +definen contenedores de datos especializados (denominados clases) que +coinciden con las propiedades de los datos que necesitan manejar. -This aspect is central to the **Bioconductor**[^Bioconductor] project -which uses the same **core data infrastructure** across packages. This -certainly contributed to Bioconductor's success. Bioconductor package -developers are advised to make use of existing infrastructure to -provide coherence, interoperability, and stability to the project as a -whole. +Este aspecto es fundamental para el proyecto **Bioconductor**[^Bioconductor], +que utiliza la misma **infraestructura de datos central** en todos los paquetes. Esto +ciertamente contribuyó al éxito de Bioconductor. Se recomienda a los desarrolladores de paquetes de Bioconductor +que utilicen la infraestructura existente para +proporcionar coherencia, interoperabilidad y estabilidad al proyecto en su +conjunto. -[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was - initiated by Robert Gentleman, one of the two creators of the R - language. Bioconductor provides tools dedicated to omics data - analysis. Bioconductor uses the R statistical programming language - and is open source and open development. +[^Bioconductor]: El proyecto [Bioconductor](https://www.bioconductor.org) fue + iniciado por Robert Gentleman, uno de los dos creadores del + lenguaje R.
Bioconductor proporciona herramientas dedicadas al análisis de datos + ómicos. Bioconductor utiliza el lenguaje de programación estadística R + y es de código abierto y desarrollo abierto. -To illustrate such an omics data container, we'll present the -`SummarizedExperiment` class. +Para ilustrar dicho contenedor de datos ómicos, presentaremos la clase +`SummarizedExperiment`. -## SummarizedExperiment +## SummarizedExperiment -The figure below represents the anatomy of the SummarizedExperiment class. +La siguiente figura representa la anatomía de la clase SummarizedExperiment. ```{r SE, echo=FALSE, out.width="80%"} knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") ``` -Objects of the class SummarizedExperiment contain : +Los objetos de la clase SummarizedExperiment contienen: -- **One (or more) assay(s)** containing the quantitative omics data - (expression data), stored as a matrix-like object. Features (genes, - transcripts, proteins, ...) are defined along the rows, and samples - along the columns. +- **Uno (o más) ensayos** que contienen los datos ómicos cuantitativos + (datos de expresión), almacenados como un objeto similar a una matriz. Las características (genes, + transcritos, proteínas, ...) se definen a lo largo de las filas y las muestras + a lo largo de las columnas. -- A **sample metadata** slot containing sample co-variates, stored as a - data frame. Rows from this table represent samples (rows match exactly the - columns of the expression data). +- Una ranura de **metadatos de muestra** que contiene covariables de muestra, almacenada como un marco de + datos. Las filas de esta tabla representan muestras (las filas coinciden exactamente con las + columnas de los datos de expresión). -- A **feature metadata** slot containing feature co-variates, stored as - a data frame. The rows of this data frame match exactly the rows of the - expression data.
+- Una ranura de **metadatos de características** que contiene covariables de características, almacenada como + un marco de datos. Las filas de este marco de datos coinciden exactamente con las filas de los datos de + expresión. -The coordinated nature of the `SummarizedExperiment` guarantees that -during data manipulation, the dimensions of the different slots will -always match (i.e the columns in the expression data and then rows in -the sample metadata, as well as the rows in the expression data and -feature metadata) during data manipulation. For example, if we had to -exclude one sample from the assay, it would be automatically removed -from the sample metadata in the same operation. +La naturaleza coordinada de `SummarizedExperiment` garantiza que, +durante la manipulación de datos, las dimensiones de las diferentes ranuras +siempre coincidirán (es decir, las columnas en los datos de expresión y las filas en +los metadatos de muestra, así como las filas en los datos de expresión y +en los metadatos de características). Por ejemplo, si tuviéramos que +excluir una muestra del ensayo, se eliminaría automáticamente +de los metadatos de la muestra en la misma operación. -The metadata slots can grow additional co-variates -(columns) without affecting the other structures. +Las ranuras de metadatos pueden incorporar covariables adicionales +(columnas) sin afectar las otras estructuras. -### Creating a SummarizedExperiment +### Crear un SummarizedExperiment -In order to create a `SummarizedExperiment`, we will create the -individual components, i.e the count matrix, the sample and gene -metadata from csv files. These are typically how RNA-Seq data are -provided (after raw data have been processed). +Para crear un `SummarizedExperiment`, crearemos los +componentes individuales, es decir, la matriz de recuento y los metadatos de +muestras y genes, a partir de archivos csv.
Por lo general, así es como se proporcionan +los datos de RNA-Seq (una vez procesados los datos brutos). ```{r, echo=FALSE, message=FALSE} rna <- read_csv("data/rnaseq.csv") @@ -126,22 +126,22 @@ write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) ``` -- **An expression matrix**: we load the count matrix, specifying that - the first columns contains row/gene names, and convert the - `data.frame` to a `matrix`. You can download it - [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). +- **Una matriz de expresión**: cargamos la matriz de recuento, especificando que + la primera columna contiene los nombres de filas/genes, y convertimos el + `data.frame` en una `matrix`. Puede descargarla + [aquí](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). ```{r} count_matrix <- read.csv("data/count_matrix.csv", - row.names = 1) %>% + row.names = 1) %>% as.matrix() count_matrix[1:5, ] -dim(count_matrix) +dim(count_matrix) ``` -- **A table describing the samples**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). +- **Una tabla que describe las muestras**, disponible + [aquí](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). ```{r} sample_metadata <- read.csv("data/sample_metadata.csv") @@ -149,8 +149,8 @@ sample_metadata dim(sample_metadata) ``` -- **A table describing the genes**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). +- **Una tabla que describe los genes**, disponible + [aquí](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv).
```{r} gene_metadata <- read.csv("data/gene_metadata.csv") @@ -158,27 +158,27 @@ gene_metadata[1:10, 1:4] dim(gene_metadata) ``` -We will create a `SummarizedExperiment` from these tables: +Crearemos un `SummarizedExperiment` a partir de estas tablas: -- The count matrix that will be used as the **`assay`** +- La matriz de recuento que se utilizará como **`assay`** -- The table describing the samples will be used as the **sample - metadata** slot +- La tabla que describe las muestras se utilizará como ranura de **metadatos + de muestra** -- The table describing the genes will be used as the **features - metadata** slot +- La tabla que describe los genes se utilizará como ranura de **metadatos + de características** -To do this we can put the different parts together using the -`SummarizedExperiment` constructor: +Para hacer esto podemos juntar las diferentes partes usando el constructor +`SummarizedExperiment`: ```{r, message=FALSE, warning=FALSE} -## BiocManager::install("SummarizedExperiment") -library("SummarizedExperiment") +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") ``` -First, we make sure that the samples are in the same order in the -count matrix and the sample annotation, and the same for the genes in -the count matrix and the gene annotation. +Primero, nos aseguramos de que las muestras estén en el mismo orden en la matriz de recuento +y la anotación de muestra, y lo mismo para los genes en +la matriz de recuento y la anotación de genes.
```{r} stopifnot(rownames(count_matrix) == gene_metadata$gene) @@ -186,26 +186,26 @@ stopifnot(colnames(count_matrix) == sample_metadata$sample) ``` ```{r} -se <- SummarizedExperiment(assays = list(counts = count_matrix), - colData = sample_metadata, - rowData = gene_metadata) +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) se ``` -### Saving data +### Guardar datos -Exporting data to a spreadsheet, as we did in a previous episode, has -several limitations, such as those described in the first chapter -(possible inconsistencies with `,` and `.` for decimal separators and -lack of variable type definitions). Furthermore, exporting data to a -spreadsheet is only relevant for rectangular data such as dataframes -and matrices. +Exportar datos a una hoja de cálculo, como hicimos en un episodio anterior, tiene +varias limitaciones, como las descritas en el primer capítulo +(posibles inconsistencias con `,` y `.` para los separadores decimales y +falta de definiciones de tipos de variables). Además, exportar datos a una hoja de cálculo +solo es relevante para datos rectangulares como marcos de datos +y matrices. -A more general way to save data, that is specific to R and is -guaranteed to work on any operating system, is to use the `saveRDS` -function. Saving objects like this will generate a binary -representation on disk (using the `rds` file extension here), which -can be loaded back into R using the `readRDS` function. +Una forma más general de guardar datos, que es específica de R y +tiene garantizado su funcionamiento en cualquier sistema operativo, es utilizar la función +`saveRDS`. Guardar objetos de este modo generará una representación binaria +en el disco (usando la extensión de archivo `rds` aquí), que +se puede volver a cargar en R usando la función `readRDS`.
```{r, eval=FALSE} saveRDS(se, file = "data_output/se.rds") @@ -214,41 +214,41 @@ se <- readRDS("data_output/se.rds") head(se) ``` -To conclude, when it comes to saving data from R that will be loaded -again in R, saving and loading with `saveRDS` and `readRDS` is the -preferred approach. If tabular data need to be shared with somebody -that is not using R, then exporting to a text-based spreadsheet is a -good alternative. +Para concluir, cuando se trata de guardar datos de R que se cargarán +nuevamente en R, guardar y cargar con `saveRDS` y `readRDS` es el enfoque +preferido. Si es necesario compartir datos tabulares con alguien +que no esté usando R, entonces exportarlos a una hoja de cálculo basada en texto es una +buena alternativa. -Using this data structure, we can access the expression matrix with -the `assay` function: +Usando esta estructura de datos, podemos acceder a la matriz de expresión con +la función `assay`: ```{r} -head(assay(se)) -dim(assay(se)) +head(assay(se)) +dim(assay(se)) ``` -We can access the sample metadata using the `colData` function: +Podemos acceder a los metadatos de muestra usando la función `colData`: ```{r} colData(se) dim(colData(se)) ``` -We can also access the feature metadata using the `rowData` function: +También podemos acceder a los metadatos de las características usando la función `rowData`: ```{r} -head(rowData(se)) -dim(rowData(se)) +head(rowData(se)) +dim(rowData(se)) ``` -### Subsetting a SummarizedExperiment +### Subconjuntos de un SummarizedExperiment -SummarizedExperiment can be subset just like with data frames, with -numerics or with characters of logicals. +Un SummarizedExperiment se puede indexar igual que un marco de datos, con +valores numéricos, caracteres o lógicos. -Below, we create a new instance of class SummarizedExperiment that -contains only the 5 first features for the 3 first samples.
+A continuación, creamos una nueva instancia de la clase SummarizedExperiment que +contiene solo las 5 primeras características para las 3 primeras muestras. ```{r} se1 <- se[1:5, 1:3] @@ -257,13 +257,13 @@ se1 ```{r} colData(se1) -rowData(se1) +rowData(se1) ``` -We can also use the `colData()` function to subset on something from -the sample metadata or the `rowData()` to subset on something from the -feature metadata. For example, here we keep only miRNAs and the non -infected samples: +También podemos usar la función `colData()` para crear un subconjunto a partir de +los metadatos de muestra, o `rowData()` para hacerlo a partir de los metadatos de +características. Por ejemplo, aquí conservamos solo los miARN y las muestras no +infectadas: ```{r} se1 <- se[rowData(se)$gene_biotype == "miRNA", @@ -288,16 +288,16 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Extract the gene expression levels of the 3 first genes in samples -at time 0 and at time 8. +Extraiga los niveles de expresión génica de los 3 primeros genes en muestras +en el tiempo 0 y en el tiempo 8. ::::::::::::::: solution -## Solution +## Solución ```{r, purl=FALSE} assay(se)[1:3, colData(se)$time != 4] @@ -312,123 +312,123 @@ assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Desafío -Verify that you get the same values using the long `rna` table.
::::::::::::::: solution -## Solution +## Solución ```{r, purl=FALSE} rna |> - filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> - filter(time != 4) |> select(expression) + filtro(gen %in% c("Asl", "Apod", "Cyd2d22")) |> + filtro(tiempo!= 4) |> seleccionar(expresión ) ``` ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -The long table and the `SummarizedExperiment` contain the same -information, but are simply structured differently. Each approach has its -own advantages: the former is a good fit for the `tidyverse` packages, -while the latter is the preferred structure for many bioinformatics and -statistical processing steps. For example, a typical RNA-Seq analyses using -the `DESeq2` package. +La tabla larga y el `Experimento resumido` contienen la misma información +, pero simplemente están estructurados de manera diferente. Cada enfoque tiene sus +propias ventajas: el primero es una buena opción para los paquetes `tidyverse`, +mientras que el segundo es la estructura preferida para muchos pasos de procesamiento bioinformático y +estadístico. Por ejemplo, un RNA-Seq típico analiza usando +el paquete `DESeq2`. -#### Adding variables to metadata +#### Agregar variables a los metadatos -We can also add information to the metadata. -Suppose that you want to add the center where the samples were collected... +También podemos agregar información a los metadatos. +Supongamos que desea agregar el centro donde se recolectaron las muestras... ```{r} -colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se)$center <- rep("Universidad de Illinois", nrow(colData(se))) colData(se) ``` -This illustrates that the metadata slots can grow indefinitely without -affecting the other structures! +¡Esto ilustra que los espacios de metadatos pueden crecer indefinidamente sin que +afecte a las otras estructuras! 
-### tidySummarizedExperiment +### ordenadoResumidoExperimento -You may be wondering, can we use tidyverse commands to interact with -`SummarizedExperiment` objects? The answer is yes, we can with the -`tidySummarizedExperiment` package. +Quizás se pregunte: ¿podemos usar los comandos de tidyverse para interactuar con +objetos `SummarizedExperiment`? La respuesta es sí, podemos hacerlo con el paquete +`tidySummarizedExperiment`. -Remember what our SummarizedExperiment object looks like: +Recuerde cómo se ve nuestro objeto SummarizedExperiment: ```{r, message=FALSE} -se +sí ``` -Load `tidySummarizedExperiment` and then take a look at the se object -again. +Cargue `tidySummarizedExperiment` y luego eche un vistazo al objeto se +nuevamente. ```{r, message=FALSE} #BiocManager::install("tidySummarizedExperiment") -library("tidySummarizedExperiment") +biblioteca("tidySummarizedExperiment") se ``` -It's still a `SummarizedExperiment` object, so maintains the efficient -structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`\-`tibble` -abstraction. We can also see in the second line of the output the -number of transcripts and samples. +Sigue siendo un objeto `SummarizedExperiment`, por lo que mantiene la estructura eficiente +, pero ahora podemos verlo como un tibble. Tenga en cuenta que la primera línea de +el resultado dice esto, es una abstracción `SummarizedExperiment`\-`tibble` +. También podemos ver en la segunda línea del resultado el +número de transcripciones y muestras. -If we want to revert to the standard `SummarizedExperiment` view, we -can do that. +Si queremos volver a la vista estándar `Experimento resumido`, +podemos hacerlo. ```{r} -options("restore_SummarizedExperiment_show" = TRUE) +opciones ("restore_SummarizedExperiment_show" = VERDADERO) se ``` -But here we use the tibble view. +Pero aquí usamos la vista tibble. 
```{r} -options("restore_SummarizedExperiment_show" = FALSE) +options("restore_SummarizedExperiment_show" = FALSE) se ``` -We can now use tidyverse commands to interact with the -`SummarizedExperiment` object. +Ahora podemos usar los comandos de tidyverse para interactuar con el objeto +`SummarizedExperiment`. -We can use `filter` to filter for rows using a condition e.g. to view -all rows for one sample. +Podemos usar `filter` para filtrar filas usando una condición, por ejemplo, para ver +todas las filas de una muestra. ```{r} -se %>% filter(.sample == "GSM2545336") +se %>% filter(.sample == "GSM2545336") ``` -We can use `select` to specify columns we want to view. +Podemos usar `select` para especificar las columnas que queremos ver. ```{r} -se %>% select(.sample) +se %>% select(.sample) ``` -We can use `mutate` to add metadata info. +Podemos usar `mutate` para agregar información de metadatos. ```{r} -se %>% mutate(center = "Heidelberg University") +se %>% mutate(center = "Universidad de Heidelberg") ``` -We can also combine commands with the tidyverse pipe `%>%`. For -example, we could combine `group_by` and `summarise` to get the total -counts for each sample. +También podemos combinar comandos con el operador de canalización de tidyverse `%>%`. Por ejemplo, +podríamos combinar `group_by` y `summarise` para obtener los recuentos totales +de cada muestra. ```{r} se %>% group_by(.sample) %>% - summarise(total_counts=sum(counts)) + summarise(total_counts=sum(counts)) ``` -We can treat the tidy SummarizedExperiment object as a normal tibble -for plotting. +Podemos tratar el objeto SummarizedExperiment ordenado como un tibble normal +para trazar. -Here we plot the distribution of counts per sample. +Aquí trazamos la distribución de recuentos por muestra. ```{r tidySE-plot} se %>% @@ -438,27 +438,25 @@ se %>% theme_bw() ``` -For more information on tidySummarizedExperiment, see the package -website -[here](https://stemangiola.github.io/tidySummarizedExperiment/).
+Para obtener más información sobre tidySummarizedExperiment, consulte el sitio web del paquete [aquí](https://stemangiola.github.io/tidySummarizedExperiment/). -**Take-home message** +**Mensaje para llevar a casa** -- `SummarizedExperiment` represents an efficient way to store and - handle omics data. +- `SummarizedExperiment` representa una forma eficiente de almacenar y + manejar datos ómicos. -- They are used in many Bioconductor packages. +- Se utiliza en muchos paquetes de Bioconductor. -If you follow the next training focused on RNA sequencing analysis, -you will learn to use the Bioconductor `DESeq2` package to do some -differential expression analyses. The whole analysis of the `DESeq2` -package is handled in a `SummarizedExperiment`. +Si sigue la próxima formación centrada en el análisis de secuenciación de ARN, +aprenderá a utilizar el paquete `DESeq2` de Bioconductor para realizar algunos +análisis de expresión diferencial. Todo el análisis del paquete `DESeq2` +se maneja en un `SummarizedExperiment`. :::::::::::::::::::::::::::::::::::::::: keypoints -- Bioconductor is a project provide support and packages for the - comprehension of high high-throughput biology data. -- A `SummarizedExperiment` is a type of object useful to store and - manage high-throughput omics data. +- Bioconductor es un proyecto que proporciona soporte y paquetes para la + comprensión de datos biológicos de alto rendimiento. +- Un `SummarizedExperiment` es un tipo de objeto útil para almacenar y + administrar datos ómicos de alto rendimiento.
:::::::::::::::::::::::::::::::::::::::::::::::::: From a973e4ce5f589174c4e0c5eb3db31c27a840880e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:40 +0900 Subject: [PATCH 175/334] New translations instructor-notes.md (Spanish) --- locale/es/instructors/instructor-notes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/es/instructors/instructor-notes.md b/locale/es/instructors/instructor-notes.md index a5ec5a2dc..4fe151fb4 100644 --- a/locale/es/instructors/instructor-notes.md +++ b/locale/es/instructors/instructor-notes.md @@ -1,5 +1,5 @@ --- -title: Instructor Notes +title: Notas del instructor --- -FIXME +ARREGLARME From cf48278d2c905bc5c31f3bed49ac52bd74925bf3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:44 +0900 Subject: [PATCH 176/334] New translations discuss.md (Spanish) --- locale/es/learners/discuss.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/es/learners/discuss.md b/locale/es/learners/discuss.md index 405883d41..0f51580dd 100644 --- a/locale/es/learners/discuss.md +++ b/locale/es/learners/discuss.md @@ -1,5 +1,5 @@ --- -title: Discussion +title: Discusión --- -FIXME +ARREGLARME From 362faf95e201d3811373402809c6e332f01c44c9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:49 +0900 Subject: [PATCH 177/334] New translations reference.md (Spanish) --- locale/es/learners/reference.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/es/learners/reference.md b/locale/es/learners/reference.md index 91bab9733..69f2a4021 100644 --- a/locale/es/learners/reference.md +++ b/locale/es/learners/reference.md @@ -2,6 +2,6 @@ {} --- -## Glossary +## Glosario -FIXME +ARREGLARME From a066e5b0ba9d7065e57da51e5f7b62cac2623abe Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:53 +0900 
Subject: [PATCH 178/334] New translations setup.md (Spanish) --- locale/es/learners/setup.md | 194 ++++++++++++++++++------------------ 1 file changed, 97 insertions(+), 97 deletions(-) diff --git a/locale/es/learners/setup.md b/locale/es/learners/setup.md index 2c33990f5..59eb9acdb 100644 --- a/locale/es/learners/setup.md +++ b/locale/es/learners/setup.md @@ -1,158 +1,158 @@ --- -title: Setup +title: Configuración --- -- Please make sure you have a spreadsheet editor at hand, such as - LibreOffice, Microsoft Excel or Google Sheets. +- Asegúrese de tener a mano un editor de hojas de cálculo, como + LibreOffice, Microsoft Excel o Google Sheets. -- Install R, RStudio and packages (see below). +- Instale R, RStudio y paquetes (ver más abajo). -### R and RStudio +### R y RStudio -- R and RStudio are separate downloads and installations. R is the - underlying statistical computing environment, but using R alone is - no fun. RStudio is a graphical integrated development environment - (IDE) that makes using R much easier and more interactive. You need - to install R before you install RStudio. After installing both - programs, you will need to install some specific R packages within - RStudio. Follow the instructions below for your operating system, - and then follow the instructions to install packages. +- R y RStudio son descargas e instalaciones independientes. R es el + entorno informático estadístico subyacente, pero usar R solo + no es divertido. RStudio es un entorno de desarrollo integrado gráfico + (IDE) que hace que el uso de R sea mucho más fácil e interactivo. Debe + instalar R antes de instalar RStudio. Después de instalar ambos + programas, deberá instalar algunos paquetes de R específicos dentro de + RStudio. Siga las instrucciones a continuación para su sistema operativo, + y luego siga las instrucciones para instalar paquetes.
-### You are running Windows +### Está ejecutando Windows <br> -::::::::::::::: solution +::::::::::::::: solution -## If you already have R and RStudio installed +## Si ya tiene R y RStudio instalados -- Open RStudio, and click on "Help" > "Check for updates". If a new version is - available, quit RStudio, and download the latest version for RStudio. +- Abra RStudio y haga clic en "Ayuda" > "Buscar actualizaciones". Si hay una nueva versión + disponible, salga de RStudio y descargue la última versión de RStudio. -- To check which version of R you are using, start RStudio and the first thing - that appears in the console indicates the version of R you are - running. Alternatively, you can type `sessionInfo()`, which will also display - which version of R you are running. Go on - the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check - whether a more recent version is available. If so, please download and install - it. You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for - more information on how to remove old versions from your system if you wish to do so. +- Para verificar qué versión de R está usando, inicie RStudio: lo primero + que aparece en la consola indica la versión de R que está + ejecutando. Alternativamente, puede escribir `sessionInfo()`, que también mostrará + qué versión de R está ejecutando. Vaya + al [sitio web de CRAN](https://cran.r-project.org/bin/windows/base/) y verifique + si hay una versión más reciente disponible. Si es así, descárguela e + instálela. Puede [consultar aquí](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) para obtener + más información sobre cómo eliminar versiones antiguas de su sistema si así lo desea. -- Follow the steps in the instructions [for everyone](#for-everyone) at the - bottom of this page. +- Siga los pasos de las instrucciones [para todos](#para-todos) en la + parte inferior de esta página.
-::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -::::::::::::::: solution +::::::::::::::: solution -## If you don't have R and RStudio installed +## Si no tiene R y RStudio instalados -- Download R from - the [CRAN website](https://cran.r-project.org/bin/windows/base/release.htm). +- Descargue R desde + el [sitio web de CRAN](https://cran.r-project.org/bin/windows/base/release.htm). -- Run the `.exe` file that was just downloaded +- Ejecute el archivo `.exe` que acaba de descargar -- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) +- Vaya a la [página de descarga de RStudio](https://www.rstudio.com/products/rstudio/download/#download) -- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (where x, y, z, and u represent version numbers) +- En _Todos los instaladores_ seleccione **RStudio xxxx.yy.zz-uuu.exe - Windows 10/11** (donde x, y, z y u representan números de versión) -- Double click the file to install it +- Haga doble clic en el archivo para instalarlo. -- Once it's installed, open RStudio to make sure it works and you don't get any - error messages +- Una vez que esté instalado, abra RStudio para asegurarse de que funcione y no reciba ningún + mensaje de error. -- Follow the steps in the instructions [for everyone](#for-everyone) at the - bottom of this page. +- Siga los pasos de las instrucciones [para todos](#para-todos) en la + parte inferior de esta página. -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -### You are running macOS +### Está ejecutando macOS <br> -::::::::::::::: solution +::::::::::::::: solution -## If you already have R and RStudio installed +## Si ya tiene R y RStudio instalados -- Open RStudio, and click on "Help" > "Check for updates". If a new version is - available, quit RStudio, and download the latest version for RStudio. +- Abra RStudio y haga clic en "Ayuda" > "Buscar actualizaciones".
Si hay una nueva versión + disponible, salga de RStudio y descargue la última versión de RStudio. -- To check the version of R you are using, start RStudio and the first thing - that appears on the terminal indicates the version of R you are running. Alternatively, you can type `sessionInfo()`, which will - also display which version of R you are running. Go on - the [CRAN website](https://cran.r-project.org/bin/macosx/) and check - whether a more recent version is available. If so, please download and install - it. +- Para comprobar la versión de R que está utilizando, inicie RStudio: lo primero + que aparece en el terminal indica la versión de R que está ejecutando. Alternativamente, puede escribir `sessionInfo()`, que + también mostrará qué versión de R está ejecutando. Vaya + al [sitio web de CRAN](https://cran.r-project.org/bin/macosx/) y verifique + si hay una versión más reciente disponible. Si es así, descárguela e + instálela. -- Follow the steps in the instructions [for everyone](#for-everyone) at the - bottom of this page. +- Siga los pasos de las instrucciones [para todos](#para-todos) en la + parte inferior de esta página. -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -::::::::::::::: solution +::::::::::::::: solution -## If you don't have R and RStudio installed +## Si no tiene R y RStudio instalados -- Download R from - the [CRAN website](https://cran.r-project.org/bin/macosx/). +- Descargue R desde + el [sitio web de CRAN](https://cran.r-project.org/bin/macosx/).
-- Select the `.pkg` file for the latest R version +- Seleccione el archivo `.pkg` para la última versión de R -- Double click on the downloaded file to install R +- Haga doble clic en el archivo descargado para instalar R -- It is also a good idea to install [XQuartz](https://www.xquartz.org/) (needed - by some packages) +- También es una buena idea instalar [XQuartz](https://www.xquartz.org/) (necesario + en algunos paquetes) -- Go to the [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) +- Vaya a la [página de descarga de RStudio](https://www.rstudio.com/products/rstudio/download/#download) -- Under _All Installers_ select **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (where x, y, z, and u represent version numbers) +- En _Todos los instaladores_ seleccione **RStudio xxxx.yy.zz-uuu.dmg - macOS 10.15+** (donde x, y, z y u representan números de versión) -- Double click the file to install RStudio +- Haga doble clic en el archivo para instalar RStudio -- Once it's installed, open RStudio to make sure it works and you don't get any - error messages. +- Una vez que esté instalado, abra RStudio para asegurarse de que funcione y no reciba ningún mensaje de error + . -- Follow the steps in the instructions [for everyone](#for-everyone) at the - bottom of this page. +- Siga los pasos de las instrucciones [para todos](#para-todos) en la + parte inferior de esta página. -::::::::::::::::::::::::: +:::::::::::::::::::::::::::: -### You are running Linux +### Estás ejecutando Linux <br> -::::::::::::::: solution +::::::::::::::: solución -## Install R using your package manager and RStudio +## Instale R usando su administrador de paquetes y RStudio -- Follow the instructions for your distribution - from [CRAN](https://cloud.r-project.org/bin/linux), they provide information - to get the most recent version of R for common distributions. 
For most - distributions, you could use your package manager (e.g., for Debian/Ubuntu run - `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we - don't recommend this approach as the versions provided by this are - usually out of date. In any case, make sure you have at least R 4.2.0. -- Go to the RStudio download - page -- Under _All Installers_ select the version that matches your distribution, and - install it with your preferred method (e.g., with Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb` at the terminal). -- Once it's installed, open RStudio to make sure it works and you don't get any - error messages. -- Follow the steps in the [instructions for everyone](#for-everyone) +- Siga las instrucciones para su distribución + de [CRAN](https://cloud.r-project.org/bin/linux), ellas brindan información + para obtener la versión más reciente de R para distribuciones comunes. Para la mayoría de las distribuciones + , puede usar su administrador de paquetes (por ejemplo, para Debian/Ubuntu ejecute + `sudo apt-get install r-base`, y para Fedora `sudo yum install R`), pero + no recomendamos este enfoque ya que las versiones proporcionadas por este + generalmente están desactualizadas. En cualquier caso, asegúrese de tener al menos R 4.2.0. +- Vaya a la página de descarga de RStudio -::::::::::::::::::::::::: +- En _Todos los instaladores_ seleccione la versión que coincida con su distribución e + instálela con su método preferido (por ejemplo, con Debian/Ubuntu `sudo dpkg -i rstudio-xxxx.yy.zz-uuu-amd64.deb ` en la terminal). +- Una vez que esté instalado, abra RStudio para asegurarse de que funcione y no reciba ningún mensaje de error + . +- Sigue los pasos de las [instrucciones para todos](#para-todos) -### For everyone +:::::::::::::::::::::::::::: -After installing R and RStudio, you need to install a couple of -packages that will be used during the workshop. 
We will also learn -about package installation during the course to explain the following -commands. For now, simply follow the instructions below: +### Para todos -- Start RStudio by double-clicking the icon and then type: +Después de instalar R y RStudio, necesita instalar un par de paquetes +que se utilizarán durante el taller. También aprenderemos +sobre la instalación de paquetes durante el curso para explicar los siguientes comandos. +Por ahora, simplemente siga las instrucciones a continuación: + +- Inicie RStudio haciendo doble clic en el icono y luego escriba: ```r install.packages(c("BiocManager", "remotes")) BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", - "patchwork", "gridExtra", "lubridate")) + "patchwork", "gridExtra", "lubridate")) ``` From 1f3a6d4ca26146b34d5b7d37cefc4bf8b2797086 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:30:59 +0900 Subject: [PATCH 179/334] New translations learner-profiles.md (Spanish) --- locale/es/profiles/learner-profiles.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/es/profiles/learner-profiles.md b/locale/es/profiles/learner-profiles.md index 75b2c5cad..5ea1d5bc9 100644 --- a/locale/es/profiles/learner-profiles.md +++ b/locale/es/profiles/learner-profiles.md @@ -1,5 +1,5 @@ --- -title: FIXME +title: ARREGLARME --- -This is a placeholder file. Please add content here. +Este es un archivo de marcador de posición. Por favor agregue contenido aquí.
From 297df4d6f016d3c67ff40ed4fca24b3a742e7bd2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:04 +0900 Subject: [PATCH 180/334] New translations code_of_conduct.md (Spanish) --- locale/es/CODE_OF_CONDUCT.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/es/CODE_OF_CONDUCT.md b/locale/es/CODE_OF_CONDUCT.md index a820b8df5..39642f9b9 100644 --- a/locale/es/CODE_OF_CONDUCT.md +++ b/locale/es/CODE_OF_CONDUCT.md @@ -1,12 +1,12 @@ --- -title: Contributor Code of Conduct +title: Código de conducta del colaborador --- -As contributors and maintainers of this project, -we pledge to follow the [The Carpentries Code of Conduct][coc]. +Como contribuyentes y mantenedores de este proyecto, +nos comprometemos a seguir el [Código de conducta de The Carpentries][coc]. -Instances of abusive, harassing, or otherwise unacceptable behavior -may be reported by following our [reporting guidelines][coc-reporting]. +Los casos de comportamiento abusivo, acosador o de otro modo inaceptable +pueden denunciarse siguiendo nuestras [directrices de denuncia][coc-reporting].
[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html From 56c61ff361769e83362f119701721e5b654f8e4d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:08 +0900 Subject: [PATCH 181/334] New translations config.yaml (Spanish) --- locale/es/config.yaml | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/locale/es/config.yaml b/locale/es/config.yaml index 204cb59c5..ee22ebf6d 100644 --- a/locale/es/config.yaml +++ b/locale/es/config.yaml @@ -7,22 +7,22 @@ #lc: Library Carpentry #cp: Carpentries (to use for instructor training for instance) #incubator: The Carpentries Incubator -carpentry: 'incubator' +carpentry: 'incubator' #Overall title for pages. -title: 'Introduction to data analysis with R and Bioconductor' +title: 'Introducción al análisis de datos con R y Bioconductor' #Date the lesson was created (YYYY-MM-DD, this is empty by default) created: '2020-09-14' #Comma-separated list of keywords for the lesson -keywords: 'software, data, lesson, The Carpentries' +keywords: 'software, datos, lección, The Carpentries' #Life cycle stage of the lesson #possible values: pre-alpha, alpha, beta, stable -life_cycle: 'stable' +life_cycle: 'stable' #License of the lesson license: 'CC-BY 4.0' #Link to the source repository for this lesson source: 'https://github.com/carpentries-incubator/bioc-intro' #Default branch of your lesson -branch: 'main' +branch: 'main' #Who to contact if there are any issues contact: 'laurent.gatto@uclouvain.be' #Navigation ------------------------------------------------ @@ -42,13 +42,13 @@ contact: 'laurent.gatto@uclouvain.be' #- another-learner.md #Order of episodes in your lesson episodes: - - 10-data-organisation.Rmd + - 10-data-organisation.Rmd - 20-r-rstudio.Rmd - - 23-starting-with-r.Rmd - - 25-starting-with-data.Rmd + -
23-starting-with-r.Rmd + - 25-starting-with-data.Rmd - 30-dplyr.Rmd - - 40-visualization.Rmd - - 60-next-steps.Rmd + - 40-visualization.Rmd + - 60-next-steps.Rmd #Information for Learners learners: #Information for Instructors From 69464eb249164483652f0fcc5074115f3cf3ded6 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:13 +0900 Subject: [PATCH 182/334] New translations contributing.md (Spanish) --- locale/es/CONTRIBUTING.md | 266 +++++++++++++++++++------------------- 1 file changed, 133 insertions(+), 133 deletions(-) diff --git a/locale/es/CONTRIBUTING.md b/locale/es/CONTRIBUTING.md index e5957a520..9fe9e17d8 100644 --- a/locale/es/CONTRIBUTING.md +++ b/locale/es/CONTRIBUTING.md @@ -1,142 +1,142 @@ -# Contributing - -[Software Carpentry][swc-site] and [Data Carpentry][dc-site] are open source projects, -and we welcome contributions of all kinds: -new lessons, -fixes to existing material, -bug reports, -and reviews of proposed changes are all welcome. - -## Contributor Agreement - -By contributing, -you agree that we may redistribute your work under [our license](LICENSE.md). -In exchange, -we will address your issues and/or assess your change proposal as promptly as we can, -and help you become a member of our community. -Everyone involved in [Software Carpentry][swc-site] and [Data Carpentry][dc-site] -agrees to abide by our [code of conduct](CONDUCT.md). - -## How to Contribute - -The easiest way to get started is to file an issue -to tell us about a spelling mistake, -some awkward wording, -or a factual error. -This is a good way to introduce yourself -and to meet some of our community members. - -1. If you do not have a [GitHub][github] account, - you can [send us comments by email][contact]. - However, - we will be able to respond more quickly if you use one of the other methods described below. - -2.
If you have a [GitHub][github] account, - or are willing to [create one][github-join], - but do not know how to use Git, - you can report problems or suggest improvements by [creating an issue][issues]. - This allows us to assign the item to someone - and to respond to it in a threaded discussion. - -3. If you are comfortable with Git, - and would like to add or change material, - you can submit a pull request (PR). - Instructions for doing this are [included below](#using-github). - -## Where to Contribute - -1. If you wish to change this lesson, - please work in https://github.com/swcarpentry/shell-novice, - which can be viewed at https://swcarpentry.github.io/shell-novice. - -2. If you wish to change the example lesson, - please work in https://github.com/carpentries/lesson-example, - which documents the format of our lessons - and can be viewed at https://carpentries.github.io/lesson-example. - -3. If you wish to change the template used for workshop websites, - please work in https://github.com/carpentries/workshop-template. - The home page of that repository explains how to set up workshop websites, - while the extra pages in https://carpentries.github.io/workshop-template - provide more background on our design choices. - -4. If you wish to change CSS style files, tools, - or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https://github.com/carpentries/styles. - -## What to Contribute +# Contribuyendo + +[Software Carpentry][swc-site] y [Data Carpentry][dc-site] son proyectos de código abierto, +y agradecemos contribuciones de todo tipo: +nuevas lecciones, +Se aceptan correcciones al material existente, +informes de errores, +y revisiones de los cambios propuestos. + +## Acuerdo de colaborador + +Al contribuir, +, aceptas que podemos redistribuir tu trabajo bajo [nuestra licencia](LICENSE.md). 
+A cambio, +abordaremos sus problemas y/o evaluaremos su propuesta de cambio lo antes posible, +y le ayudaremos a convertirse en miembro de nuestra comunidad. +Todos los involucrados en [Software Carpentry][swc-site] y [Data Carpentry][dc-site] +aceptan cumplir con nuestro [código de conducta](CONDUCT.md). + +## Cómo contribuir + +La forma más fácil de comenzar es presentar un problema +para informarnos sobre un error ortográfico, +alguna redacción incómoda, +o un error factual. +Esta es una buena manera de presentarse +y conocer a algunos de los miembros de nuestra comunidad. + +1. Si no tienes una cuenta de [GitHub][github], + puedes \[enviarnos comentarios por correo electrónico]\[contacto]. + Sin embargo, + podremos responder más rápidamente si utiliza uno de los otros métodos que se describen a continuación. + +2. Si tienes una cuenta [GitHub][github], + o estás dispuesto a [crear una][github-join], + pero no sabes cómo usar Git, + Puede informar problemas o sugerir mejoras \[creando un problema]\[problemas]. + Esto nos permite asignar el elemento a alguien + y responderle en una discusión encadenada. + +3. Si se siente cómodo con Git, + y le gustaría agregar o cambiar material, + , puede enviar una solicitud de extracción (PR). + Las instrucciones para hacer esto se [incluyen a continuación] (#using-github). + +## Dónde contribuir + +1. Si desea cambiar esta lección, + , trabaje en https://github.com/swcarpentry/shell-novice, + , que se puede ver en https://swcarpentry.github.io/shell-novice. + +2. Si desea cambiar la lección de ejemplo, + , trabaje en https://github.com/carpentries/lesson-example, + , que documenta el formato de nuestras lecciones + y se puede ver en https://carpentries.github.io/lesson-example. . + +3. Si desea cambiar la plantilla utilizada para los sitios web de los talleres, + trabaje en https://github.com/carpentries/workshop-template. 
+ La página de inicio de ese repositorio explica cómo configurar sitios web de talleres, + , mientras que las páginas adicionales en https://carpentries.github.io/workshop-template + brindan más antecedentes sobre nuestras opciones de diseño. + +4. Si desea cambiar archivos de estilo CSS, herramientas, + o texto estándar HTML para lecciones o talleres almacenados en `_includes` o `_layouts`, + , trabaje en https://github.com/carpentries/styles. + +## Qué contribuir There are many ways to contribute, from writing new exercises and improving existing ones to updating or filling in the documentation and submitting [bug reports][issues] about things that don't work, aren't clear, or are missing. -If you are looking for ideas, -please see [the list of issues for this repository][issues], -or the issues for [Data Carpentry][dc-issues] -and [Software Carpentry][swc-issues] projects. - -Comments on issues and reviews of pull requests are just as welcome: -we are smarter together than we are on our own. -Reviews from novices and newcomers are particularly valuable: -it's easy for people who have been using these lessons for a while -to forget how impenetrable some of this material can be, -so fresh eyes are always welcome. - -## What _Not_ to Contribute - -Our lessons already contain more material than we can cover in a typical workshop, -so we are usually _not_ looking for more concepts or tools to add to them. -As a rule, -if you want to introduce a new idea, -you must (a) estimate how long it will take to teach -and (b) explain what you would take out to make room for it. -The first encourages contributors to be honest about requirements; -the second, to think hard about priorities. - -We are also not looking for exercises or other material that only run on one platform. -Our workshops typically contain a mixture of Windows, macOS, and Linux users; -in order to be usable, -our lessons must run equally well on all three. 
- -## Using GitHub - -If you choose to contribute via GitHub, -you may want to look at -[How to Contribute to an Open Source Project on GitHub][how-contribute]. -In brief: - -1. The published copy of the lesson is in the `gh-pages` branch of the repository - (so that GitHub will regenerate it automatically). - Please create all branches from that, - and merge the [master repository][repo]'s `gh-pages` branch into your `gh-pages` branch - before starting work. - Please do _not_ work directly in your `gh-pages` branch, - since that will make it difficult for you to work on other contributions. - -2. We use [GitHub flow][github-flow] to manage changes: - 1. Create a new branch in your desktop copy of this repository for each significant change. - 2. Commit the change in that branch. - 3. Push that branch to your fork of this repository on GitHub. - 4. Submit a pull request from that branch to the [master repository][repo]. - 5. If you receive feedback, - make changes on your desktop and push to your branch on GitHub: - the pull request will update automatically. - -Each lesson has two maintainers who review issues and pull requests -or encourage others to do so. -The maintainers are community volunteers, -and have final say over what gets merged into the lesson. - -## Other Resources - -General discussion of [Software Carpentry][swc-site] and [Data Carpentry][dc-site] -happens on the [discussion mailing list][discuss-list], -which everyone is welcome to join. -You can also [reach us by email][contact]. - -[contact]: mailto:admin@software-carpentry.org -[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry +Si está buscando ideas, +consulte \[la lista de problemas para este repositorio]\[problemas], +o los problemas para [Data Carpentry][dc-issues] +y proyectos de [Carpintería de Software][swc-issues]. + +Los comentarios sobre problemas y revisiones de solicitudes de extracción son igualmente bienvenidos: +somos más inteligentes juntos que solos. 
+Las reseñas de principiantes y recién llegados son particularmente valiosas: +es fácil para las personas que han estado usando estas lecciones por un tiempo +olvidar lo impenetrable que puede ser parte de este material, +tan fresco Los ojos siempre son bienvenidos. + +## Qué _no_ contribuir + +Nuestras lecciones ya contienen más material del que podemos cubrir en un taller típico, +, por lo que generalmente _no_ buscamos más conceptos o herramientas para agregarles. +Como regla general, +si quieres presentar una idea nueva, +debes (a) estimar cuánto tiempo tomará enseñar +y (b) explicar lo que tomaría para hacerle espacio. +El primero anima a los contribuyentes a ser honestos acerca de los requisitos; +el segundo, pensar mucho en las prioridades. + +Tampoco buscamos ejercicios u otro material que solo se ejecute en una plataforma. +Nuestros talleres suelen contener una combinación de usuarios de Windows, macOS y Linux; +para que sean utilizables, +nuestras lecciones deben funcionar igualmente bien en los tres. + +## Usando GitHub + +Si eliges contribuir a través de GitHub, +es posible que desees consultar +\[Cómo contribuir a un proyecto de código abierto en GitHub]\[cómo contribuir]. +En breve: + +1. La copia publicada de la lección se encuentra en la rama `gh-pages` del repositorio + (para que GitHub la regenere automáticamente). + Cree todas las ramas a partir de eso, + y combine la rama `gh-pages` del [repositorio maestro][repo] con su rama `gh-pages` + antes de comenzar a trabajar. + Por favor _no_ trabaje directamente en su rama `gh-pages`, + ya que eso le dificultará trabajar en otras contribuciones. + +2. Usamos [GitHub flow][github-flow] para gestionar los cambios: + 1. Cree una nueva rama en su copia de escritorio de este repositorio para cada cambio significativo. + 2. Confirme el cambio en esa rama. + 3. Empuje esa rama a su bifurcación de este repositorio en GitHub. + 4. 
Envíe una solicitud de extracción desde esa rama al \[repositorio maestro]\[repositorio]. + 5. Si recibe comentarios, + realice cambios en su escritorio y envíelos a su sucursal en GitHub: + la solicitud de extracción se actualizará automáticamente. + +Cada lección tiene dos mantenedores que revisan los problemas y generan solicitudes +o alientan a otros a hacerlo. +Los mantenedores son voluntarios de la comunidad, +y tienen la última palabra sobre lo que se integra en la lección. + +## Otros recursos + +La discusión general sobre [Software Carpentry][swc-site] y [Data Carpentry][dc-site] +ocurre en la \[lista de correo de discusión]\[lista de discusión], +, donde todos son bienvenidos unir. +También puede \[comunicarse con nosotros por correo electrónico]\[contactar]. + +[contact]: <correo a:admin@software-carpentry.org> +[dc-issues]: <https://github.com/issues?q=user%3Acarpintería de datos> [dc-lessons]: http://datacarpentry.org/lessons/ [dc-site]: http://datacarpentry.org/ [discuss-list]: http://lists.software-carpentry.org/listinfo/discuss From 244e2cb9c9d5e18fd9c915091836049b2dbcdba0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:19 +0900 Subject: [PATCH 183/334] New translations license.md (Spanish) --- locale/es/LICENSE.md | 90 ++++++++++++++++++++++---------------------- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/locale/es/LICENSE.md b/locale/es/LICENSE.md index 696cc3ae1..ac5c86cfd 100644 --- a/locale/es/LICENSE.md +++ b/locale/es/LICENSE.md @@ -1,55 +1,55 @@ --- -title: Licenses +title: Licencias --- -## Instructional Material +## Material didáctico -All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is -made available under the [Creative Commons Attribution -license][cc-by-human]. The following is a human-readable summary of -(and not a substitute for) the [full legal text of the CC BY 4.0 -license][cc-by-legal]. 
+Todo el material instructivo de Carpintería de software, Carpintería de datos y Carpintería de biblioteca está +disponible bajo la [licencia Creative Commons Attribution +][cc-by-human]. El siguiente es un resumen legible por humanos de +(y no un sustituto de) la [texto legal completo de la licencia CC BY 4.0 +][cc-by-legal]. -You are free: +Estas libre: -- to **Share**---copy and redistribute the material in any medium or format -- to **Adapt**---remix, transform, and build upon the material +- **Compartir**---copiar y redistribuir el material en cualquier medio o formato +- **Adaptar**---remezclar, transformar y construir sobre el material -for any purpose, even commercially. +para cualquier fin, incluso comercial. -The licensor cannot revoke these freedoms as long as you follow the -license terms. +El licenciante no puede revocar estas libertades siempre y cuando cumpla con los términos de la licencia +. -Under the following terms: +Bajo los siguientes términos: -- **Attribution**---You must give appropriate credit (mentioning that - your work is derived from work that is Copyright © Software - Carpentry and, where practical, linking to - http\://software-carpentry.org/), provide a [link to the - license][cc-by-human], and indicate if changes were made. You may do - so in any reasonable manner, but not in any way that suggests the - licensor endorses you or your use. +- **Atribución**---Debes dar el crédito apropiado (mencionando que + tu trabajo se deriva de un trabajo con Copyright © Software + Carpintería y, cuando sea práctico, vinculando a + http\://software-carpentry.org/), proporcione un [enlace a la licencia + ][cc-by-human] e indique si se realizaron cambios. Puede hacerlo + de cualquier manera razonable, pero no de ninguna manera que sugiera que el + licenciante lo respalda a usted o su uso. -**No additional restrictions**---You may not apply legal terms or -technological measures that legally restrict others from doing -anything the license permits. 
With the understanding that: +**Sin restricciones adicionales**---No puede aplicar términos legales ni +medidas tecnológicas que restrinjan legalmente a otros hacer +cualquier cosa que la licencia permita. En el entendido de que: -Notices: +Avisos: -- You do not have to comply with the license for elements of the - material in the public domain or where your use is permitted by an - applicable exception or limitation. -- No warranties are given. The license may not give you all of the - permissions necessary for your intended use. For example, other - rights such as publicity, privacy, or moral rights may limit how you - use the material. +- No tiene que cumplir con la licencia para elementos del material + en el dominio público o donde su uso esté permitido por una excepción o limitación + aplicable. +- No se dan garantías. Es posible que la licencia no le otorgue todos los + permisos necesarios para el uso previsto. Por ejemplo, otros + derechos como publicidad, privacidad o derechos morales pueden limitar la forma en que usted + utiliza el material. ## Software -Except where otherwise noted, the example programs and other software -provided by Software Carpentry and Data Carpentry are made available under the -[OSI][osi]-approved -[MIT license][mit-license]. +Excepto que se indique lo contrario, los programas de ejemplo y otro software +proporcionado por Software Carpentry y Data Carpentry están disponibles bajo la +[OSI][osi] aprobada +\[licencia MIT] \[mit-licencia]. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the @@ -59,21 +59,21 @@ distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: -The above copyright notice and this permission notice shall be -included in all copies or substantial portions of the Software. 
+El aviso de derechos de autor anterior y este aviso de permiso se incluirán +en todas las copias o partes sustanciales del Software. -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE +EL SOFTWARE SE PROPORCIONA "TAL CUAL", SIN GARANTÍA DE NINGÚN TIPO, +EXPRESA O IMPLÍCITA, INCLUYENDO PERO NO LIMITADO A LAS GARANTÍAS DE +COMERCIABILIDAD, IDONEIDAD PARA UN PROPÓSITO PARTICULAR Y +NO INFRACCIÓN. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -## Trademark +## Marca comercial -"Software Carpentry" and "Data Carpentry" and their respective logos -are registered trademarks of [Community Initiatives][ci]. +"Software Carpentry" y "Data Carpentry" y sus respectivos logotipos +son marcas comerciales registradas de [Community Initiatives][ci]. 
[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode From 62602d46e51794186246c05ad1b5ab9f234efc47 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:24 +0900 Subject: [PATCH 184/334] New translations readme.md (Spanish) --- locale/es/README.md | 92 ++++++++++++++++++++++----------------------- 1 file changed, 46 insertions(+), 46 deletions(-) diff --git a/locale/es/README.md b/locale/es/README.md index 0ec628ff2..1fd687088 100644 --- a/locale/es/README.md +++ b/locale/es/README.md @@ -1,74 +1,74 @@ -# Introduction to genomic data analysis with R and Bioconductor +# Introducción al análisis de datos genómicos con R y Bioconductor -[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) +[![Crea una cuenta de Slack con nosotros](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) -## Contributing +## Contribuyendo -We welcome all contributions to improve the lesson! Maintainers will -do their best to help you if you have any questions, concerns, or -experience any difficulties along the way. +¡Agradecemos todas las contribuciones para mejorar la lección! Los mantenedores +harán todo lo posible para ayudarlo si tiene alguna pregunta, inquietud o +experimenta alguna dificultad en el camino. -We'd like to ask you to familiarize yourself with our Contribution -Guide and have a look at the [more detailed -guidelines][lesson-example] on proper formatting, ways to render the -lesson locally, and even how to write new episodes. +Nos gustaría pedirle que se familiarice con nuestra Guía de contribución + y eche un vistazo a las \[directrices +más detalladas]\[ejemplo de lección] sobre el formato adecuado. , formas de renderizar la lección +localmente e incluso cómo escribir nuevos episodios. 
-Please see the current list of [issues][FIXME] for ideas for -contributing to this repository. For making your contribution, we use -the GitHub flow, which is nicely explained in the chapter -Contributing to a -Project -in Pro Git by Scott Chacon. +Consulte la lista actual de [problemas][FIXME] para obtener ideas sobre cómo +contribuir a este repositorio. Para hacer su contribución, utilizamos +el flujo de GitHub, que está muy bien explicado en el capítulo +[Contribuyendo a un +proyecto](http://git-scm.com/book/en/v2/GitHub-Contributing-to-a-Project) +en Pro Git por Scott Chacon. -Look for the tag -![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This -indicates that the maintainers will welcome a pull request fixing this -issue. +Busque la etiqueta +![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). Esto +indica que los mantenedores agradecerán una solicitud de extracción que solucione este +problema. -## Useful links +## Enlaces útiles -- If you're going to be developing lesson material for the first time - according to our design principles, consider reading the - [Carpentries Curriculum Development Handbook][cdh] -- Consult the [Lesson Example][lesson-example] website to find out more about - working with the lesson template +- Si va a desarrollar material didáctico por primera vez + de acuerdo con nuestros principios de diseño, considere leer el + [Manual de desarrollo curricular de The Carpentries][cdh] +- Consulte el sitio web [Ejemplo de lección][lesson-example] para obtener más información sobre + cómo trabajar con la plantilla de lección -## Lesson team +## Equipo de la lección -This lesson has been developed and is current maintained by +Esta lección ha sido desarrollada y actualmente es mantenida por -- Laurent Gatto (maintainer) -- Charlotte Soneson +- Laurent Gatto (mantenedor) +- Charlotte Soneson - Jenny Drnevich -- Robert Castelo -- Kevin Rue-Albert +- Robert Castelo +- Kevin Rue-Albert
-We would also like to acknowledge the contributions of: +También nos gustaría reconocer las contribuciones de: -- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels and Maria Doyle for their contributions during the Bioconductor introduction workshop in Heidelberg, as part of EuroBioc2021 |> 2022. -- Axelle Loriot, Marco Chiapelle, Manon Martin and Toby Hodges for various contributions and discussions. -- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges for pull requests. +- Oliver Crook, Sarah Kaspar, Nick Hirschmueller, Lisa Breckels y Maria Doyle por sus contribuciones durante el taller de introducción a Bioconductor en Heidelberg, como parte de EuroBioc2021 |> 2022. +- Axelle Loriot, Marco Chiapelle, Manon Martin y Toby Hodges por sus diversas contribuciones y debates. +- lmsimp, alorot, manonmartin, mchiapello, stavares843, JennyZadeh, csdaw, ninja-1337, fursham-h, lagerratrobe, fmichonneau, federicomarini, tobyhodges por sus solicitudes de extracción. -If we have contributed but we missed you, apologies, and feel free to add yourself with a PR. +Si ha contribuido y lo hemos omitido, le pedimos disculpas; no dude en agregarse mediante un PR.
-## Authors +## Autores -A list of contributors to the lesson can be found in [AUTHORS](AUTHORS) +Puede encontrar una lista de contribuyentes a la lección en [AUTHORS](AUTHORS) -## Citation +## Citación -To cite this lesson, please consult with [CITATION](CITATION) +Para citar esta lección, consulte [CITATION](CITATION) [lesson-example]: https://carpentries.github.io/lesson-example [cdh]: https://cdh.carpentries.org -## Testing locally +## Pruebas locales -To test locally, run the following in the lessons directory: +Para realizar la prueba localmente, ejecute lo siguiente en el directorio de la lección: ```r -sandpaper::serve() +sandpaper::serve() ``` -For more details, see the [workbench installation -instructions](https://carpentries.github.io/workbench/#installation]. +Para obtener más detalles, consulte las [instrucciones de instalación de Workbench +](https://carpentries.github.io/workbench/#installation). From 173a46f133ee0457fcbeaa05a649f146f393d095 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 14 May 2024 00:31:29 +0900 Subject: [PATCH 185/334] New translations index.md (Spanish) --- locale/es/index.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/es/index.md b/locale/es/index.md index daa0cf39d..eb88faccd 100644 --- a/locale/es/index.md +++ b/locale/es/index.md @@ -1,14 +1,14 @@ --- -permalink: index.html -site: sandpaper::sandpaper_site +permalink: index.html +site: sandpaper::sandpaper_site --- -## About this course +## Acerca de este curso -:::::::::::::::::::::::::::::::::::::::::: prereq +::::::::::::::::::::::::::::::::::::::::::::::: prereq -## Prerequisites +## Requisitos previos -- Familiarity with tabular data and spreadsheets. +- Familiaridad con datos tabulares y hojas de cálculo.
-:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::::::::: From f60b011bca77b77d387318a5dac7122e8d3753c0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 10 Jun 2024 09:27:49 +0900 Subject: [PATCH 186/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index d60804211..bc7da42f4 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Comenzando con datos +title: Partiendo de datos teaching: 30 exercises: 30 --- @@ -10,8 +10,8 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objetivos -- Describe qué es un "marco.de.datos". -- Cargue datos externos desde un archivo .csv en un marco de datos. +- Describir un objeto de tipo `data.frame`. +- Cargar datos externos desde un archivo .csv a un objeto `data.frame`. - Resumir el contenido de un marco de datos. - Describe qué es un factor. - Convertir entre cadenas y factores. From 60f1f9a194cc1d9b2395ef92899a53ab73667061 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 8 Jul 2024 18:27:08 +0900 Subject: [PATCH 187/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd index d702329b9..2e4b700aa 100644 --- a/locale/fr/episodes/10-data-organisation.Rmd +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -10,8 +10,8 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objectives -- Learn about spreadsheets, their strengths and weaknesses.
-- How do we format data in spreadsheets for effective data use? +- Découvrez les feuilles de calcul, leurs forces et leurs faiblesses. +- Comment formater les données dans des feuilles de calcul pour une utilisation efficace des données ? - Learn about common spreadsheet errors and how to correct them. - Organise your data according to tidy data principles. - Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. From 7826c1255f37c0c1588bb5b52bcaef1426e337f3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 8 Jul 2024 19:26:47 +0900 Subject: [PATCH 188/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 30 ++++++++++----------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd index 2e4b700aa..16a06f574 100644 --- a/locale/fr/episodes/10-data-organisation.Rmd +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Data organisation with spreadsheets +title: Organisation des données avec des feuilles de calcul teaching: 30 exercises: 30 --- @@ -12,36 +12,36 @@ exercises: 30 - Découvrez les feuilles de calcul, leurs forces et leurs faiblesses. - Comment formater les données dans des feuilles de calcul pour une utilisation efficace des données ? -- Learn about common spreadsheet errors and how to correct them. -- Organise your data according to tidy data principles. -- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. +- Découvrez les erreurs courantes des feuilles de calcul et comment les corriger. +- Organisez vos données selon des principes de données propres. +- Découvrez les formats de feuilles de calcul textuels tels que les formats séparés par des virgules (CSV) ou par des tabulations (TSV). 
:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- How to organise tabular data? +- Comment organiser des données tabulaires ? :::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Analyse des données et +> Visualisation dans R pour les écologistes_ de Data Carpentries. -## Spreadsheet programs +## Tableurs **Question** -- What are basic principles for using spreadsheets for good data - organization? +- Quels sont les principes de base d'utilisation des feuilles de calcul pour une bonne organisation des données + ? -**Objective** +**Objectifs** -- Describe best practices for organizing data so computers can make - the best use of datasets. +- Décrire les bonnes pratiques pour organiser les données afin que les ordinateurs puissent en faire + la meilleure utilisation. -**Keypoint** +**Point clé** -- Good data organization is the foundation of any research project. +- Une bonne organisation des données est la base de tout projet de recherche. Good data organization is the foundation of your research project. Most researchers have data or do data entry in From ff04d65a4eeaa19808a8d7538f1d11cb67f852a1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 8 Jul 2024 22:32:20 +0900 Subject: [PATCH 189/334] New translations setup.md (French) --- locale/fr/learners/setup.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/locale/fr/learners/setup.md b/locale/fr/learners/setup.md index 2c33990f5..e21e15941 100644 --- a/locale/fr/learners/setup.md +++ b/locale/fr/learners/setup.md @@ -2,12 +2,11 @@ title: Setup --- -- Please make sure you have a spreadsheet editor at hand, such as - LibreOffice, Microsoft Excel or Google Sheets. 
+- Veuillez vous assurer d'avoir accès à un tableur, tel que LibreOffice, Microsoft Excel ou Google Sheets. -- Install R, RStudio and packages (see below). +- Installez R, RStudio et les packages (voir ci-dessous). -### R and RStudio +### R et RStudio - R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is From f7c72091518643b3090daabe78d27d1c80394523 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 17 Jul 2024 22:11:03 +0900 Subject: [PATCH 190/334] New translations readme.md (French) --- locale/fr/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/fr/README.md b/locale/fr/README.md index 0ec628ff2..2c31e16fd 100644 --- a/locale/fr/README.md +++ b/locale/fr/README.md @@ -1,6 +1,6 @@ -# Introduction to genomic data analysis with R and Bioconductor +# Introduction à l'analyse de données génomiques avec le programme R et Bioconductor -[![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://swc-slack-invite.herokuapp.com/) +Contribution ## Contributing From c040d58affaf6bef5016944b2f88009319d2be18 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 17 Jul 2024 23:16:55 +0900 Subject: [PATCH 191/334] New translations readme.md (French) --- locale/fr/README.md | 21 +++++---------------- 1 file changed, 5 insertions(+), 16 deletions(-) diff --git a/locale/fr/README.md b/locale/fr/README.md index 2c31e16fd..c0323ec0a 100644 --- a/locale/fr/README.md +++ b/locale/fr/README.md @@ -1,27 +1,16 @@ -# Introduction à l'analyse de données génomiques avec le programme R et Bioconductor +# Introduction à l'analyse des données génomiques avec R et Bioconductor Contribution ## Contributing -We welcome all contributions to improve the lesson! 
Maintainers will -do their best to help you if you have any questions, concerns, or -experience any difficulties along the way. +Toutes les contributions visant à améliorer la leçon sont les bienvenues ! Les responsables de la maintenance feront de leur mieux pour vous aider si vous avez des questions, des préoccupations ou si vous rencontrez des difficultés en cours de route. -We'd like to ask you to familiarize yourself with our Contribution -Guide and have a look at the [more detailed -guidelines][lesson-example] on proper formatting, ways to render the -lesson locally, and even how to write new episodes. +Nous vous invitons à vous familiariser avec notre guide de contribution et à consulter les [directives plus détaillées][lesson-example] concernant le formatage, les moyens de générer la leçon localement, et même la manière d'écrire de nouveaux épisodes. -Please see the current list of [issues][FIXME] for ideas for -contributing to this repository. For making your contribution, we use -the GitHub flow, which is nicely explained in the chapter -Contributing to a -Project -in Pro Git by Scott Chacon. +Veuillez consulter la liste actuelle des [issues][FIXME] pour trouver des idées de contribution à ce dépôt. Pour apporter votre contribution, nous utilisons le flux GitHub, qui est bien expliqué dans le chapitre Contribuer à un projet de Pro Git, par Scott Chacon. -Look for the tag -![good\\_first\\_issue](https://img.shields.io/badge/-good%20first%20issue-gold.svg). This +Recherchez l'étiquette "bonne première question/problème". Cela indique que les responsables accueilleront favorablement une pull request corrigeant ce problème. This indicates that the maintainers will welcome a pull request fixing this issue.
From c8aeba4fa05cf2c4bc8b891bf010599727fded8e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Tue, 23 Jul 2024 23:09:53 +0900 Subject: [PATCH 192/334] New translations readme.md (French) --- locale/fr/README.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/locale/fr/README.md b/locale/fr/README.md index c0323ec0a..e150cc879 100644 --- a/locale/fr/README.md +++ b/locale/fr/README.md @@ -14,13 +14,10 @@ Recherchez l'étiquette "bonne première question/problème". Cela indique que l indicates that the maintainers will welcome a pull request fixing this issue. -## Useful links +## Liens utiles -- If you're going to be developing lesson material for the first time - according to our design principles, consider reading the - [Carpentries Curriculum Development Handbook][cdh] -- Consult the [Lesson Example][lesson-example] website to find out more about - working with the lesson template +- Si vous développez pour la première fois du matériel pédagogique selon nos principes de conception, nous vous invitons à lire le Manuel de développement des programmes d'études de Carpentries. +- Consultez le site web de l'exemple de leçon pour en savoir plus sur l'utilisation du modèle de leçon. ## Lesson team From d1a9b2902b0363639298e05bad494315e07712e5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:12 +0900 Subject: [PATCH 193/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 1132 +++++++++---------- 1 file changed, 566 insertions(+), 566 deletions(-) diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd index 16a06f574..0ba2fbaf4 100644 --- a/locale/fr/episodes/10-data-organisation.Rmd +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -43,261 +43,261 @@ exercises: 30 - Une bonne organisation des données est la base de tout projet de recherche. 
-Good data organization is the foundation of your research -project. Most researchers have data or do data entry in -spreadsheets. Spreadsheet programs are very useful graphical -interfaces for designing data tables and handling very basic data -quality control functions. See also @Broman:2018. +Une bonne organisation des données est la base de votre projet de recherche. +La plupart des chercheurs disposent de données ou effectuent la saisie de données dans des feuilles de calcul. +Les tableurs sont des interfaces graphiques +très utiles pour concevoir des tableaux de données et gérer des fonctions de contrôle qualité +de données très basiques. Voir aussi @Broman:2018. -### Spreadsheet outline +### Aperçu de la feuille de calcul -Spreadsheets are good for data entry. Therefore we have a lot of data -in spreadsheets. Much of your time as a researcher will be spent in -this 'data wrangling' stage. It's not the most fun, but it's -necessary. We'll teach you how to think about data organization and -some practices for more effective data wrangling. +Les feuilles de calcul sont utiles pour la saisie de données. Nous avons donc beaucoup de données +dans des feuilles de calcul. Une grande partie de votre temps en tant que chercheur sera consacrée à +cette étape de « gestion des données ». Ce n'est pas le plus amusant, mais c'est +nécessaire. Nous vous apprendrons comment réfléchir à l'organisation des données et +quelques pratiques pour une gestion plus efficace des données.
-### What this lesson will not teach you +### Ce que cette leçon ne vous apprendra pas -- How to do _statistics_ in a spreadsheet -- How to do _plotting_ in a spreadsheet -- How to _write code_ in spreadsheet programs +- Comment faire des _statistiques_ dans une feuille de calcul +- Comment faire un _traçage_ dans une feuille de calcul +- Comment _écrire du code_ dans des tableurs -If you're looking to do this, a good reference is Head First +Si vous cherchez à faire cela, une bonne référence est Head First Excel, -published by O'Reilly. +publié par O'Reilly. -### Why aren't we teaching data analysis in spreadsheets +### Pourquoi n'enseignons-nous pas l'analyse des données dans des feuilles de calcul -- Data analysis in spreadsheets usually requires a lot of manual - work. If you want to change a parameter or run an analysis with a - new dataset, you usually have to redo everything by hand. (We do - know that you can create macros, but see the next point.) +- L'analyse des données dans des feuilles de calcul nécessite généralement beaucoup de + travail manuel. Si vous souhaitez modifier un paramètre ou exécuter une analyse avec un nouvel ensemble de données + , vous devez généralement tout refaire à la main. (Nous + savons que vous pouvez créer des macros, mais voyez le point suivant.) -- It is also difficult to track or reproduce statistical or plotting - analyses done in spreadsheet programs when you want to go back to - your work or someone asks for details of your analysis. +- Il est également difficile de suivre ou de reproduire des analyses statistiques ou graphiques + effectuées dans des tableurs lorsque vous souhaitez revenir à + votre travail ou que quelqu'un vous demande des détails sur votre analyse. -Many spreadsheet programs are available. Since most participants -utilise Excel as their primary spreadsheet program, this lesson will -make use of Excel examples. A free spreadsheet program that can also -be used is LibreOffice. 
Commands may differ a bit between programs, -but the general idea is the same. +De nombreux tableurs sont disponibles. Étant donné que la plupart des participants +utilisent Excel comme tableur principal, cette leçon +utilisera des exemples Excel. Un tableur gratuit qui peut également +être utilisé est LibreOffice. Les commandes peuvent différer un peu selon les programmes, +mais l'idée générale est la même. -Spreadsheet programs encompass a lot of the things we need to be able -to do as researchers. We can use them for: +Les tableurs englobent de nombreuses choses que nous devons pouvoir +faire en tant que chercheurs. Nous pouvons les utiliser pour : -- Data entry -- Organizing data -- Subsetting and sorting data -- Statistics -- Plotting +- La saisie des données +- Organisation des données +- Sous-ensemble et tri des données +- Statistiques +- Traçage -Spreadsheet programs use tables to represent and display data. Data -formatted as tables is also the main theme of this chapter, and we -will see how to organise data into tables in a standardised way to -ensure efficient downstream analysis. +Les tableurs utilisent des tableaux pour représenter et afficher les données. Les données +formatées sous forme de tableaux sont également le thème principal de ce chapitre, et nous +verrons comment organiser les données en tableaux de manière standardisée pour +assurer une analyse efficace en aval. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Discuss the following points with your neighbour +## Défi : Discutez des points suivants avec votre voisin -- Have you used spreadsheets, in your research, courses, - or at home? -- What kind of operations do you do in spreadsheets? -- Which ones do you think spreadsheets are good for? -- Have you accidentally done something in a spreadsheet program that made you - frustrated or sad? +- Avez-vous utilisé des feuilles de calcul, dans vos recherches, vos cours, + ou à la maison ? 
+- Quel type d’opérations effectuez-vous dans des feuilles de calcul ? +- Selon vous, pour lesquelles les feuilles de calcul sont-elles utiles ? +- Avez-vous accidentellement fait quelque chose dans un tableur qui vous a rendu + frustré ou triste ? :::::::::::::::::::::::::::::::::::::::::::::::::: -### Problems with spreadsheets +### Problèmes avec les feuilles de calcul -Spreadsheets are good for data entry, but in reality we tend to -use spreadsheet programs for much more than data entry. We use them -to create data tables for publications, to generate summary -statistics, and make figures. +Les feuilles de calcul sont utiles pour la saisie de données, mais en réalité, nous avons tendance à +utiliser des tableurs pour bien plus que la saisie de données. Nous les utilisons +pour créer des tableaux de données pour les publications, pour générer des statistiques récapitulatives +et réaliser des figures. Generating tables for publications in a spreadsheet is not optimal - often, when formatting a data table for publication, we're reporting key summary statistics in a way that is not really meant to be read as data, and often involves special formatting -(merging cells, creating borders, making it pretty). We advise you to -do this sort of operation within your document editing software. +(merging cells, creating borders, making it pretty). Nous vous conseillons +d'effectuer ce genre d'opération au sein de votre logiciel d'édition de documents. -The latter two applications, generating statistics and figures, should -be used with caution: because of the graphical, drag and drop nature of -spreadsheet programs, it can be very difficult, if not impossible, to -replicate your steps (much less retrace anyone else's), particularly if your -stats or figures require you to do more complex calculations. Furthermore, -in doing calculations in a spreadsheet, it's easy to accidentally apply a -slightly different formula to multiple adjacent cells.
When using a -command-line based statistics program like R or SAS, it's practically -impossible to apply a calculation to one observation in your -dataset but not another unless you're doing it on purpose. +Ces deux dernières applications, la génération de statistiques et de figures, doivent +être utilisées avec précaution : en raison de la nature graphique, par glisser-déposer, des +tableurs, il peut être très difficile, voire impossible, de +reproduire vos étapes (et encore moins de retracer celles de quelqu'un d'autre), en particulier si vos +statistiques ou figures nécessitent que vous fassiez des calculs plus complexes. De plus, +en effectuant des calculs dans une feuille de calcul, il est facile d'appliquer accidentellement une +formule légèrement différente à plusieurs cellules adjacentes. Lorsque vous utilisez un programme de statistiques basé sur une ligne de commande +comme R ou SAS, il est pratiquement +impossible d'appliquer un calcul à une observation de votre ensemble de données +mais pas à une autre, sauf si vous le faites exprès. -### Using spreadsheets for data entry and cleaning +### Utiliser des feuilles de calcul pour la saisie et le nettoyage des données -In this lesson, we will assume that you are most likely using Excel as -your primary spreadsheet program - there are others (gnumeric, Calc -from OpenOffice), and their functionality is similar, but Excel seems -to be the program most used by biologists and biomedical researchers. +Dans cette leçon, nous supposerons que vous utilisez très probablement Excel comme +votre tableur principal - il en existe d'autres (gnumeric, Calc +d'OpenOffice), et leurs fonctionnalités sont similaires, mais Excel semble +être le programme le plus utilisé par les biologistes et les chercheurs biomédicaux. -In this lesson we're going to talk about: +Dans cette leçon, nous allons parler de : -1. Formatting data tables in spreadsheets -2. Formatting problems -3. Exporting data +1.
Formatage des tableaux de données dans des feuilles de calcul +2. Problèmes de formatage +3. Exporter des données -## Formatting data tables in spreadsheets +## Formatage des tableaux de données dans des feuilles de calcul -**Questions** +**Questions** -- How do we format data in spreadsheets for effective data use? +- Comment formater les données dans des feuilles de calcul pour une utilisation efficace des données ? -**Objectives** +**Objectifs** -- Describe best practices for data entry and formatting in - spreadsheets. +- Décrire les meilleures pratiques pour la saisie et le formatage des données + dans les feuilles de calcul. -- Apply best practices to arrange variables and observations in a - spreadsheet. +- Appliquer les meilleures pratiques pour organiser les variables et les observations + dans une feuille de calcul. -**Keypoints** +**Points clés** -- Never modify your raw data. Always make a copy before making any - changes. +- Ne modifiez jamais vos données brutes. Faites toujours une copie avant d'apporter des + modifications. -- Keep track of all of the steps you take to clean your data in a - plain text file. +- Gardez une trace de toutes les étapes que vous suivez pour nettoyer vos données + dans un fichier texte brut. -- Organise your data according to tidy data principles. +- Organisez vos données selon les principes des données ordonnées. -The most common mistake made is treating spreadsheet programs like lab -notebooks, that is, relying on context, notes in the margin, spatial -layout of data and fields to convey information. As humans, we can -(usually) interpret these things, but computers don't view information -the same way, and unless we explain to the computer what every single -thing means (and that can be hard!), it will not be able to see how -our data fits together.
+L'erreur la plus courante est de traiter les tableurs comme des cahiers de laboratoire, +c'est-à-dire de s'appuyer sur le contexte, les notes dans la marge, la disposition spatiale +des données et des champs pour transmettre des informations. En tant qu'humains, nous pouvons +(généralement) interpréter ces choses, mais les ordinateurs ne voient pas les informations +de la même manière, et à moins que nous expliquions à l'ordinateur ce que chaque +chose signifie (et cela peut être difficile !), il ne pourra pas voir comment +nos données s'emboîtent. -Using the power of computers, we can manage and analyse data in much -more effective and faster ways, but to use that power, we have to set -up our data for the computer to be able to understand it (and -computers are very literal). +En utilisant la puissance des ordinateurs, nous pouvons gérer et analyser les données de manière beaucoup +plus efficace et plus rapide, mais pour utiliser cette puissance, nous devons +configurer nos données pour que l'ordinateur puisse les comprendre (et +les ordinateurs sont très littéraux). -This is why it's extremely important to set up well-formatted tables -from the outset - before you even start entering data from your very -first preliminary experiment. Data organization is the foundation of -your research project. It can make it easier or harder to work with -your data throughout your analysis, so it's worth thinking about when -you're doing your data entry or setting up your experiment. You can -set things up in different ways in spreadsheets, but some of these -choices can limit your ability to work with the data in other programs -or have the you-of-6-months-from-now or your collaborator work with -the data. +C'est pourquoi il est extrêmement important de mettre en place des tableaux +bien formatés dès le départ - avant même de commencer à saisir les données de votre toute première expérience préliminaire. +L’organisation des données est le fondement de +votre projet de recherche.
Cela peut rendre plus facile ou plus difficile le travail avec +vos données tout au long de votre analyse, il vaut donc la peine d'y réfléchir au moment +où vous effectuez votre saisie de données ou configurez votre expérience. Vous pouvez +configurer les choses de différentes manières dans des feuilles de calcul, mais certains de ces +choix peuvent limiter votre capacité à travailler avec les données dans d'autres programmes, +ou celle de votre « vous » dans six mois ou de votre collaborateur à travailler avec +les données. -**Note:** the best layouts/formats (as well as software and -interfaces) for data entry and data analysis might be different. It is -important to take this into account, and ideally automate the -conversion from one to another. +**Remarque :** les meilleures mises en page/formats (ainsi que les logiciels et les interfaces) +pour la saisie et l'analyse des données peuvent être différents. Il est +important d’en tenir compte, et idéalement d’automatiser la conversion +de l’un à l’autre. -### Keeping track of your analyses +### Garder une trace de vos analyses -When you're working with spreadsheets, during data clean up or -analyses, it's very easy to end up with a spreadsheet that looks very -different from the one you started with. In order to be able to -reproduce your analyses or figure out what you did when a reviewer or -instructor asks for a different analysis, you should +Lorsque vous travaillez avec des feuilles de calcul, lors d'un nettoyage de données ou d'analyses, +il est très facile de vous retrouver avec une feuille de calcul qui semble très +différente de celle avec laquelle vous avez commencé. Afin de pouvoir +reproduire vos analyses ou comprendre ce que vous avez fait lorsqu'un évaluateur ou un +instructeur demande une analyse différente, vous devez -- create a new file with your cleaned or analysed data. Don't modify - the original dataset, or you will never know where you started!
+- créer un nouveau fichier avec vos données nettoyées ou analysées. Ne modifiez pas + l'ensemble de données d'origine, sinon vous ne saurez jamais par où vous avez commencé ! -- keep track of the steps you took in your clean up or analysis. You - should track these steps as you would any step in an experiment. We - recommend that you do this in a plain text file stored in the same - folder as the data file. +- garder une trace des étapes que vous avez suivies lors de votre nettoyage ou de votre analyse. Vous + devez consigner ces étapes comme vous le feriez pour n’importe quelle étape d’une expérience. Nous + vous recommandons de le faire dans un fichier texte brut stocké dans le même dossier + que le fichier de données. -This might be an example of a spreadsheet setup: +Ceci pourrait être un exemple de configuration de feuille de calcul : ![](fig/spreadsheet-setup-updated.png) -Put these principles in to practice today during your exercises. +Mettez ces principes en pratique aujourd’hui lors de vos exercices. -While versioning is out of scope for this course, you can look at the -Carpentries lesson on -['Git'](https://swcarpentry.github.io/git-novice/) to learn how to -maintain **version control** over your data. See also this blog -post for a quick tutorial or -@Perez-Riverol:2016 for a more research-oriented use-case. +Bien que la gestion des versions soit hors de portée de ce cours, vous pouvez consulter la leçon +Carpentries sur +['Git'](https://swcarpentry.github.io/git-novice/) pour découvrir comment +maintenir le **contrôle de version** sur vos données. Voir aussi ce billet de +blog pour un tutoriel rapide ou +@Perez-Riverol:2016 pour un cas d'utilisation plus orienté recherche. -### Structuring data in spreadsheets +### Structuration des données dans des feuilles de calcul -The cardinal rules of using spreadsheet programs for data: +Les règles cardinales de l’utilisation des tableurs pour les données : -1.
Put all your variables in columns - the thing you're measuring, - like 'weight' or 'temperature'. -2. Put each observation in its own row. -3. Don't combine multiple pieces of information in one cell. Sometimes - it just seems like one thing, but think if that's the only way - you'll want to be able to use or sort that data. -4. Leave the raw data raw - don't change it! -5. Export the cleaned data to a text-based format like CSV - (comma-separated values) format. This ensures that anyone can use - the data, and is required by most data repositories. +1. Mettez toutes vos variables dans des colonnes - la chose que vous mesurez, + comme « poids » ou « température ». +2. Placez chaque observation dans sa propre ligne. +3. Ne combinez pas plusieurs informations dans une seule cellule. Parfois + cela semble être une seule chose, mais demandez-vous si c'est la seule façon + dont vous voudrez pouvoir utiliser ou trier ces données. +4. Laissez les données brutes brutes – ne les modifiez pas ! +5. Exportez les données nettoyées dans un format texte tel que le format CSV + (valeurs séparées par des virgules). Cela garantit que n’importe qui peut utiliser + les données et est requis par la plupart des référentiels de données. -For instance, we have data from patients that visited several -hospitals in Brussels, Belgium. They recorded the date of the visit, -the hospital, the patients' gender, weight and blood group. +Par exemple, nous disposons de données provenant de patients ayant visité plusieurs +hôpitaux à Bruxelles, en Belgique. Ils ont enregistré la date de la visite, +l'hôpital, le sexe, le poids et le groupe sanguin des patients. -If we were to keep track of the data like this: +Si nous devions garder une trace des données comme ceci : ![](fig/multiple-info.png) -the problem is that the ABO and Rhesus groups are in the same `Blood` -type column.
So, if they wanted to look at all observations of the A -group or look at weight distributions by ABO group, it would be tricky -to do this using this data setup. If instead we put the ABO and Rhesus -groups in different columns, you can see that it would be much easier. +le problème est que les groupes ABO et Rhésus sont dans la même colonne de type `Blood` +. Donc, s'ils voulaient examiner toutes les observations du groupe A +ou examiner les distributions de poids par groupe ABO, il serait difficile +de le faire en utilisant cette configuration de données. Si à la place nous mettions les groupes ABO et Rhésus +dans des colonnes différentes, vous voyez que ce serait beaucoup plus facile. ![](fig/single-info.png) -An important rule when setting up a datasheet, is that **columns are -used for variables** and **rows are used for observations**: +Une règle importante lors de la création d'une feuille de données est que les **colonnes sont +utilisées pour les variables** et les **lignes sont utilisées pour les observations** : -- columns are variables -- rows are observations -- cells are individual values +- les colonnes sont des variables +- les lignes sont des observations +- les cellules sont des valeurs individuelles ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: We're going to take a messy dataset and describe how we would clean it up. +## Défi : Nous allons prendre un ensemble de données désordonné et décrire comment nous allons le nettoyer. -1. Download a messy dataset by clicking - [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). +1. Téléchargez un ensemble de données désordonné en cliquant sur + [ici](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). -2. Open up the data in a spreadsheet program. +2. Ouvrez les données dans un tableur. -3. You can see that there are two tabs. 
The data contains various - clinical variables recorded in various hospitals in Brussels during - the first and second COVID-19 waves in 2020. As you can see, the - data have been recorded differently during the March and November - waves. Now you're the person in charge of this project and you want - to be able to start analyzing the data. +3. Vous pouvez voir qu'il y a deux onglets. Les données contiennent diverses + variables cliniques enregistrées dans divers hôpitaux bruxellois lors des + première et deuxième vagues de COVID-19 en 2020. Comme vous pouvez le constater, les données + ont été enregistrées différemment lors des vagues + de mars et novembre. Vous êtes désormais la personne en charge de ce projet et vous souhaitez + pouvoir commencer à analyser les données. -4. With the person next to you, identify what is wrong with this - spreadsheet. Also discuss the steps you would need to take to clean - up first and second wave tabs, and to put them all together in one - spreadsheet. +4. Avec la personne à côté de vous, identifiez ce qui ne va pas avec cette + feuille de calcul. Discutez également des étapes que vous devrez suivre pour nettoyer + les onglets de la première et de la deuxième vague, et pour les rassembler tous dans une seule + feuille de calcul. -**Important:** Do not forget our first piece of advice: to create a -new file (or tab) for the cleaned data, never modify your original -(raw) data. +**Important :** N'oubliez pas notre premier conseil : créez un +nouveau fichier (ou onglet) pour les données nettoyées et ne modifiez jamais vos données +(brutes) d'origine. :::::::::::::::::::::::::::::::::::::::::::::::::: -After you go through this exercise, we'll discuss as a group what was -wrong with this data and how you would fix it. +Après avoir effectué cet exercice, nous discuterons en groupe de ce qui n'allait pas +avec ces données et de la manière dont vous pourriez y remédier. <!-- - Take about 10 minutes to work on this exercise. 
--> @@ -319,45 +319,45 @@ wrong with this data and how you would fix it. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Once you have tidied up the data, answer the following questions: +## Défi : Une fois que vous avez rangé les données, répondez aux questions suivantes : -- How many men and women took part in the study? -- How many A, AB, and B types have been tested? -- As above, but disregarding the contaminated samples? -- How many Rhesus + and - have been tested? -- How many universal donors (O-) have been tested? -- What is the average weight of AB men? -- How many samples have been tested in the different hospitals? +- Combien d’hommes et de femmes ont participé à l’étude ? +- Combien de types A, AB et B ont été testés ? +- Comme ci-dessus, mais sans tenir compte des échantillons contaminés ? +- Combien de Rhésus + et - ont été testés ? +- Combien de donneurs universels (O-) ont été testés ? +- Quel est le poids moyen des hommes AB ? +- Combien d’échantillons ont été testés dans les différents hôpitaux ? :::::::::::::::::::::::::::::::::::::::::::::::::: -An **excellent reference**, in particular with regard to R scripting -is the _Tidy Data_ paper @Wickham:2014. +Une **excellente référence**, en particulier en ce qui concerne les scripts R +est l'article _Tidy Data_ @Wickham:2014. -## Common spreadsheet errors +## Erreurs courantes dans les feuilles de calcul -**Questions** +**Des questions** -- What are some common challenges with formatting data in spreadsheets - and how can we avoid them? +- Quels sont les défis courants liés au formatage des données dans les feuilles de calcul + et comment pouvons-nous les éviter ? -**Objectives** +**Objectifs** -- Recognise and resolve common spreadsheet formatting problems. +- Reconnaître et résoudre les problèmes courants de formatage des feuilles de calcul. -**Keypoints** +**Points clés** -- Avoid using multiple tables within one spreadsheet. -- Avoid spreading data across multiple tabs. 
-- Record zeros as zeros. -- Use an appropriate null value to record missing data. -- Don't use formatting to convey information or to make your spreadsheet look pretty. -- Place comments in a separate column. -- Record units in column headers. -- Include only one piece of information in a cell. -- Avoid spaces, numbers and special characters in column headers. -- Avoid special characters in your data. -- Record metadata in a separate plain text file. +- Évitez d'utiliser plusieurs tableaux dans une même feuille de calcul. +- Évitez de répartir les données sur plusieurs onglets. +- Enregistrez les zéros comme des zéros. +- Utilisez une valeur nulle appropriée pour enregistrer les données manquantes. +- N'utilisez pas de formatage pour transmettre des informations ou pour donner une jolie apparence à votre feuille de calcul. +- Placez les commentaires dans une colonne séparée. +- Enregistrez les unités dans les en-têtes de colonnes. +- Incluez une seule information dans une cellule. +- Évitez les espaces, les chiffres et les caractères spéciaux dans les en-têtes de colonnes. +- Évitez les caractères spéciaux dans vos données. +- Enregistrez les métadonnées dans un fichier texte brut séparé. <!-- This lesson is meant to be used as a reference for discussion as --> @@ -367,376 +367,376 @@ is the _Tidy Data_ paper @Wickham:2014. <!-- refer to responses to the exercise in the previous lesson. --> -There are a few potential errors to be on the lookout for in your own -data as well as data from collaborators or the Internet. If you are -aware of the errors and the possible negative effect on downstream -data analysis and result interpretation, it might motivate yourself -and your project members to try and avoid them. Making small changes -to the way you format your data in spreadsheets, can have a great -impact on efficiency and reliability when it comes to data cleaning -and analysis. 
- -- [Using multiple tables](#tables) -- [Using multiple tabs](#tabs) -- [Not filling in zeros](#zeros) -- [Using problematic null values](#null) -- [Using formatting to convey information](#formatting) -- [Using formatting to make the data sheet look pretty](#formatting_pretty) -- [Placing comments or units in cells](#units) -- [Entering more than one piece of information in a cell](#info) -- [Using problematic field names](#field_name) -- [Using special characters in data](#special) -- [Inclusion of metadata in data table](#metadata) - -### Using multiple tables {#tables} - -A common strategy is creating multiple data tables within one -spreadsheet. This confuses the computer, so don't do this! When you -create multiple tables within one spreadsheet, you're drawing false -associations between things for the computer, which sees each row as -an observation. You're also potentially using the same field name in -multiple places, which will make it harder to clean your data up into -a usable form. The example below depicts the problem: +Il y a quelques erreurs potentielles à surveiller dans vos propres données +ainsi que dans les données de vos collaborateurs ou d'Internet. Si vous êtes +conscient des erreurs et de l'effet négatif possible sur l'analyse des données +en aval et l'interprétation des résultats, cela pourrait vous motiver +ainsi que les membres de votre projet à essayer de les éviter. Apporter de petits changements +à la façon dont vous formatez vos données dans des feuilles de calcul peut avoir un grand +impact sur l'efficacité et la fiabilité en matière de nettoyage +et d'analyse des données. 
+ +- [Utiliser plusieurs tables](#tables) +- [Utiliser plusieurs onglets](#tabs) +- [Ne pas remplir les zéros](#zeros) +- [Utilisation de valeurs nulles problématiques](#null) +- [Utiliser le formatage pour transmettre des informations](#formatting) +- [Utiliser le formatage pour rendre la feuille de données jolie](#formatting_pretty) +- [Placer des commentaires ou des unités dans des cellules](#units) +- [Saisie de plusieurs informations dans une cellule](#info) +- [Utilisation de noms de champs problématiques](#field_name) +- [Utilisation de caractères spéciaux dans les données](#special) +- [Inclusion de métadonnées dans le tableau de données](#metadata) + +### Utilisation de plusieurs tables {#tables} + +Une stratégie courante consiste à créer plusieurs tableaux de données dans une seule feuille de calcul. +Cela perturbe l'ordinateur, alors ne faites pas ça ! Lorsque vous +créez plusieurs tableaux dans une même feuille de calcul, vous établissez de fausses +associations entre les éléments pour l'ordinateur, qui considère chaque ligne comme +une observation. Vous utilisez également potentiellement le même nom de champ à +plusieurs endroits, ce qui rendra plus difficile la mise en forme de vos données sous +une forme utilisable. L'exemple ci-dessous illustre le problème : ![](fig/2_datasheet_example.jpg) -In the example above, the computer will see (for example) row 4 and -assume that all columns A-AF refer to the same sample. This row -actually represents four distinct samples (sample 1 for each of four -different collection dates - May 29th, June 12th, June 19th, and June -26th), as well as some calculated summary statistics (an average (avr) -and standard error of measurement (SEM)) for two of those -samples. Other rows are similarly problematic. +Dans l'exemple ci-dessus, l'ordinateur verra (par exemple) la ligne 4 et +supposera que toutes les colonnes A-AF font référence au même échantillon. 
Cette ligne +représente en fait quatre échantillons distincts (échantillon 1 pour chacune des quatre +dates de collecte différentes - 29 mai, 12 juin, 19 juin et +26 juin), ainsi que quelques statistiques récapitulatives calculées (une moyenne (avr) +et une erreur type de mesure (SEM)) pour deux de ces +échantillons. D'autres lignes posent également problème. -### Using multiple tabs {#tabs} +### Utiliser plusieurs onglets {#tabs} -But what about workbook tabs? That seems like an easy way to organise -data, right? Well, yes and no. When you create extra tabs, you fail to -allow the computer to see connections in the data that are there (you -have to introduce spreadsheet application-specific functions or -scripting to ensure this connection). Say, for instance, you make a -separate tab for each day you take a measurement. +Mais qu’en est-il des onglets du classeur ? Cela semble être un moyen simple d'organiser les données +, n'est-ce pas ? Eh bien, oui et non. Lorsque vous créez des onglets supplémentaires, vous ne parvenez pas +à permettre à l'ordinateur de voir les connexions dans les données qui s'y trouvent (vous devez +introduire des fonctions spécifiques à l'application de feuille de calcul ou +des scripts pour garantir cette connexion). Supposons, par exemple, que vous créiez un +onglet séparé pour chaque jour où vous prenez une mesure. -This isn't good practice for two reasons: +Ce n'est pas une bonne pratique pour deux raisons : -1. you are more likely to accidentally add inconsistencies to your - data if each time you take a measurement, you start recording data - in a new tab, and +1. vous êtes plus susceptible d'ajouter accidentellement des incohérences à vos données + si à chaque fois que vous prenez une mesure, vous commencez à enregistrer les données + dans un nouvel onglet, et 2. 
even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyse the data because you will have to combine these data into a single - datatable. You will have to explicitly tell the computer how to - combine tabs - and if the tabs are inconsistently formatted, you - might even have to do it manually. - -The next time you're entering data, and you go to create another tab -or table, ask yourself if you could avoid adding this tab by adding -another column to your original spreadsheet. We used multiple tabs in -our example of a messy data file, but now you've seen how you can -reorganise your data to consolidate across tabs. - -Your data sheet might get very long over the course of the -experiment. This makes it harder to enter data if you can't see your -headers at the top of the spreadsheet. But don't repeat your header -row. These can easily get mixed into the data, leading to problems -down the road. Instead you can freeze the column -headers -so that they remain visible even when you have a spreadsheet with many -rows. - -### Not filling in zeros {#zeros} - -It might be that when you're measuring something, it's usually a zero, -say the number of times a rabbit is observed in the survey. Why bother -writing in the number zero in that column, when it's mostly zeros? - -However, there's a difference between a zero and a blank cell in a -spreadsheet. To the computer, a zero is actually data. You measured or -counted it. A blank cell means that it wasn't measured and the -computer will interpret it as an unknown value (also known as a null -or missing value). - -The spreadsheets or statistical programs will likely misinterpret -blank cells that you intend to be zeros. By not entering the value of -your observation, you are telling your computer to represent that data -as unknown or missing (null). This can cause problems with subsequent -calculations or analyses. 
For example, the average of a set of numbers -which includes a single null value is always null (because the -computer can't guess the value of the missing observations). Because -of this, it's very important to record zeros as zeros and truly -missing data as nulls. - -### Using problematic null values {#null} - -**Example**: using -999 or other numerical values (or zero) to -represent missing data. + datatable. Vous devrez indiquer explicitement à l'ordinateur comment + combiner les onglets - et si les onglets ne sont pas formatés de manière cohérente, vous + devrez peut-être même le faire manuellement. + +La prochaine fois que vous saisirez des données et que vous créerez un autre onglet +ou un autre tableau, demandez-vous si vous pourriez éviter d'ajouter cet onglet en ajoutant +une autre colonne à votre feuille de calcul d'origine. Nous avons utilisé plusieurs onglets dans +notre exemple de fichier de données désordonné, mais vous avez maintenant vu comment vous pouvez +réorganiser vos données pour les consolider entre les onglets. + +Votre feuille de données peut devenir très longue au cours de l'expérience. +Cela rend plus difficile la saisie des données si vous ne voyez pas vos en-têtes +en haut de la feuille de calcul. Mais ne répétez pas votre ligne d'en-tête. +Les en-têtes répétés peuvent facilement se mélanger aux données, entraînant des problèmes +plus tard. Au lieu de cela, vous pouvez [figer les en-têtes de colonnes +](https://support.office.com/en-ca/article/Freeze-column-headings-for-easy-scrolling-57ccce0c-cf85-4725-9579-c5d13106ca6a) +afin qu'ils restent visibles même lorsque vous disposez d'une feuille de calcul comportant de nombreuses +lignes. + +### Ne pas remplir les zéros {#zeros} + +Il se peut que lorsque vous mesurez quelque chose, il s'agisse généralement d'un zéro, +par exemple le nombre de fois qu'un lapin est observé dans l'enquête. Pourquoi s'embêter +à écrire le chiffre zéro dans cette colonne, alors qu'il s'agit principalement de zéros ? 
+ +Cependant, il existe une différence entre un zéro et une cellule vide dans une feuille de calcul +. Pour l’ordinateur, un zéro est en réalité une donnée. Vous l'avez mesuré ou +compté. Une cellule vide signifie qu'elle n'a pas été mesurée et l'ordinateur +l'interprétera comme une valeur inconnue (également appelée valeur nulle +ou valeur manquante). + +Les feuilles de calcul ou les programmes statistiques interpréteront probablement mal +les cellules vides que vous envisagez d'être des zéros. En n'entrant pas la valeur de +votre observation, vous dites à votre ordinateur de représenter ces données +comme inconnues ou manquantes (nulles). Cela peut entraîner des problèmes lors des +calculs ou analyses ultérieurs. Par exemple, la moyenne d'un ensemble de nombres +qui comprend une seule valeur nulle est toujours nulle (car l'ordinateur +ne peut pas deviner la valeur des observations manquantes). Parce que +de cela, il est très important d'enregistrer les zéros comme des zéros et vraiment +les données manquantes comme des valeurs nulles. + +### Utilisation de valeurs nulles problématiques {#null} + +**Exemple** : utiliser -999 ou d'autres valeurs numériques (ou zéro) pour +représente les données manquantes. **Solutions**: -There are a few reasons why null values get represented differently -within a dataset. Sometimes confusing null values are automatically -recorded from the measuring device. If that's the case, there's not -much you can do, but it can be addressed in data cleaning with a tool -like -[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) -before analysis. Other times different null values are used to convey -different reasons why the data isn't there. This is important -information to capture, but is in effect using one column to capture -two pieces of information. Like for using formatting to convey +Il existe plusieurs raisons pour lesquelles les valeurs nulles sont représentées différemment +dans un ensemble de données. 
Parfois, des valeurs nulles déroutantes sont automatiquement +enregistrées à partir de l'appareil de mesure. Si tel est le cas, vous ne pouvez pas +faire grand-chose, mais cela peut être résolu lors du nettoyage des données avec un outil +comme +[OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) +avant l'analyse. D'autres fois, différentes valeurs nulles sont utilisées pour transmettre +différentes raisons pour lesquelles les données ne sont pas là. Il s'agit d'une +information importante à capturer, mais elle utilise en fait une seule colonne pour capturer +deux informations. Like for using formatting to convey information it would be good here to create a new column like 'data\_missing' and use that column to capture the different reasons. -Whatever the reason, it's a problem if unknown or missing data is -recorded as -999, 999, or 0. +Quelle que soit la raison, c'est un problème si des données inconnues ou manquantes sont +enregistrées comme -999, 999 ou 0. Many statistical programs will not recognise that these are intended -to represent missing (null) values. How these values are interpreted -will depend on the software you use to analyse your data. It is -essential to use a clearly defined and consistent null indicator. +to represent missing (null) values. La façon dont ces valeurs sont interprétées +dépendra du logiciel que vous utilisez pour analyser vos données. Il est +essentiel d’utiliser un indicateur de valeur nulle clairement défini et cohérent. -Blanks (most applications) and NA (for R) are good -choices. @White:2013 explain good choices for indicating null values -for different software applications in their article: +Les blancs (la plupart des applications) et NA (pour R) sont de bons +choix. 
@White:2013 expliquent les bons choix pour indiquer des valeurs nulles +pour différentes applications logicielles dans leur article : ![](fig/3_white_table_1.jpg) -### Using formatting to convey information {#formatting} +### Utiliser le formatage pour transmettre des informations {#formatting} -**Example**: highlighting cells, rows or columns that should be -excluded from an analysis, leaving blank rows to indicate -separations in data. +**Exemple** : mise en évidence des cellules, des lignes ou des colonnes qui doivent être +exclues d'une analyse, ou lignes laissées vides pour indiquer +des séparations dans les données. ![](fig/formatting.png) -**Solution**: create a new field to encode which data should be -excluded. +**Solution** : créez un nouveau champ pour indiquer quelles données doivent être +exclues. ![](fig/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting\_pretty} +### Utiliser le formatage pour rendre la feuille de données jolie {#formatting\_pretty} -**Example**: merging cells. +**Exemple** : fusion de cellules. -**Solution**: If you're not careful, formatting a worksheet to be more -aesthetically pleasing can compromise your computer's ability to see -associations in the data. Merged cells will make your data unreadable -by statistics software. Consider restructuring your data in such a way -that you will not need to merge cells to organise your data. +**Solution** : Si vous ne faites pas attention, le formatage d'une feuille de calcul pour qu'elle soit plus +esthétique peut compromettre la capacité de votre ordinateur à voir les +associations dans les données. Les cellules fusionnées rendront vos données illisibles +par les logiciels de statistiques. Pensez à restructurer vos données de telle manière +que vous n'aurez pas besoin de fusionner des cellules pour organiser vos données. 
-### Placing comments or units in cells {#units} +### Placer des commentaires ou des unités dans des cellules {#units} -Most analysis software can't see Excel or LibreOffice comments, and -would be confused by comments placed within your data cells. As -described above for formatting, create another field if you need to -add notes to cells. Similarly, don't include units in cells: ideally, -all the measurements you place in one column should be in the same -unit, but if for some reason they aren't, create another field and -specify the units the cell is in. +La plupart des logiciels d'analyse ne peuvent pas voir les commentaires Excel ou LibreOffice, et +serait dérouté par les commentaires placés dans vos cellules de données. Comme +décrit ci-dessus pour le formatage, créez un autre champ si vous devez +ajouter des notes aux cellules. De même, n'incluez pas d'unités dans les cellules : idéalement, +toutes les mesures que vous placez dans une colonne devraient être dans la même unité +, mais si pour une raison quelconque ce n'est pas le cas, créez un autre champ et +spécifient les unités dans lesquelles se trouve la cellule. -### Entering more than one piece of information in a cell {#info} +### Saisir plusieurs informations dans une cellule {#info} -**Example**: Recording ABO and Rhesus groups in one cell, such as A+, +**Exemple** : Enregistrement des groupes ABO et Rhésus dans une seule cellule, tels que A+, B+, A-, ... -**Solution**: Don't include more than one piece of information in a -cell. This will limit the ways in which you can analyse your data. If -you need both these measurements, design your data sheet to include -this information. For example, include one column for the ABO group and -one for the Rhesus group. - -### Using problematic field names {#field\_name} - -Choose descriptive field names, but be careful not to include spaces, -numbers, or special characters of any kind. 
Spaces can be -misinterpreted by parsers that use whitespace as delimiters and some -programs don't like field names that are text strings that start with -numbers. - -Underscores (`_`) are a good alternative to spaces. Consider writing -names in camel case (like this: ExampleFileName) to improve -readability. Remember that abbreviations that make sense at the moment -may not be so obvious in 6 months, but don't overdo it with names that -are excessively long. Including the units in the field names avoids -confusion and enables others to readily interpret your fields. - -**Examples** - -| Good Name | Good Alternative | Avoid | -| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | -| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | -| Precipitation\_mm | Precipitation | precmm | -| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell\_type | CellType | Cell Type | -| Observation\_01 | first\_observation | 1st Obs | - -### Using special characters in data {#special} - -**Example**: You treat your spreadsheet program as a word processor -when writing notes, for example copying data directly from Word or -other applications. - -**Solution**: This is a common strategy. For example, when writing -longer text in a cell, people often include line breaks, em-dashes, -etc. in their spreadsheet. Also, when copying data in from -applications such as Word, formatting and fancy non-standard -characters (such as left- and right-aligned quotation marks) are -included. When exporting this data into a coding/statistical -environment or into a relational database, dangerous things may occur, -such as lines being cut in half and encoding errors being thrown. - -General best practice is to avoid adding characters such as newlines, -tabs, and vertical tabs. 
In other words, treat a text cell as if it +**Solution** : N'incluez pas plus d'une information dans une cellule. +Cela limitera les façons dont vous pourrez analyser vos données. Si +vous avez besoin de ces deux mesures, concevez votre feuille de données pour inclure +ces informations. Par exemple, incluez une colonne pour le groupe ABO et +une pour le groupe Rhésus. + +### Utilisation de noms de champs problématiques {#field\_name} + +Choisissez des noms de champs descriptifs, mais veillez à ne pas inclure d'espaces, de chiffres +ou de caractères spéciaux de quelque nature que ce soit. Les espaces peuvent être +mal interprétés par les analyseurs qui utilisent des espaces comme délimiteurs et certains programmes +n'aiment pas les noms de champs qui sont des chaînes de texte commençant par +des nombres. + +Les traits de soulignement (`_`) sont une bonne alternative aux espaces. Pensez à écrire les noms +en casse chameau (comme ceci : ExampleFileName) pour améliorer la lisibilité. +N'oubliez pas que les abréviations qui ont un sens sur le moment +ne seront peut-être pas si évidentes dans 6 mois, mais n'en faites pas trop avec des noms qui +sont excessivement longs. L'inclusion des unités dans les noms de champs évite +toute confusion et permet aux autres d'interpréter facilement vos champs. + +**Exemples** + +| Bon nom | Bonne alternative | À éviter | +| ---------------------------------------------------------------------- | ------------------------------------------- | -------------------------------------------- | +| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | +| Precipitation\_mm | Precipitation | precmm | +| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | +| sex | sex | M/F | +| weight | weight | w. 
| +| cell\_type | CellType | Cell Type | +| Observation\_01 | first\_observation | 1st Obs 
| + +### Utilisation de caractères spéciaux dans les données {#special} + +**Exemple** : Vous traitez votre tableur comme un traitement de texte +lorsque vous rédigez des notes, par exemple en copiant des données directement depuis Word ou +d'autres applications. + +**Solution** : Il s'agit d'une stratégie courante. Par exemple, lorsqu'ils écrivent +un texte plus long dans une cellule, les utilisateurs incluent souvent des sauts de ligne, des tirets cadratins, +, etc. dans leur feuille de calcul. De plus, lors de la copie de données à partir d'applications +telles que Word, le formatage et les caractères +non standard (tels que les guillemets alignés à gauche et à droite) sont +inclus. Lors de l'exportation de ces données dans un environnement de codage/statistique +ou dans une base de données relationnelle, des choses dangereuses peuvent se produire, +comme des lignes coupées en deux et des erreurs d'encodage générées. + +La meilleure pratique générale consiste à éviter d'ajouter des caractères tels que des nouvelles lignes, des tabulations +et des tabulations verticales. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces. -### Inclusion of metadata in data table {#metadata} +### Inclusion de métadonnées dans le tableau de données {#metadata} -**Example**: You add a legend at the top or bottom of your data table -explaining column meaning, units, exceptions, etc. +**Exemple** : Vous ajoutez une légende en haut ou en bas de votre tableau de données +expliquant la signification des colonnes, les unités, les exceptions, etc. -**Solution**: Recording data about your data ("metadata") is -essential. You may be on intimate terms with your dataset while you +**Solution** : L'enregistrement des données sur vos données ("métadonnées") est +essentiel. 
You may be on intimate terms with your dataset while you are collecting and analysing it, but the chances that you will still remember that the variable "sglmemgp" means single member of group, for example, or the exact algorithm you used to transform a variable or create a derived one, after a few months, a year, or more are slim. -As well, there are many reasons other people may want to examine or -use your data - to understand your findings, to verify your findings, -to review your submitted publication, to replicate your results, to -design a similar study, or even to archive your data for access and -re-use by others. While digital data by definition are -machine-readable, understanding their meaning is a job for human -beings. The importance of documenting your data during the collection -and analysis phase of your research cannot be overestimated, -especially if your research is going to be part of the scholarly -record. - -However, metadata should not be contained in the data file -itself. Unlike a table in a paper or a supplemental file, metadata (in -the form of legends) should not be included in a data file since this -information is not data, and including it can disrupt how computer -programs interpret your data file. Rather, metadata should be stored -as a separate file in the same directory as your data file, preferably -in plain text format with a name that clearly associates it with your -data file. Because metadata files are free text format, they also +De plus, il existe de nombreuses raisons pour lesquelles d'autres personnes pourraient vouloir examiner ou +utiliser vos données : pour comprendre vos conclusions, pour vérifier vos conclusions, +pour examiner la publication que vous avez soumise, pour reproduire vos résultats, pour +concevoir une étude similaire, ou même archiver vos données pour y accéder et +réutiliser par d'autres. 
Bien que les données numériques soient par définition +lisibles par machine, comprendre leur signification est un travail pour les +êtres humains. L'importance de documenter vos données pendant la phase de collecte +et d'analyse de votre recherche ne peut être surestimée, +surtout si votre recherche doit faire partie du dossier +scientifique. + +Cependant, les métadonnées ne doivent pas être contenues dans le fichier de données +lui-même. Contrairement à un tableau dans un article ou un fichier supplémentaire, les métadonnées +(sous forme de légendes) ne doivent pas être incluses dans un fichier de données puisque ces +informations ne sont pas des données, et leur inclusion peut perturber la façon dont les programmes informatiques +interprètent votre fichier de données. Les métadonnées doivent plutôt être stockées +en tant que fichier distinct dans le même répertoire que votre fichier de données, de préférence +au format texte brut avec un nom qui l'associe clairement à votre +fichier de données. Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file. -Additionally, file or database level metadata describes how files that -make up the dataset relate to each other; what format they are in; and -whether they supercede or are superceded by previous files. A -folder-level readme.txt file is the classic way of accounting for all -the files and folders in a project. +De plus, les métadonnées au niveau du fichier ou de la base de données décrivent comment les fichiers qui +constituent l'ensemble de données sont liés les uns aux autres ; dans quel format ils se trouvent ; et +s'ils remplacent ou sont remplacés par les fichiers précédents. Un fichier readme.txt au niveau du dossier +est la manière classique de comptabiliser tous les +fichiers et dossiers d'un projet.
-(Text on metadata adapted from the online course Research Data -[MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, -University of Edinburgh. MANTRA is licensed under a Creative Commons +(Texte sur les métadonnées adapté du cours en ligne Research Data +[MANTRA](https://datalib.edina.ac.uk/mantra) par EDINA et Data Library, +Université d'Édimbourg. MANTRA est sous licence Creative Commons Attribution 4.0 International -License.) +Licence.) -## Exporting data +## Exporter des données **Question** -- How can we export data from spreadsheets in a way that is useful for - downstream applications? - -**Objectives** - -- Store spreadsheet data in universal file formats. -- Export data from a spreadsheet to a CSV file. - -**Keypoints** - -- Data stored in common spreadsheet formats will often not be read - correctly into data analysis software, introducing errors into your - data. - -- Exporting data from spreadsheets to formats like CSV or TSV puts it - in a format that can be used consistently by most programs. - -Storing the data you're going to work with for your analyses in Excel -default file format (`*.xls` or `*.xlsx` - depending on the Excel -version) isn't a good idea. Why? - -- Because it is a proprietary format, and it is possible that in the - future, technology won't exist (or will become sufficiently rare) to - make it inconvenient, if not impossible, to open the file. +- Comment pouvons-nous exporter des données à partir de feuilles de calcul d'une manière utile pour + les applications en aval ? -- Other spreadsheet software may not be able to open files saved in a - proprietary Excel format. - -- Different versions of Excel may handle data differently, leading to - inconsistencies. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) - is a well-documented example of inconsistencies in data storage. 
- -- Finally, more journals and grant agencies are requiring you to - deposit your data in a data repository, and most of them don't - accept Excel format. It needs to be in one of the formats discussed - below. - -- The above points also apply to other formats such as open data - formats used by LibreOffice / Open Office. These formats are not - static and do not get parsed the same way by different software - packages. - -Storing data in a universal, open, and static format will help deal -with this problem. Try tab-delimited (tab separated values or TSV) or -comma-delimited (comma separated values or CSV). CSV files are plain -text files where the columns are separated by commas, hence 'comma -separated values' or CSV. The advantage of a CSV file over an -Excel/SPSS/etc. file is that we can open and read a CSV file using -just about any software, including plain text editors like TextEdit or -NotePad. Data in a CSV file can also be easily imported into other -formats and environments, such as SQLite and R. We're not tied to a -certain version of a certain expensive program when we work with CSV -files, so it's a good format to work with for maximum portability and -endurance. Most spreadsheet programs can save to delimited text -formats like CSV easily, although they may give you a warning during -the file export. - -To save a file you have opened in Excel in CSV format: - -1. From the top menu select 'File' and 'Save as'. -2. In the 'Format' field, from the list, select 'Comma Separated - Values' (`*.csv`). -3. Double check the file name and the location where you want to save - it and hit 'Save'. +**Objectifs** -An important note for backwards compatibility: you can open CSV files -in Excel! +- Stockez les données des feuilles de calcul dans des formats de fichiers universels. +- Exportez les données d'une feuille de calcul vers un fichier CSV. 
+ +**Points clés** + +- Les données stockées dans des formats de feuilles de calcul courants ne seront souvent pas lues + correctement dans un logiciel d'analyse de données, introduisant des erreurs dans vos + données. + +- L'exportation de données à partir de feuilles de calcul vers des formats tels que CSV ou TSV les place + dans un format qui peut être utilisé de manière cohérente par la plupart des programmes. + +Le stockage des données avec lesquelles vous allez travailler pour vos analyses dans le format de fichier Excel +par défaut (`*.xls` ou `*.xlsx` - selon la version +d'Excel) n'est pas une bonne idée. Pourquoi ? + +- Parce qu'il s'agit d'un format propriétaire, et qu'il est possible que dans le + futur, la technologie n'existe pas (ou devienne suffisamment rare) pour + rendre l'ouverture du fichier peu pratique, voire impossible. + +- D'autres logiciels de tableur peuvent ne pas être en mesure d'ouvrir les fichiers enregistrés dans un format Excel + propriétaire. + +- Différentes versions d'Excel peuvent gérer les données différemment, entraînant + des incohérences. [Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + est un exemple bien documenté d'incohérences dans le stockage de données. + +- Enfin, de plus en plus de revues et d'organismes subventionnaires vous demandent de + déposer vos données dans un référentiel de données, et la plupart d'entre eux n'acceptent pas + le format Excel. Il doit être dans l’un des formats discutés + ci-dessous. + +- Les points ci-dessus s'appliquent également à d'autres formats tels que les formats de données ouverts + utilisés par LibreOffice / Open Office. Ces formats ne sont pas + statiques et ne sont pas analysés de la même manière par différents + packages logiciels. + +Le stockage des données dans un format universel, ouvert et statique aidera à résoudre +ce problème.
Essayez les valeurs délimitées par des tabulations (valeurs séparées par des tabulations ou TSV) ou +délimitées par des virgules (valeurs séparées par des virgules ou CSV). Les fichiers CSV sont des fichiers texte simples +où les colonnes sont séparées par des virgules, d'où « valeurs séparées par des virgules +» ou CSV. L'avantage d'un fichier CSV par rapport à un +Excel/SPSS/etc. est que nous pouvons ouvrir et lire un fichier CSV en utilisant +à peu près n'importe quel logiciel, y compris des éditeurs de texte brut comme TextEdit ou +NotePad. Les données d'un fichier CSV peuvent également être facilement importées dans d'autres +formats et environnements, tels que SQLite et R. Nous ne sommes pas liés à une +certaine version d'un certain programme coûteux lorsque nous travaillons avec CSV +fichiers, c'est donc un bon format avec lequel travailler pour une portabilité maximale et une +endurance. La plupart des tableurs peuvent facilement enregistrer au format texte délimité +comme CSV, bien qu'ils puissent vous avertir lors de +l'exportation du fichier. + +Pour enregistrer un fichier que vous avez ouvert dans Excel au format CSV : + +1. Dans le menu supérieur, sélectionnez « Fichier » et « Enregistrer sous ». +2. Dans le champ « Format », dans la liste, sélectionnez « Valeurs séparées par des virgules + » (`*.csv`). +3. Vérifiez le nom du fichier et l'emplacement où vous souhaitez l'enregistrer + et cliquez sur « Enregistrer ». + +Une remarque importante pour la rétrocompatibilité : vous pouvez ouvrir les fichiers CSV +dans Excel ! ```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/excel-to-csv.png") ``` -**A note on R and `xls`**: There are R packages that can read `xls` -files (as well as Google spreadsheets). It is even possible to access -different worksheets in the `xls` documents. 
+**Une note sur R et `xls`** : Il existe des packages R qui peuvent lire les fichiers `xls` +(ainsi que les feuilles de calcul Google). Il est même possible d'accéder à +différentes feuilles de calcul dans les documents `xls`. -**But** +**Mais** -- some of these only work on Windows. -- this equates to replacing a (simple but manual) export to `csv` with - additional complexity/dependencies in the data analysis R code. -- data formatting best practice still apply. -- Is there really a good reason why `csv` (or similar) is not - adequate? +- certains d'entre eux ne fonctionnent que sous Windows. +- cela équivaut à remplacer une exportation (simple mais manuelle) vers `csv` par + complexité/dépendances supplémentaires dans le code R d'analyse des données. +- Les meilleures pratiques en matière de formatage des données s’appliquent toujours. +- Y a-t-il vraiment une bonne raison pour laquelle `csv` (ou similaire) n'est pas + adéquat ? -### Caveats on commas +### Mises en garde concernant les virgules -In some datasets, the data values themselves may include commas -(,). In that case, the software which you use (including Excel) will -most likely incorrectly display the data in columns. This is because -the commas which are a part of the data values will be interpreted as -delimiters. +Dans certains ensembles de données, les valeurs des données elles-mêmes peuvent inclure des virgules +(,). Dans ce cas, le logiciel que vous utilisez (y compris Excel) +affichera très probablement de manière incorrecte les données en colonnes. En effet, +les virgules qui font partie des valeurs de données seront interprétées comme des délimiteurs +. 
-For example, our data might look like this: +Par exemple, nos données pourraient ressembler à ceci : ``` species_id,genus,species,taxa @@ -746,79 +746,79 @@ AS,Ammodramus,savannarum,Bird BA,Baiomys,taylori,Rodent ``` -In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the -value for `taxa` includes a comma (`Rodent, not censused`). If we try -to read the above into Excel (or other spreadsheet program), we will -get something like this: +Dans l'enregistrement `AH,Ammospermophilus,harrisi,Rodent, not censused`, la valeur +pour `taxa` comprend une virgule (`Rodent, not censused`). Si nous essayons +de lire ce qui précède dans Excel (ou un autre tableur), nous obtiendrons +quelque chose comme ceci : ```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} knitr::include_graphics("fig/csv-mistake.png") ``` -The value for `taxa` was split into two columns (instead of being put -in one column `D`). This can propagate to a number of further -errors. For example, the extra column will be interpreted as a column -with many missing values (and without a proper header). In addition to -that, the value in column `D` for the record in row 3 (so the one -where the value for 'taxa' contained the comma) is now incorrect. +La valeur de `taxa` a été divisée en deux colonnes (au lieu d'être placée +dans une seule colonne `D`). Cela peut se propager à un certain nombre d'autres +erreurs. Par exemple, la colonne supplémentaire sera interprétée comme une colonne +avec de nombreuses valeurs manquantes (et sans en-tête approprié). En plus de +cela, la valeur dans la colonne `D` pour l'enregistrement de la ligne 3 (donc celle +où la valeur de `taxa` contenait la virgule) est désormais incorrecte.
-If you want to store your data in `csv` format and expect that your -data values may contain commas, you can avoid the problem discussed -above by putting the values in quotes (""). Applying this rule, our -data might look like this: +Si vous souhaitez stocker vos données au format `csv` et vous attendez à ce que vos valeurs de données +contiennent des virgules, vous pouvez éviter le problème évoqué +ci-dessus en mettant les valeurs entre guillemets (""). En appliquant cette règle, nos données +pourraient ressembler à ceci : ``` -species_id,genus,species,taxa -"AB","Amphispiza","bilineata","Bird" -"AH","Ammospermophilus","harrisi","Rodent, not censused" -"AS","Ammodramus","savannarum","Bird" -"BA","Baiomys","taylori","Rodent" +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" ``` -Now opening this file as a `csv` in Excel will not lead to an extra -column, because Excel will only use commas that fall outside of -quotation marks as delimiting characters. - -Alternatively, if you are working with data that contains commas, you -likely will need to use another delimiter when working in a -spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and -working with TSV files. TSV files can be exported from spreadsheet -programs in the same way as CSV files. - -[^decsep]: This is particularly relevant in European - countries where the comma is used as a decimal - separator. In such cases, the default value separator in a - csv file will be the semi-colon (;), or values will be - systematically quoted. - -If you are working with an already existing dataset in which the data -values are not included in "" but which have commas as both delimiters -and parts of data values, you are potentially facing a major problem -with data cleaning.
If the dataset you're dealing with contains -hundreds or thousands of records, cleaning them up manually (by either -removing commas from the data values or putting the values into -quotes - "") is not only going to take hours and hours but may -potentially end up with you accidentally introducing many errors. - -Cleaning up datasets is one of the major problems in many scientific -disciplines. The approach almost always depends on the particular -context. However, it is a good practice to clean the data in an -automated fashion, for example by writing and running a script. The -Python and R lessons will give you the basis for developing skills to -build relevant scripts. - -## Summary +Désormais, l'ouverture de ce fichier en tant que `csv` dans Excel n'ajoutera pas de colonne +supplémentaire, car Excel n'utilisera que les virgules qui se trouvent en dehors des guillemets +comme caractères de délimitation. + +Alternativement, si vous travaillez avec des données contenant des virgules, vous devrez probablement +utiliser un autre délimiteur lorsque vous travaillerez dans une +feuille de calcul[^decsep]. Dans ce cas, pensez à utiliser des tabulations comme délimiteur et +à travailler avec des fichiers TSV. Les fichiers TSV peuvent être exportés à partir de tableurs +de la même manière que les fichiers CSV. + +[^decsep]: Ceci est particulièrement pertinent dans les pays européens + où la virgule est utilisée comme séparateur + décimal. Dans de tels cas, le séparateur de valeurs par défaut dans un fichier csv + sera le point-virgule (;), ou les valeurs seront systématiquement + entre guillemets. + +Si vous travaillez avec un ensemble de données déjà existant dans lequel les valeurs de données +ne sont pas incluses entre "" mais qui utilisent des virgules à la fois comme délimiteurs +et dans les valeurs de données, vous êtes potentiellement confronté à un problème majeur +de nettoyage des données.
Si l'ensemble de données que vous traitez contient +des centaines ou des milliers d'enregistrements, nettoyez-les manuellement (soit en +supprimant les virgules des valeurs de données, soit en mettant les valeurs entre +guillemets - "") non seulement va prendre des heures et des heures, mais peut +finir par vous amener à introduire accidentellement de nombreuses erreurs. + +Le nettoyage des ensembles de données est l’un des problèmes majeurs dans de nombreuses disciplines scientifiques +. L’approche dépend presque toujours du contexte +particulier. Cependant, il est recommandé de nettoyer les données de manière +automatisée, par exemple en écrivant et en exécutant un script. Les leçons +Python et R vous donneront les bases pour développer des compétences permettant de +créer des scripts pertinents. + +## Résumé ```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} knitr::include_graphics("fig/analysis.png") ``` -A typical data analysis workflow is illustrated in the figure above, -where data is repeatedly transformed, visualised, and modelled. This -iteration is repeated multiple times until the data is understood. In -many real-life cases, however, most time is spent cleaning up and -preparing the data, rather than actually analysing and understanding -it. +Un flux de travail typique d'analyse de données est illustré dans la figure ci-dessus, +où les données sont transformées, visualisées et modélisées à plusieurs reprises. Cette itération +est répétée plusieurs fois jusqu'à ce que les données soient comprises. Cependant, dans +de nombreux cas réels, la plupart du temps est consacré au nettoyage et +à la préparation des données, plutôt qu'à leur analyse et à leur compréhension +. An agile data analysis workflow, with several fast iterations of the transform/visualise/model cycle is only feasible if the data is @@ -827,6 +827,6 @@ without having to look at it and/or fix it. 
:::::::::::::::::::::::::::::::::::::::: keypoints -- Good data organization is the foundation of any research project. +- Une bonne organisation des données est la base de tout projet de recherche. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: From e7c6655e7028d076534121610c969ba7fc014dba Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:17 +0900 Subject: [PATCH 194/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 1178 +++++++++---------- 1 file changed, 589 insertions(+), 589 deletions(-) diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd index d702329b9..49797bf78 100644 --- a/locale/zh/episodes/10-data-organisation.Rmd +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -1,8 +1,8 @@ --- -source: Rmd -title: Data organisation with spreadsheets -teaching: 30 -exercises: 30 +source: Rmd +title: 使用电子表格组织数据 +teaching: 30 +exercises: 30 --- ```{r, include=FALSE} @@ -10,120 +10,120 @@ exercises: 30 ::::::::::::::::::::::::::::::::::::::: objectives -- Learn about spreadsheets, their strengths and weaknesses. -- How do we format data in spreadsheets for effective data use? -- Learn about common spreadsheet errors and how to correct them. -- Organise your data according to tidy data principles. -- Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats. +- 了解电子表格及其优点和缺点。 +- 我们该如何在电子表格中格式化数据以有效使用数据? +- 了解常见的电子表格错误以及如何纠正它们。 +- 根据整洁数据原则组织您的数据。 +- 了解基于文本的电子表格格式,例如逗号分隔 (CSV) 或制表符分隔 (TSV) 格式。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- How to organise tabular data? +- 如何组织表格数据? :::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson.
+> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 +> 可视化_课程。 -## Spreadsheet programs +## 电子表格程序 -**Question** +**问题** -- What are basic principles for using spreadsheets for good data - organization? +- 使用电子表格进行良好的数据 + 组织的基本原则是什么? -**Objective** +**客观的** - Describe best practices for organizing data so computers can make the best use of datasets. -**Keypoint** +**关键** -- Good data organization is the foundation of any research project. +- 良好的数据组织是任何研究项目的基础。 -Good data organization is the foundation of your research -project. Most researchers have data or do data entry in -spreadsheets. Spreadsheet programs are very useful graphical -interfaces for designing data tables and handling very basic data -quality control functions. See also @Broman:2018. +良好的数据组织是您的研究 +项目的基础。 大多数研究人员都有数据或在 +电子表格中输入数据。 电子表格程序是非常有用的图形 +界面,用于设计数据表和处理非常基本的数据 +质量控制功能。 另请参阅@Broman:2018。 -### Spreadsheet outline +### 电子表格大纲 -Spreadsheets are good for data entry. Therefore we have a lot of data -in spreadsheets. Much of your time as a researcher will be spent in -this 'data wrangling' stage. It's not the most fun, but it's -necessary. We'll teach you how to think about data organization and -some practices for more effective data wrangling. +电子表格适合数据输入。 因此,我们在电子表格中有大量数据 +。 作为研究人员,你的大部分时间将花在 +这个“数据整理”阶段。 这虽然不是最有趣的,但它是 +必要的。 我们将教您如何思考数据组织和 +一些更有效的数据整理实践。 -### What this lesson will not teach you +### 这堂课不会教你什么 -- How to do _statistics_ in a spreadsheet -- How to do _plotting_ in a spreadsheet -- How to _write code_ in spreadsheet programs +- 如何在电子表格中进行_统计_ +- 如何在电子表格中绘图 +- 如何在电子表格程序中“编写代码” -If you're looking to do this, a good reference is Head First +如果你想这样做,一个很好的参考是 O'Reilly 出版的 Head First Excel, -published by O'Reilly. +。 -### Why aren't we teaching data analysis in spreadsheets +### 为什么我们不在电子表格中教授数据分析 -- Data analysis in spreadsheets usually requires a lot of manual - work. If you want to change a parameter or run an analysis with a - new dataset, you usually have to redo everything by hand. 
(We do - know that you can create macros, but see the next point.) +- 电子表格中的数据分析通常需要大量的手动 + 工作。 如果您想更改参数或使用 + 新数据集运行分析,通常必须手动重做所有操作。 (我们确实 + 知道您可以创建宏,但请参阅下一点。) -- It is also difficult to track or reproduce statistical or plotting - analyses done in spreadsheet programs when you want to go back to - your work or someone asks for details of your analysis. +- 当您想要返回到 + 您的工作或有人询问您的分析细节时,追踪或重现电子表格程序中完成的统计或绘制 + 分析也很困难。 -Many spreadsheet programs are available. Since most participants -utilise Excel as their primary spreadsheet program, this lesson will -make use of Excel examples. A free spreadsheet program that can also -be used is LibreOffice. Commands may differ a bit between programs, -but the general idea is the same. +有许多电子表格程序可供使用。 由于大多数参与者 +使用 Excel 作为主要电子表格程序,本课将 +使用 Excel 示例。 A free spreadsheet program that can also +be used is LibreOffice. 程序之间的命令可能略有不同, +但总体思路是相同的。 Spreadsheet programs encompass a lot of the things we need to be able -to do as researchers. We can use them for: +to do as researchers. 我们可以用它们来做: -- Data entry -- Organizing data -- Subsetting and sorting data -- Statistics -- Plotting +- 数据输入 +- 组织数据 +- 数据子集和排序 +- 统计数据 +- 绘图 -Spreadsheet programs use tables to represent and display data. Data -formatted as tables is also the main theme of this chapter, and we -will see how to organise data into tables in a standardised way to -ensure efficient downstream analysis. +电子表格程序使用表格来表示和显示数据。 格式化为表格的数据 +也是本章的主题,我们 +将看到如何以标准化的方式将数据组织成表格,以 +确保高效的下游分析。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Discuss the following points with your neighbour +## 挑战:与邻居讨论以下几点 -- Have you used spreadsheets, in your research, courses, - or at home? -- What kind of operations do you do in spreadsheets? -- Which ones do you think spreadsheets are good for? -- Have you accidentally done something in a spreadsheet program that made you - frustrated or sad? +- 您在研究、课程、 + 或在家里使用过电子表格吗? +- 您在电子表格中进行哪些类型的操作? +- 您认为电子表格适合用于哪些方面? 
+- 您是否曾在电子表格程序中意外地做了一些令您 + 感到沮丧或悲伤的事情? :::::::::::::::::::::::::::::::::::::::::::::::::: -### Problems with spreadsheets +### 电子表格问题 -Spreadsheets are good for data entry, but in reality we tend to -use spreadsheet programs for much more than data entry. We use them -to create data tables for publications, to generate summary -statistics, and make figures. +电子表格适合于数据输入,但实际上我们倾向于 +使用电子表格程序进行更多数据输入以外的用途。 我们使用它们 +来创建出版物的数据表,生成摘要 +统计数据,并制作图表。 -Generating tables for publications in a spreadsheet is not -optimal - often, when formatting a data table for publication, we're -reporting key summary statistics in a way that is not really meant to -be read as data, and often involves special formatting -(merging cells, creating borders, making it pretty). We advise you to -do this sort of operation within your document editing software. +在电子表格中生成用于出版的表格并不是 +最佳选择——通常,在格式化用于出版的数据表时,我们 +以一种实际上并不打算 +被读取为数据的方式报告关键汇总统计数据,并且通常涉及特殊格式 +(合并单元格、创建边框、使其美观)。 我们建议您 +在文档编辑软件中执行此类操作。 The latter two applications, generating statistics and figures, should be used with caution: because of the graphical, drag and drop nature of @@ -131,168 +131,168 @@ spreadsheet programs, it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else's), particularly if your stats or figures require you to do more complex calculations. Furthermore, in doing calculations in a spreadsheet, it's easy to accidentally apply a -slightly different formula to multiple adjacent cells. When using a -command-line based statistics program like R or SAS, it's practically -impossible to apply a calculation to one observation in your -dataset but not another unless you're doing it on purpose. +slightly different formula to multiple adjacent cells. 
当使用基于 +命令行的统计程序(例如 R 或 SAS)时,除非您有意为之,否则几乎不可能 +将计算应用于 +数据集中的一个观察结果,而不应用于另一个观察结果。 -### Using spreadsheets for data entry and cleaning +### 使用电子表格进行数据输入和清理 -In this lesson, we will assume that you are most likely using Excel as -your primary spreadsheet program - there are others (gnumeric, Calc -from OpenOffice), and their functionality is similar, but Excel seems -to be the program most used by biologists and biomedical researchers. +在本课中,我们假设您很可能使用 Excel 作为 +您主要的电子表格程序 - 还有其他程序(gnumeric、OpenOffice 的 Calc +),它们的功能类似,但 Excel 似乎 +是生物学家和生物医学研究人员最常用的程序。 -In this lesson we're going to talk about: +在本课中我们将讨论: -1. Formatting data tables in spreadsheets -2. Formatting problems -3. Exporting data +1. 格式化电子表格中的数据表 +2. 格式问题 +3. 导出数据 -## Formatting data tables in spreadsheets +## 格式化电子表格中的数据表 -**Questions** +**问题** -- How do we format data in spreadsheets for effective data use? +- 我们该如何在电子表格中格式化数据以有效使用数据? -**Objectives** +**目标** -- Describe best practices for data entry and formatting in - spreadsheets. +- 描述在 + 电子表格中输入和格式化数据的最佳实践。 -- Apply best practices to arrange variables and observations in a - spreadsheet. +- 应用最佳实践在 + 电子表格中排列变量和观察结果。 -**Keypoints** +**关键点** -- Never modify your raw data. Always make a copy before making any - changes. +- 切勿修改原始数据。 在进行任何 + 更改之前,务必先进行复制。 -- Keep track of all of the steps you take to clean your data in a - plain text file. +- 在 + 纯文本文件中跟踪您清理数据所采取的所有步骤。 -- Organise your data according to tidy data principles. +- 根据整洁数据原则组织您的数据。 -The most common mistake made is treating spreadsheet programs like lab -notebooks, that is, relying on context, notes in the margin, spatial -layout of data and fields to convey information. As humans, we can -(usually) interpret these things, but computers don't view information -the same way, and unless we explain to the computer what every single -thing means (and that can be hard!), it will not be able to see how -our data fits together. 
+最常见的错误是将电子表格程序视为实验室 +笔记本,即依赖上下文、边缘注释、数据和字段的空间 +布局来传达信息。 作为人类,我们可以 +(通常)解释这些事物,但计算机不会以相同的方式查看信息, +并且除非我们向计算机解释每个 +事物的含义(这可能很难!),否则它将无法看到 +我们的数据是如何组合在一起的。 Using the power of computers, we can manage and analyse data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal). -This is why it's extremely important to set up well-formatted tables -from the outset - before you even start entering data from your very -first preliminary experiment. Data organization is the foundation of -your research project. It can make it easier or harder to work with -your data throughout your analysis, so it's worth thinking about when -you're doing your data entry or setting up your experiment. You can -set things up in different ways in spreadsheets, but some of these -choices can limit your ability to work with the data in other programs -or have the you-of-6-months-from-now or your collaborator work with -the data. - -**Note:** the best layouts/formats (as well as software and -interfaces) for data entry and data analysis might be different. It is +这就是为什么从一开始就设置格式良好的表格 +非常重要 - 甚至在您开始输入您的 +第一次初步实验的数据之前。 数据组织是 +研究项目的基础。 它可以使您在整个分析过程中处理 +数据变得更容易或更难,因此在 +进行数据输入或设置实验时值得考虑。 您可以在电子表格中以不同的方式进行 +设置,但其中一些 +选择可能会限制您处理其他程序中数据的能力, +或者限制 6 个月后的您或您的合作者处理 +数据。 + +**注意:**数据输入和数据分析的最佳布局/格式(以及软件和 +界面)可能不同。 It is important to take this into account, and ideally automate the conversion from one to another. -### Keeping track of your analyses +### 跟踪你的分析 -When you're working with spreadsheets, during data clean up or -analyses, it's very easy to end up with a spreadsheet that looks very -different from the one you started with.
In order to be able to -reproduce your analyses or figure out what you did when a reviewer or -instructor asks for a different analysis, you should +当您使用电子表格时,在数据清理或 +分析期间,很容易得到与您开始时非常 +不同的电子表格。 为了能够 +重现你的分析,或者在审稿人或 +导师要求进行不同的分析时弄清楚你做了什么,你应该 -- create a new file with your cleaned or analysed data. Don't modify - the original dataset, or you will never know where you started! +- 使用您清理或分析过的数据创建一个新文件。 不要修改 + 原始数据集,否则您将永远不知道从哪里开始! -- keep track of the steps you took in your clean up or analysis. You - should track these steps as you would any step in an experiment. We - recommend that you do this in a plain text file stored in the same - folder as the data file. +- 跟踪您在清理或分析中所采取的步骤。 您 + 应该像追踪实验中的任何步骤一样追踪这些步骤。 我们 + 建议您在与数据文件存储在同一个 + 文件夹中的纯文本文件中执行此操作。 -This might be an example of a spreadsheet setup: +这可能是电子表格设置的一个示例: -![](fig/spreadsheet-setup-updated.png) +![](fig/spreadsheet-setup-updated.png) -Put these principles in to practice today during your exercises. +今天在练习中将这些原则付诸实践。 -While versioning is out of scope for this course, you can look at the -Carpentries lesson on -['Git'](https://swcarpentry.github.io/git-novice/) to learn how to -maintain **version control** over your data. See also this blog -post for a quick tutorial or -@Perez-Riverol:2016 for a more research-oriented use-case. +虽然版本控制超出了本课程的范围,但您可以查看 +Carpentries 课程中关于 +['Git'](https://swcarpentry.github.io/git-novice/) 的内容,了解如何 +对数据进行**版本控制**。 另请参阅此博客 +帖子了解快速教程,或参阅 +@Perez-Riverol:2016 了解更面向研究的用例。 -### Structuring data in spreadsheets +### 在电子表格中构建数据 -The cardinal rules of using spreadsheet programs for data: +使用电子表格程序处理数据的基本规则: -1. Put all your variables in columns - the thing you're measuring, - like 'weight' or 'temperature'. -2. Put each observation in its own row. -3. Don't combine multiple pieces of information in one cell. Sometimes - it just seems like one thing, but think if that's the only way - you'll want to be able to use or sort that data. -4. Leave the raw data raw - don't change it! -5.
Export the cleaned data to a text-based format like CSV - (comma-separated values) format. This ensures that anyone can use - the data, and is required by most data repositories. +1. 将所有变量放在列中 - 您要测量的东西, + 例如“重量”或“温度”。 +2. 将每个观察结果放在其自己的行中。 +3. 不要在一个单元格中合并多条信息。 有时 + 看起来只是一件事,但想想如果这是唯一的方法 + 你会希望能够使用或排序这些数据。 +4. 保留原始数据 - 不要更改它! +5. 将清理后的数据导出为基于文本的格式,如 CSV + (逗号分隔值)格式。 这确保任何人都可以使用 + 数据,并且是大多数数据存储库所要求的。 -For instance, we have data from patients that visited several -hospitals in Brussels, Belgium. They recorded the date of the visit, +例如,我们拥有来自比利时布鲁塞尔几家 +家医院的患者的数据。 They recorded the date of the visit, the hospital, the patients' gender, weight and blood group. -If we were to keep track of the data like this: +如果我们像这样跟踪数据: -![](fig/multiple-info.png) +![](图/multiple-info.png) -the problem is that the ABO and Rhesus groups are in the same `Blood` -type column. So, if they wanted to look at all observations of the A -group or look at weight distributions by ABO group, it would be tricky -to do this using this data setup. If instead we put the ABO and Rhesus -groups in different columns, you can see that it would be much easier. +问题在于 ABO 和 Rh 血型位于同一个“血液” +类型列中。 因此,如果他们想要查看 A +组的所有观察结果或查看 ABO 组的体重分布,使用此数据设置执行此操作会很棘手 +。 如果我们将 ABO 和 Rh 血型 +组放在不同的列中,您会发现这会容易得多。 -![](fig/single-info.png) +![](图/单个信息.png) -An important rule when setting up a datasheet, is that **columns are -used for variables** and **rows are used for observations**: +设置数据表时的一个重要规则是**列 +用于变量**并且**行用于观察**: -- columns are variables -- rows are observations -- cells are individual values +- 列是变量 +- 行是观察值 +- 单元格是单独的值 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: We're going to take a messy dataset and describe how we would clean it up. +## 挑战:我们将获取一个混乱的数据集并描述如何清理它。 -1. Download a messy dataset by clicking - [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). +1. 点击 + [此处](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx) 下载混乱的数据集。 -2. 
Open up the data in a spreadsheet program. +2. 在电子表格程序中打开数据。 -3. You can see that there are two tabs. The data contains various +3. 您可以看到有两个选项卡。 The data contains various clinical variables recorded in various hospitals in Brussels during - the first and second COVID-19 waves in 2020. As you can see, the - data have been recorded differently during the March and November - waves. Now you're the person in charge of this project and you want - to be able to start analyzing the data. + the first and second COVID-19 waves in 2020. 如您所见,在 3 月和 11 月的 + 波期间, + 数据的记录方式有所不同。 现在您是该项目的负责人,并且您希望 + 能够开始分析数据。 -4. With the person next to you, identify what is wrong with this - spreadsheet. Also discuss the steps you would need to take to clean - up first and second wave tabs, and to put them all together in one - spreadsheet. +4. 与您旁边的人一起,找出这个 + 电子表格中存在的问题。 还要讨论清理 + 第一和第二波标签所需采取的步骤,并将它们全部放在一个 + 电子表格中。 -**Important:** Do not forget our first piece of advice: to create a -new file (or tab) for the cleaned data, never modify your original -(raw) data. +\*\*重要:\*\*不要忘记我们的第一条建议:为清理后的数据创建一个 +新文件(或选项卡),切勿修改原始 +(原始)数据。 :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -319,45 +319,45 @@ wrong with this data and how you would fix it. ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: Once you have tidied up the data, answer the following questions: +## 挑战:整理好数据后,回答以下问题: -- How many men and women took part in the study? -- How many A, AB, and B types have been tested? -- As above, but disregarding the contaminated samples? -- How many Rhesus + and - have been tested? -- How many universal donors (O-) have been tested? -- What is the average weight of AB men? -- How many samples have been tested in the different hospitals? +- 有多少男性和女性参加了这项研究? +- 已检测出 A 型、AB 型、B 型各有多少个? +- 如上所述,但忽略受污染的样本? +- 已检测了多少 Rh 阳性和 Rh 阴性患者? +- 有多少位通用捐献者(O-)接受了检测? +- AB型男人的平均体重是多少? +- 不同医院检测了多少样本? 
 ::::::::::::::::::::::::::::::::::::::::::::::::::

-An **excellent reference**, in particular with regard to R scripting
-is the _Tidy Data_ paper @Wickham:2014.
+一个**优秀的参考资料**,特别是关于 R 脚本
+的是 _Tidy Data_ 论文 @Wickham:2014。

-## Common spreadsheet errors
+## 常见的电子表格错误

-**Questions**
+**问题**

-- What are some common challenges with formatting data in spreadsheets
-  and how can we avoid them?
+- 在电子表格中格式化数据时会遇到哪些常见挑战?
+  我们如何避免它们?

-**Objectives**
+**目标**

-- Recognise and resolve common spreadsheet formatting problems.
+- 识别并解决常见的电子表格格式问题。

-**Keypoints**
+**关键点**

-- Avoid using multiple tables within one spreadsheet.
-- Avoid spreading data across multiple tabs.
-- Record zeros as zeros.
-- Use an appropriate null value to record missing data.
-- Don't use formatting to convey information or to make your spreadsheet look pretty.
-- Place comments in a separate column.
-- Record units in column headers.
-- Include only one piece of information in a cell.
-- Avoid spaces, numbers and special characters in column headers.
-- Avoid special characters in your data.
-- Record metadata in a separate plain text file.
+- 避免在一个电子表格中使用多个表格。
+- 避免将数据分散到多个选项卡中。
+- 将零记录为零。
+- 使用适当的空值来记录缺失的数据。
+- 不要使用格式来传达信息或让你的电子表格看起来漂亮。
+- 将评论放在单独的栏中。
+- 在列标题中记录单位。
+- 每个单元格中仅包含一条信息。
+- 避免在列标题中使用空格、数字和特殊字符。
+- 避免在数据中包含特殊字符。
+- 在单独的纯文本文件中记录元数据。

 <!-- This lesson is meant to be used as a reference for discussion as -->
@@ -367,466 +367,466 @@ is the _Tidy Data_ paper @Wickham:2014.
 <!-- refer to responses to the exercise in the previous lesson. -->

-There are a few potential errors to be on the lookout for in your own
-data as well as data from collaborators or the Internet. If you are
-aware of the errors and the possible negative effect on downstream
-data analysis and result interpretation, it might motivate yourself
-and your project members to try and avoid them.
Making small changes -to the way you format your data in spreadsheets, can have a great -impact on efficiency and reliability when it comes to data cleaning -and analysis. - -- [Using multiple tables](#tables) -- [Using multiple tabs](#tabs) -- [Not filling in zeros](#zeros) -- [Using problematic null values](#null) -- [Using formatting to convey information](#formatting) -- [Using formatting to make the data sheet look pretty](#formatting_pretty) -- [Placing comments or units in cells](#units) -- [Entering more than one piece of information in a cell](#info) -- [Using problematic field names](#field_name) -- [Using special characters in data](#special) -- [Inclusion of metadata in data table](#metadata) - -### Using multiple tables {#tables} - -A common strategy is creating multiple data tables within one -spreadsheet. This confuses the computer, so don't do this! When you -create multiple tables within one spreadsheet, you're drawing false -associations between things for the computer, which sees each row as -an observation. You're also potentially using the same field name in -multiple places, which will make it harder to clean your data up into -a usable form. The example below depicts the problem: - -![](fig/2_datasheet_example.jpg) - -In the example above, the computer will see (for example) row 4 and -assume that all columns A-AF refer to the same sample. This row -actually represents four distinct samples (sample 1 for each of four -different collection dates - May 29th, June 12th, June 19th, and June -26th), as well as some calculated summary statistics (an average (avr) -and standard error of measurement (SEM)) for two of those -samples. Other rows are similarly problematic. - -### Using multiple tabs {#tabs} - -But what about workbook tabs? That seems like an easy way to organise -data, right? Well, yes and no. 
When you create extra tabs, you fail to -allow the computer to see connections in the data that are there (you -have to introduce spreadsheet application-specific functions or -scripting to ensure this connection). Say, for instance, you make a -separate tab for each day you take a measurement. - -This isn't good practice for two reasons: - -1. you are more likely to accidentally add inconsistencies to your - data if each time you take a measurement, you start recording data - in a new tab, and - -2. even if you manage to prevent all inconsistencies from creeping in, - you will add an extra step for yourself before you analyse the data - because you will have to combine these data into a single - datatable. You will have to explicitly tell the computer how to - combine tabs - and if the tabs are inconsistently formatted, you - might even have to do it manually. - -The next time you're entering data, and you go to create another tab -or table, ask yourself if you could avoid adding this tab by adding -another column to your original spreadsheet. We used multiple tabs in -our example of a messy data file, but now you've seen how you can -reorganise your data to consolidate across tabs. - -Your data sheet might get very long over the course of the -experiment. This makes it harder to enter data if you can't see your -headers at the top of the spreadsheet. But don't repeat your header -row. These can easily get mixed into the data, leading to problems -down the road. Instead you can freeze the column -headers -so that they remain visible even when you have a spreadsheet with many -rows. - -### Not filling in zeros {#zeros} - -It might be that when you're measuring something, it's usually a zero, -say the number of times a rabbit is observed in the survey. Why bother -writing in the number zero in that column, when it's mostly zeros? - -However, there's a difference between a zero and a blank cell in a -spreadsheet. To the computer, a zero is actually data. 
You measured or -counted it. A blank cell means that it wasn't measured and the -computer will interpret it as an unknown value (also known as a null -or missing value). - -The spreadsheets or statistical programs will likely misinterpret -blank cells that you intend to be zeros. By not entering the value of -your observation, you are telling your computer to represent that data -as unknown or missing (null). This can cause problems with subsequent -calculations or analyses. For example, the average of a set of numbers -which includes a single null value is always null (because the -computer can't guess the value of the missing observations). Because -of this, it's very important to record zeros as zeros and truly -missing data as nulls. - -### Using problematic null values {#null} - -**Example**: using -999 or other numerical values (or zero) to -represent missing data. - -**Solutions**: - -There are a few reasons why null values get represented differently -within a dataset. Sometimes confusing null values are automatically -recorded from the measuring device. If that's the case, there's not +在您自己的 +数据以及来自合作者或互联网的数据中,有一些潜在的错误需要注意。 如果您 +意识到错误以及对下游 +数据分析和结果解释可能产生的负面影响,它可能会激励您自己 +和您的项目成员尝试避免这些错误。 对电子表格中数据格式的方式进行一些小的改变 +,可以在数据清理 +和分析时对效率和可靠性产生很大的 +影响。 + +- [使用多个表格](#tables) +- [使用多个标签](#tabs) +- [不填零](#zeros) +- [使用有问题的空值](#null) +- [使用格式传达信息](#formatting) +- [使用格式让数据表看起来更美观](#formatting_pretty) +- [在单元格中放置注释或单位](#units) +- [在一个单元格中输入多条信息](#info) +- [使用有问题的字段名称](#field_name) +- [在数据中使用特殊字符](#special) +- [在数据表中包含元数据](#metadata) + +### 使用多个表 {#tables} + +一种常见的策略是在一个 +电子表格内创建多个数据表。 这会让计算机感到困惑,所以不要这样做! 当您 +在一个电子表格中创建多个表格时,您会在计算机中绘制事物之间的错误 +关联,计算机会将每一行视为 +一个观察结果。 您还可能会在 +多个地方使用相同的字段名称,这将使您更难将数据清理为 +可用形式。 下面的例子描述了这个问题: + +![](图/2_数据表_示例.jpg) + +在上面的例子中,计算机将看到(例如)第 4 行和 +假设所有 A-AF 列都指的是同一个样本。 此行 +实际上代表四个不同的样本(四个 +不同收集日期中的每个日期都有一个样本 1 - 5 月 29 日、6 月 12 日、6 月 19 日和 6 月 +26 日),以及一些计算出的汇总统计数据(平均值 (avr) +和标准测量误差 (SEM))对于其中两个 +样本。 其他行也存在类似问题。 + +### 使用多个标签 {#tabs} + +但是工作簿标签怎么办? 
这似乎是组织 +数据的一种简单方法,对吗? 嗯,是也不是。 当您创建额外的选项卡时,您无法 +允许计算机查看数据中的连接(您 +必须引入电子表格应用程序特定的函数或 +脚本来确保这种连接)。 举例来说,你为每天进行的测量创建一个 +单独的标签。 + +这不是一个好的做法,原因有二: + +1. 如果你每次测量时都在新标签页中开始记录数据 + ,那么你很可能会意外地在你的 + 数据中添加不一致的内容,并且 + +2. 即使你设法防止所有不一致性出现, + 你也需要在分析数据 + 之前为自己添加一个额外的步骤,因为你必须将这些数据合并到单个 + 数据表中。 您必须明确地告诉计算机如何 + 组合标签 - 如果标签格式不一致,您 + 甚至可能必须手动执行此操作。 + +下次输入数据并创建另一个选项卡 +或表格时,问问自己是否可以通过在原始电子表格中添加 +另一列来避免添加此选项卡。 在 +混乱数据文件示例中,我们使用了多个选项卡,但现在您已经了解了如何 +重新组织数据以跨选项卡合并。 + +在 +实验过程中,您的数据表可能会变得很长。 如果您在电子表格顶部看不到 +标题,这将使输入数据变得更加困难。 但不要重复你的标题 +行。 这些很容易混入数据中,从而导致以后出现问题 +。 相反,您可以冻结列 +标题 +,这样即使您的电子表格包含许多 +行,它们仍然可见。 + +### 不填零 {#zeros} + +当您测量某个东西时,它通常为零, +表示在调查中观察到兔子的次数。 既然该列大部分都是零,为什么还要费心 +在该列中写入数字零? + +但是,在 +电子表格中,零和空白单元格之间存在差异。 对于计算机来说,零实际上是数据。 您测量了或者 +计算了它。 空白单元格表示未经测量,并且 +计算机将其解释为未知值(也称为空 +或缺失值)。 + +电子表格或统计程序可能会误解您希望为零的 +空白单元格。 通过不输入观察值 +,您就是在告诉计算机将数据 +表示为未知或缺失(空)。 这可能会导致后续 +计算或分析出现问题。 例如,包含单个空值的一组数字 +的平均值始终为空(因为 +计算机无法猜测缺失观测值的值)。 由于 +这个原因,将零记录为零以及将真正 +缺失数据记录为空值非常重要。 + +### 使用有问题的空值 {#null} + +**示例**:使用 -999 或其他数值(或零)至 +表示缺失数据。 + +**解决方案**: + +有几个原因导致数据集中的空值以不同方式表示 +。 有时,测量设备会自动记录令人困惑的空值 +。 If that's the case, there's not much you can do, but it can be addressed in data cleaning with a tool like [OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) -before analysis. Other times different null values are used to convey -different reasons why the data isn't there. This is important -information to capture, but is in effect using one column to capture -two pieces of information. Like for using formatting to convey -information it would be good here to create a new -column like 'data\_missing' and use that column to capture the -different reasons. - -Whatever the reason, it's a problem if unknown or missing data is -recorded as -999, 999, or 0. - -Many statistical programs will not recognise that these are intended -to represent missing (null) values. How these values are interpreted -will depend on the software you use to analyse your data. It is +before analysis. 
其他时候,不同的空值用于传达 +数据不存在的不同原因。 这是需要捕获的重要 +信息,但实际上是使用一列来捕获 +两条信息。 就像 使用格式来传达 +信息 一样,在这里创建一个新的 +列(如 'data\_missing')并使用该列来捕获 +不同的原因会很好。 + +无论原因是什么,如果未知或缺失的数据 +记录为 -999、999 或 0,那就有问题了。 + +许多统计程序不会认识到这些旨在用 +来表示缺失(空)值。 如何解释这些值 +将取决于您用来分析数据的软件。 It is essential to use a clearly defined and consistent null indicator. -Blanks (most applications) and NA (for R) are good -choices. @White:2013 explain good choices for indicating null values -for different software applications in their article: +空白(大多数应用)和 NA(对于 R)是良好的 +选择。 @White:2013 在他们的文章中解释了为不同的软件应用程序指示空值 +的良好选择: -![](fig/3_white_table_1.jpg) +![](图/3_white_table_1.jpg) -### Using formatting to convey information {#formatting} +### 使用格式传达信息 {#formatting} -**Example**: highlighting cells, rows or columns that should be -excluded from an analysis, leaving blank rows to indicate -separations in data. +**示例**:突出显示应从分析中排除的 +单元格、行或列,留下空白行以指示数据中的 +分离。 -![](fig/formatting.png) +![](图/格式化.png) -**Solution**: create a new field to encode which data should be -excluded. +**解决方案**:创建一个新字段来编码哪些数据应该被 +排除。 -![](fig/good_formatting.png) +![](图/good_formatting.png) -### Using formatting to make the data sheet look pretty {#formatting\_pretty} +### 使用格式化使数据表看起来更漂亮{#formatting\_pretty} -**Example**: merging cells. +**示例**:合并单元格。 -**Solution**: If you're not careful, formatting a worksheet to be more -aesthetically pleasing can compromise your computer's ability to see -associations in the data. Merged cells will make your data unreadable -by statistics software. Consider restructuring your data in such a way -that you will not need to merge cells to organise your data. +**解决方案**:如果您不小心,将工作表格式化为更美观的 +可能会影响您的计算机查看数据中的 +关联的能力。 合并的单元格将导致统计软件无法读取您的数据 +。 考虑以这样的方式重构您的数据 +,这样您就不需要合并单元格来组织您的数据。 -### Placing comments or units in cells {#units} +### 将注释或单元放置在单元格中 {#units} -Most analysis software can't see Excel or LibreOffice comments, and -would be confused by comments placed within your data cells. 
As -described above for formatting, create another field if you need to -add notes to cells. Similarly, don't include units in cells: ideally, -all the measurements you place in one column should be in the same -unit, but if for some reason they aren't, create another field and -specify the units the cell is in. +大多数分析软件无法看到 Excel 或 LibreOffice 注释,并且 +会对数据单元内的注释感到困惑。 正如 +上面描述的格式化一样,如果您需要 +向单元格添加注释,请创建另一个字段。 类似地,不要在单元格中包含单位:理想情况下, +放在一列中的所有测量值都应该在同一个 +单位中,但如果由于某种原因它们不在,请创建另一个字段并 +指定单元格所在的单位。 -### Entering more than one piece of information in a cell {#info} +### 在一个单元格中输入多条信息 {#info} -**Example**: Recording ABO and Rhesus groups in one cell, such as A+, -B+, A-, ... +**示例**:在一个单元格中记录 ABO 和 Rh 血型,例如 A+、 +B+、A-、... -**Solution**: Don't include more than one piece of information in a -cell. This will limit the ways in which you can analyse your data. If -you need both these measurements, design your data sheet to include -this information. For example, include one column for the ABO group and +**解决方案**:不要在 +单元格中包含多条信息。 这将限制您分析数据的方式。 如果 +您需要这两种测量值,请设计您的数据表以包含 +此信息。 For example, include one column for the ABO group and one for the Rhesus group. -### Using problematic field names {#field\_name} - -Choose descriptive field names, but be careful not to include spaces, -numbers, or special characters of any kind. Spaces can be -misinterpreted by parsers that use whitespace as delimiters and some -programs don't like field names that are text strings that start with -numbers. - -Underscores (`_`) are a good alternative to spaces. Consider writing -names in camel case (like this: ExampleFileName) to improve -readability. Remember that abbreviations that make sense at the moment -may not be so obvious in 6 months, but don't overdo it with names that -are excessively long. Including the units in the field names avoids -confusion and enables others to readily interpret your fields. 
- -**Examples** - -| Good Name | Good Alternative | Avoid | -| -------------------------------------------------------------- | ---------------------------------------- | ------------------------------------ | -| Max\_temp\_C | MaxTemp | Maximum Temp (°C) | -| Precipitation\_mm | Precipitation | precmm | -| Mean\_year\_growth | MeanYearGrowth | Mean growth/year | -| sex | sex | M/F | -| weight | weight | w. | -| cell\_type | CellType | Cell Type | -| Observation\_01 | first\_observation | 1st Obs | - -### Using special characters in data {#special} - -**Example**: You treat your spreadsheet program as a word processor -when writing notes, for example copying data directly from Word or -other applications. - -**Solution**: This is a common strategy. For example, when writing -longer text in a cell, people often include line breaks, em-dashes, -etc. in their spreadsheet. Also, when copying data in from -applications such as Word, formatting and fancy non-standard -characters (such as left- and right-aligned quotation marks) are -included. When exporting this data into a coding/statistical -environment or into a relational database, dangerous things may occur, -such as lines being cut in half and encoding errors being thrown. - -General best practice is to avoid adding characters such as newlines, -tabs, and vertical tabs. In other words, treat a text cell as if it -were a simple web form that can only contain text and spaces. - -### Inclusion of metadata in data table {#metadata} - -**Example**: You add a legend at the top or bottom of your data table -explaining column meaning, units, exceptions, etc. - -**Solution**: Recording data about your data ("metadata") is -essential. 
You may be on intimate terms with your dataset while you -are collecting and analysing it, but the chances that you will still -remember that the variable "sglmemgp" means single member of group, -for example, or the exact algorithm you used to transform a variable -or create a derived one, after a few months, a year, or more are slim. - -As well, there are many reasons other people may want to examine or -use your data - to understand your findings, to verify your findings, -to review your submitted publication, to replicate your results, to -design a similar study, or even to archive your data for access and -re-use by others. While digital data by definition are -machine-readable, understanding their meaning is a job for human -beings. The importance of documenting your data during the collection -and analysis phase of your research cannot be overestimated, -especially if your research is going to be part of the scholarly -record. - -However, metadata should not be contained in the data file -itself. Unlike a table in a paper or a supplemental file, metadata (in -the form of legends) should not be included in a data file since this -information is not data, and including it can disrupt how computer -programs interpret your data file. Rather, metadata should be stored -as a separate file in the same directory as your data file, preferably -in plain text format with a name that clearly associates it with your -data file. 
Because metadata files are free text format, they also +### 使用有问题的字段名称 {#field\_name} + +选择描述性的字段名称,但注意不要包含空格、 +数字或任何类型的特殊字符。 使用空格作为分隔符的解析器可能会误解空格 +,并且某些 +程序不喜欢以 +数字开头的文本字符串作为字段名称。 + +下划线(`_`)是空格的良好替代品。 考虑以驼峰式命名法书写 +名称(像这样:ExampleFileName)以提高 +的可读性。 请记住,目前有意义的缩写 +可能在 6 个月后就不那么明显了,但不要使用过长的名称 +。 在字段名称中包含单位可避免 +混淆,并使其他人能够轻松解释您的字段。 + +**例子** + +| 好名字 | 不错的选择 | 避免 | +| ----------------------------- | ----- | ---------------------------- | +| 最高温度 | 最高温度 | 最高温度 (°C) | +| 降水量\_mm | 沉淀 | 预CMM | +| 平均年增长率 | 年均增长率 | 平均年增长率 | +| 性别 | 性别 | 男/女 | +| 重量 | 重量 | w. | +| 单元格类型 | 单元格类型 | 单元格类型 | +| 观察\_01 | 第一次观察 | 第一次观察 | + +### 在数据中使用特殊字符 {#special} + +**示例**:在写笔记时,您将电子表格程序视为文字处理器 +,例如直接从 Word 或 +其他应用程序复制数据。 + +**解决方案**:这是一种常见的策略。 例如,当在单元格中写入 +较长的文本时,人们通常会在电子表格中包含换行符、破折号、 +等。 另外,从 +应用程序(例如 Word)复制数据时,格式和花哨的非标准 +字符(例如左对齐和右对齐的引号)将包含在 +中。 将这些数据导出到编码/统计 +环境或关系数据库时,可能会发生危险的事情, +例如行被切成两半,并出现编码错误。 + +一般的最佳做法是避免添加换行符、 +制表符和垂直制表符等字符。 换句话说,将文本单元格视为 +一个只能包含文本和空格的简单 Web 表单。 + +### 在数据表中包含元数据 {#metadata} + +**示例**:在数据表的顶部或底部添加图例 +,解释列的含义、单位、例外等。 + +**解决方案**:记录有关您的数据的数据(“元数据”)是 +至关重要。 在 +收集和分析数据集时,您可能对数据集了如指掌,但几个月、一年或更长时间后,您仍然 +记得变量“sglmemgp”表示组中的单个成员,例如 +,或者您用来转换变量 +或创建派生变量的确切算法的可能性很小。 + +同样,其他人可能出于多种原因想要检查或 +使用您的数据 - 了解您的发现、验证您的发现、 +审查您提交的出版物、复制您的结果、 +设计类似的研究,甚至存档您的数据以供他人访问和 +重复使用。 虽然数字数据从定义上来说是 +机器可读的,但理解其含义却是人类 +的工作。 在研究的收集 +和分析阶段记录数据的重要性怎么强调也不为过, +尤其是当您的研究将成为学术 +记录的一部分时。 + +但是,元数据不应该包含在数据文件 +本身中。 与论文或补充文件中的表格不同,元数据(以 +图例的形式)不应包含在数据文件中,因为此 +信息不是数据,并且包含它可能会破坏计算机 +程序对数据文件的解释方式。 相反,元数据应该作为单独的文件存储在 +与数据文件位于同一目录中,最好以纯文本格式存储在 +中,并且其名称应与 +数据文件明确关联。 Because metadata files are free text format, they also allow you to encode comments, units, information about how null values are encoded, etc. that are important to document but can disrupt the formatting of your data file. -Additionally, file or database level metadata describes how files that -make up the dataset relate to each other; what format they are in; and -whether they supercede or are superceded by previous files. 
A -folder-level readme.txt file is the classic way of accounting for all -the files and folders in a project. +此外,文件或数据库级别的元数据描述了组成数据集的 +文件彼此间的关系、它们的格式是什么;以及 +它们是否取代了以前的文件或者被以前的文件取代。 +文件夹级别的 readme.txt 文件是记录项目中所有 +文件和文件夹的经典方式。 (Text on metadata adapted from the online course Research Data [MANTRA](https://datalib.edina.ac.uk/mantra) by EDINA and Data Library, -University of Edinburgh. MANTRA is licensed under a Creative Commons +University of Edinburgh. MANTRA 已获得 Creative Commons Attribution 4.0 International -License.) +License 的许可。) -## Exporting data +## 导出数据 -**Question** +**问题** -- How can we export data from spreadsheets in a way that is useful for - downstream applications? +- 我们如何才能以对 + 下游应用程序有用的方式从电子表格中导出数据? -**Objectives** +**目标** -- Store spreadsheet data in universal file formats. -- Export data from a spreadsheet to a CSV file. +- 以通用文件格式存储电子表格数据。 +- 将数据从电子表格导出到 CSV 文件。 -**Keypoints** +**关键点** - Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data. -- Exporting data from spreadsheets to formats like CSV or TSV puts it - in a format that can be used consistently by most programs. +- 将数据从电子表格导出为 CSV 或 TSV 等格式,会将其 + 置于大多数程序可以一致使用的格式中。 -Storing the data you're going to work with for your analyses in Excel -default file format (`*.xls` or `*.xlsx` - depending on the Excel -version) isn't a good idea. Why? +将您要用于分析的数据存储在 Excel +默认文件格式(`*.xls` 或 `*.xlsx` - 取决于 Excel +版本)中并不是一个好主意。 为什么? -- Because it is a proprietary format, and it is possible that in the - future, technology won't exist (or will become sufficiently rare) to - make it inconvenient, if not impossible, to open the file. +- 因为它是一种专有格式,并且有可能在 + 未来,技术将不再存在(或者变得足够稀有),从而导致 + 打开文件变得不方便,甚至不可能。 -- Other spreadsheet software may not be able to open files saved in a - proprietary Excel format. +- 其他电子表格软件可能无法打开以 + 专有 Excel 格式保存的文件。 -- Different versions of Excel may handle data differently, leading to - inconsistencies. 
[Dates](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) - is a well-documented example of inconsistencies in data storage. +- 不同版本的 Excel 可能以不同的方式处理数据,导致 + 不一致。 [日期](https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html) + 是数据存储不一致的一个有据可查的例子。 - Finally, more journals and grant agencies are requiring you to deposit your data in a data repository, and most of them don't - accept Excel format. It needs to be in one of the formats discussed - below. + accept Excel format. 它需要采用下面讨论的 + 格式之一。 -- The above points also apply to other formats such as open data - formats used by LibreOffice / Open Office. These formats are not - static and do not get parsed the same way by different software - packages. +- 上述要点也适用于其他格式,例如 LibreOffice / Open Office 使用的开放数据 + 格式。 这些格式不是 + 静态的,并且不会被不同的软件 + 包以相同的方式解析。 Storing data in a universal, open, and static format will help deal -with this problem. Try tab-delimited (tab separated values or TSV) or -comma-delimited (comma separated values or CSV). CSV files are plain -text files where the columns are separated by commas, hence 'comma -separated values' or CSV. The advantage of a CSV file over an -Excel/SPSS/etc. file is that we can open and read a CSV file using -just about any software, including plain text editors like TextEdit or -NotePad. Data in a CSV file can also be easily imported into other -formats and environments, such as SQLite and R. We're not tied to a -certain version of a certain expensive program when we work with CSV -files, so it's a good format to work with for maximum portability and -endurance. Most spreadsheet programs can save to delimited text +with this problem. 
尝试制表符分隔(制表符分隔值或 TSV)或
+逗号分隔(逗号分隔值或 CSV)。 CSV 文件是普通的
+文本文件,其中列由逗号分隔,因此为“逗号
+分隔值”或 CSV。 CSV 文件相对于
+Excel/SPSS/等文件的优势在于,我们可以使用
+几乎任何软件打开和读取 CSV 文件,包括纯文本编辑器,如 TextEdit 或
+NotePad。 CSV 文件中的数据还可以轻松导入到其他
+格式和环境中,例如 SQLite 和 R。当我们使用 CSV
+文件时,我们不受
+某个昂贵程序的某个版本的限制,因此它是一种很好的格式,可以实现最大的可移植性和
+耐用性。 Most spreadsheet programs can save to delimited text
 formats like CSV easily, although they may give you a warning during
 the file export.

-To save a file you have opened in Excel in CSV format:
+要以 CSV 格式保存在 Excel 中打开的文件:

-1. From the top menu select 'File' and 'Save as'.
-2. In the 'Format' field, from the list, select 'Comma Separated
-   Values' (`*.csv`).
-3. Double check the file name and the location where you want to save
-   it and hit 'Save'.
+1. 从顶部菜单中选择“文件”和“另存为”。
+2. 在“格式”字段中,从列表中选择“以逗号分隔的
+   值”(`*.csv`)。
+3. 仔细检查文件名和要保存的位置
+   然后点击“保存”。

-An important note for backwards compatibility: you can open CSV files
-in Excel!
+关于向后兼容性的重要说明:您可以在 Excel 中打开 CSV 文件
+!

 ```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"}
 knitr::include_graphics("fig/excel-to-csv.png")
 ```

-**A note on R and `xls`**: There are R packages that can read `xls`
-files (as well as Google spreadsheets). It is even possible to access
-different worksheets in the `xls` documents.
+**关于 R 和 `xls`** 的注释:有一些 R 包可以读取 `xls`
+文件(以及 Google 电子表格)。 甚至可以访问 `xls` 文档中的
+不同的工作表。

-**But**
+**但**

-- some of these only work on Windows.
-- this equates to replacing a (simple but manual) export to `csv` with
-  additional complexity/dependencies in the data analysis R code.
-- data formatting best practice still apply.
-- Is there really a good reason why `csv` (or similar) is not
-  adequate?
+- 其中一些仅适用于 Windows。
+- 这相当于用数据分析 R 代码中的
+  额外的复杂性/依赖性替换 (简单但手动的) 导出到 `csv`。
+- 数据格式的最佳实践仍然适用。
+- 真的有充分的理由说明为什么 `csv`(或类似的东西)不适合
+  吗?
-### Caveats on commas
-In some datasets, the data values themselves may include commas
-(,). In that case, the software which you use (including Excel) will
-most likely incorrectly display the data in columns. This is because
-the commas which are a part of the data values will be interpreted as
-delimiters.
+### 关于逗号的注意事项
+在某些数据集中,数据值本身可能包含逗号
+(,)。 In that case, the software which you use (including Excel) will
+most likely incorrectly display the data in columns. 这是因为
+作为数据值一部分的逗号将被解释为
+分隔符。

-For example, our data might look like this:
+例如,我们的数据可能如下所示:

 ```
 species_id,genus,species,taxa
 AB,Amphispiza,bilineata,Bird
 AH,Ammospermophilus,harrisi,Rodent, not censused
 AS,Ammodramus,savannarum,Bird
 BA,Baiomys,taylori,Rodent
 ```

-In the record `AH,Ammospermophilus,harrisi,Rodent, not censused` the
-value for `taxa` includes a comma (`Rodent, not censused`). If we try
-to read the above into Excel (or other spreadsheet program), we will
-get something like this:
+在记录 `AH,Ammospermophilus,harrisi,Rodent, not censused` 中,`taxa` 的
+值包含逗号(`Rodent, not censused`)。 如果我们尝试
+将上述内容读入 Excel(或其他电子表格程序),我们将
+得到如下内容:

 ```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"}
 knitr::include_graphics("fig/csv-mistake.png")
 ```

-The value for `taxa` was split into two columns (instead of being put
-in one column `D`). This can propagate to a number of further
-errors. For example, the extra column will be interpreted as a column
-with many missing values (and without a proper header). In addition to
-that, the value in column `D` for the record in row 3 (so the one
-where the value for 'taxa' contained the comma) is now incorrect.
+`taxa` 的值被分成两列(而不是被放在 `D` 列中的
+)。 这可能会导致更多
+错误。 例如,额外的列将被解释为具有许多缺失值(并且没有适当的标题)的列
+。 除
+之外,第 3 行记录的 `D` 列中的值(因此
+其中 'taxa' 的值包含逗号)现在不正确。

-If you want to store your data in `csv` format and expect that your
-data values may contain commas, you can avoid the problem discussed
-above by putting the values in quotes (""). Applying this rule, our
-data might look like this:
+如果您想以 `csv` 格式存储数据,并预计
+数据值可能包含逗号,则可以通过将值放在引号("")中来避免上面讨论的
+问题。 应用此规则,我们的
+数据可能如下所示:

 ```
 species_id,genus,species,taxa
 "AB","Amphispiza","bilineata","Bird"
 "AH","Ammospermophilus","harrisi","Rodent, not censused"
 "AS","Ammodramus","savannarum","Bird"
 "BA","Baiomys","taylori","Rodent"
 ```

-Now opening this file as a `csv` in Excel will not lead to an extra
-column, because Excel will only use commas that fall outside of
-quotation marks as delimiting characters.
+现在在 Excel 中将此文件作为 `csv` 打开不会导致出现多余的
+列,因为 Excel 只会使用超出
+引号的逗号作为分隔字符。

 Alternatively, if you are working with data that contains commas, you
 likely will need to use another delimiter when working in a
-spreadsheet[^decsep]. In this case, consider using tabs as your delimiter and
-working with TSV files. TSV files can be exported from spreadsheet
-programs in the same way as CSV files.
-
-[^decsep]: This is particularly relevant in European
-    countries where the comma is used as a decimal
-    separator. In such cases, the default value separator in a
-    csv file will be the semi-colon (;), or values will be
-    systematically quoted.
-
-If you are working with an already existing dataset in which the data
-values are not included in "" but which have commas as both delimiters
-and parts of data values, you are potentially facing a major problem
-with data cleaning.
If the dataset you're dealing with contains -hundreds or thousands of records, cleaning them up manually (by either -removing commas from the data values or putting the values into -quotes - "") is not only going to take hours and hours but may -potentially end up with you accidentally introducing many errors. - -Cleaning up datasets is one of the major problems in many scientific -disciplines. The approach almost always depends on the particular -context. However, it is a good practice to clean the data in an -automated fashion, for example by writing and running a script. The -Python and R lessons will give you the basis for developing skills to -build relevant scripts. - -## Summary +spreadsheet[^decsep]. 在这种情况下,请考虑使用制表符作为分隔符,并使用 +来处理 TSV 文件。 TSV 文件可以以与 CSV 文件相同的方式从电子表格 +程序中导出。 + +[^decsep]: 这在欧洲 + 国家尤其重要,这些国家使用逗号作为小数 + 分隔符。 在这种情况下, + csv 文件中的默认值分隔符将是分号 (;),或者值将是 + 系统引用。 + +如果您正在处理一个已经存在的数据集,其中数据 +值未包含在“”中,但其中逗号既作为分隔符 +又作为数据值的一部分,则您可能会面临数据清理的一个主要问题 +。 如果您处理的数据集包含 +数百或数千条记录,手动清理它们(通过 +从数据值中删除逗号或将值放入 +引号中 - “”)不仅会花费数小时,而且可能 +最终导致您意外引入许多错误。 + +清理数据集是许多科学 +学科的主要问题之一。 该方法几乎总是依赖于特定的 +环境。 但是,以 +自动化的方式清理数据是一种很好的做法,例如通过编写和运行脚本。 +Python 和 R 课程将为您提供开发 +构建相关脚本的技能的基础。 + +## 概括 ```{r analysis, results="asis", fig.margin=TRUE, fig.cap="A typical data analysis workflow.", fig.width=7, fig.height=4, echo=FALSE, purl=FALSE} knitr::include_graphics("fig/analysis.png") ``` -A typical data analysis workflow is illustrated in the figure above, -where data is repeatedly transformed, visualised, and modelled. This -iteration is repeated multiple times until the data is understood. In -many real-life cases, however, most time is spent cleaning up and -preparing the data, rather than actually analysing and understanding -it. 
+上图展示了典型的数据分析工作流程, +其中数据被重复转换、可视化和建模。 这个 +迭代重复多次,直到数据被理解。 然而,在 +许多现实生活中,大部分时间都花在清理和 +准备数据上,而不是实际分析和理解 +数据上。 -An agile data analysis workflow, with several fast iterations of the -transform/visualise/model cycle is only feasible if the data is -formatted in a predictable way and one can reason about the data -without having to look at it and/or fix it. +敏捷的数据分析工作流程(包含 +转换/可视化/模型循环的几次快速迭代)只有在数据 +以可预测的方式格式化,并且可以推断数据 +而不必查看和/或修复它时才可行。 :::::::::::::::::::::::::::::::::::::::: keypoints -- Good data organization is the foundation of any research project. +- 良好的数据组织是任何研究项目的基础。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 4d22bb709544fe05600aa045834b63d8feb9e510 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:19 +0900 Subject: [PATCH 195/334] New translations 20-r-rstudio.md (French) --- locale/fr/episodes/20-r-rstudio.Rmd | 893 ++++++++++++++-------------- 1 file changed, 446 insertions(+), 447 deletions(-) diff --git a/locale/fr/episodes/20-r-rstudio.Rmd b/locale/fr/episodes/20-r-rstudio.Rmd index 6b0ca4095..8c4aeb061 100644 --- a/locale/fr/episodes/20-r-rstudio.Rmd +++ b/locale/fr/episodes/20-r-rstudio.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: R and RStudio +title: R et RStudio teaching: 30 exercises: 0 --- @@ -10,329 +10,328 @@ exercises: 0 ::::::::::::::::::::::::::::::::::::::: objectives -- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. -- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. -- Use the built-in RStudio help interface to search for more information on R functions. -- Demonstrate how to provide sufficient information for troubleshooting with the R user community. +- Décrivez l'objectif des volets RStudio Script, Console, Environment et Plots. 
+- Organisez les fichiers et les répertoires pour un ensemble d'analyses en tant que projet R et comprenez le but du répertoire de travail. +- Utilisez l'interface d'aide intégrée de RStudio pour rechercher plus d'informations sur les fonctions R. +- Montrez comment fournir suffisamment d’informations pour le dépannage avec la communauté des utilisateurs R. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What are R and RStudio? +- Que sont R et RStudio ? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Analyse des données et +> Visualisation dans R pour les écologistes_ de Data Carpentries. -## What is R? What is RStudio? +## Qu’est-ce que R ? Qu’est-ce que RStudio ? -The term [R](https://www.r-project.org/) is used to refer to the -_programming language_, the _environment for statistical computing_ -and _the software_ that interprets the scripts written using it. +Le terme [R](https://www.r-project.org/) est utilisé pour désigner le +_langage de programmation_, l'_environnement de calcul statistique_ +et _le logiciel_ qui interprète les scripts écrits à l'aide de celui-ci. -[RStudio](https://rstudio.com) is currently a very popular way to not -only write your R scripts but also to interact with the R -software[^plainr]. To function correctly, RStudio needs R and -therefore both need to be installed on your computer. +[RStudio](https://rstudio.com) est actuellement un moyen très populaire non seulement +d'écrire vos scripts R mais aussi d'interagir avec le logiciel R +[^plainr]. Pour fonctionner correctement, RStudio a besoin de R et +donc les deux doivent être installés sur votre ordinateur. 
-[^plainr]: As opposed to using R directly from the command line - console. There exist other software that interface and integrate - with R, but RStudio is particularly well suited for beginners - while providing numerous very advanced features. +[^plainr]: Au lieu d'utiliser R directement depuis la console de ligne de commande + . Il existe d'autres logiciels qui s'interfacent et intègrent + avec R, mais RStudio est particulièrement bien adapté aux débutants + tout en proposant de nombreuses fonctionnalités très avancées. -The RStudio IDE Cheat +La RStudio IDE Cheat Sheet -provides much more information than will be covered here, but can be -useful to learn keyboard shortcuts and discover new features. +fournit beaucoup plus d'informations que ce qui sera couvert ici, mais peut être +utile pour apprendre les raccourcis clavier et découvrir de nouvelles fonctionnalités. -## Why learn R? +## Pourquoi apprendre R ? -### R does not involve lots of pointing and clicking, and that's a good thing +### R n'implique pas beaucoup de pointage et de clic, et c'est une bonne chose The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of -written commands, and that's a good thing! So, if you want to redo -your analysis because you collected more data, you don't have to -remember which button you clicked in which order to obtain your -results; you just have to run your script again. +written commands, and that's a good thing! Ainsi, si vous souhaitez refaire +votre analyse parce que vous avez collecté plus de données, vous n'avez pas besoin de +vous rappeler sur quel bouton vous avez cliqué dans quel ordre pour obtenir vos +résultats ; il vous suffit de réexécuter votre script. 
-Working with scripts makes the steps you used in your analysis clear, -and the code you write can be inspected by someone else who can give -you feedback and spot mistakes. +Travailler avec des scripts rend les étapes que vous avez utilisées dans votre analyse claires, +et le code que vous écrivez peut être inspecté par quelqu'un d'autre qui peut vous donner +des commentaires et repérer les erreurs. -Working with scripts forces you to have a deeper understanding of what -you are doing, and facilitates your learning and comprehension of the -methods you use. +Travailler avec des scripts vous oblige à avoir une compréhension plus profonde de ce que +vous faites et facilite votre apprentissage et votre compréhension des méthodes +que vous utilisez. -### R code is great for reproducibility +### Le code R est idéal pour la reproductibilité -Reproducibility means that someone else (including your future self) can -obtain the same results from the same dataset when using the same -analysis code. +La reproductibilité signifie que quelqu'un d'autre (y compris votre futur moi) peut +obtenir les mêmes résultats à partir du même ensemble de données en utilisant le même code d'analyse +. -R integrates with other tools to generate manuscripts or reports from your -code. If you collect more data, or fix a mistake in your dataset, the -figures and the statistical tests in your manuscript or report are updated -automatically. +R s'intègre à d'autres outils pour générer des manuscrits ou des rapports à partir de votre code +. Si vous collectez plus de données ou corrigez une erreur dans votre ensemble de données, les chiffres +et les tests statistiques de votre manuscrit ou rapport sont mis à jour +automatiquement. -An increasing number of journals and funding agencies expect analyses -to be reproducible, so knowing R will give you an edge with these -requirements. 
+Un nombre croissant de revues et d'agences de financement s'attendent à ce que les analyses +soient reproductibles, donc connaître R vous donnera un avantage avec ces +exigences. -### R is interdisciplinary and extensible +### R est interdisciplinaire et extensible -With 10000+ packages[^whatarepkgs] that can be installed to extend its -capabilities, R provides a framework that allows you to combine -statistical approaches from many scientific disciplines to best suit -the analytical framework you need to analyse your data. For instance, -R has packages for image analysis, GIS, time series, population -genetics, and a lot more. +Avec plus de 10 000 packages[^whatarepkgs] pouvant être installés pour étendre ses +capacités, R fournit un cadre qui vous permet de combiner +des approches statistiques de nombreuses disciplines scientifiques pour s'adapter au mieux à +le cadre analytique dont vous avez besoin pour analyser vos données. Par exemple, +R propose des packages pour l'analyse d'images, le SIG, les séries chronologiques, la génétique +de population et bien plus encore. -[^whatarepkgs]: i.e. add-ons that confer R with new functionality, - such as bioinformatics data analysis. +[^whatarepkgs]: c'est-à-dire des modules complémentaires qui confèrent à R de nouvelles fonctionnalités, + telles que l'analyse de données bioinformatiques. ```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/cran.png") ``` -### R works on data of all shapes and sizes +### R fonctionne sur des données de toutes formes et tailles -The skills you learn with R scale easily with the size of your -dataset. Whether your dataset has hundreds or millions of lines, it -won't make much difference to you. 
+Les compétences que vous apprenez avec R évoluent facilement avec la taille de votre ensemble de données +. Que votre ensemble de données comporte des centaines ou des millions de lignes, cela +ne fera pas beaucoup de différence pour vous. -R is designed for data analysis. It comes with special data structures -and data types that make handling of missing data and statistical -factors convenient. +R est conçu pour l’analyse des données. Il est livré avec des structures de données spéciales +et des types de données qui facilitent la gestion des données manquantes et des facteurs statistiques +. -R can connect to spreadsheets, databases, and many other data formats, -on your computer or on the web. +R peut se connecter à des feuilles de calcul, des bases de données et à de nombreux autres formats de données, +sur votre ordinateur ou sur le Web. -### R produces high-quality graphics +### R produit des graphiques de haute qualité -The plotting functionalities in R are extensive, and allow you to adjust -any aspect of your graph to convey most effectively the message from -your data. +Les fonctionnalités de traçage de R sont étendues et vous permettent d'ajuster +n'importe quel aspect de votre graphique pour transmettre le plus efficacement possible le message de +vos données. -### R has a large and welcoming community +### R a une communauté nombreuse et accueillante -Thousands of people use R daily. Many of them are willing to help you -through mailing lists and websites such as Stack -Overflow, or on the RStudio -community. These broad user communities -extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays." 
This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). +Des milliers de personnes utilisent R quotidiennement. Beaucoup d'entre eux sont prêts à vous aider +via des listes de diffusion et des sites Web tels que Stack +Overflow, ou sur le RStudio +communauté. Ces larges communautés d'utilisateurs +s'étendent à des domaines spécialisés tels que la bioinformatique. L'un de ces sous-ensembles de la communauté R est [Bioconductor](https://bioconductor.org/), un projet scientifique pour l'analyse et la compréhension « des données provenant d'essais biologiques actuels et émergents ». Cet atelier a été développé par des membres de la communauté Bioconductor ; pour plus d'informations sur Bioconductor, veuillez consulter l'atelier complémentaire ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). -### Not only is R free, but it is also open-source and cross-platform +### Non seulement R est gratuit, mais il est également open source et multiplateforme -Anyone can inspect the source code to see how R works. Because of this -transparency, there is less chance for mistakes, and if you (or -someone else) find some, you can report and fix bugs. +N'importe qui peut inspecter le code source pour voir comment R fonctionne. Grâce à cette +transparence, il y a moins de risques d'erreurs, et si vous (ou +quelqu'un d'autre) en trouvez, vous pouvez signaler et corriger des bugs. -## Knowing your way around RStudio +## Connaître RStudio -Let's start by learning about [RStudio](https://www.rstudio.com/), -which is an Integrated Development Environment (IDE) for working with +Commençons par découvrir [RStudio](https://www.rstudio.com/), +qui est un environnement de développement intégré (IDE) permettant de travailler avec R. 
-The RStudio IDE open-source product is free under the Affero General +Le produit open source RStudio IDE est gratuit sous la Affero General Public License (AGPL) v3. -The RStudio IDE is also available with a commercial license and -priority email support from Posit, Inc. +L'IDE RStudio est également disponible avec une licence commerciale et +une assistance prioritaire par courrier électronique de Posit, Inc. -We will use the RStudio IDE to write code, navigate the files on our -computer, inspect the variables we are going to create, and visualise -the plots we will generate. RStudio can also be used for other things -(e.g., version control, developing packages, writing Shiny apps) that -we will not cover during the workshop. +Nous utiliserons l'IDE RStudio pour écrire du code, parcourir les fichiers sur notre +ordinateur, inspecter les variables que nous allons créer et visualiser +les tracés que nous allons générer. RStudio peut également être utilisé pour d'autres choses +(par exemple, le contrôle de version, le développement de packages, l'écriture d'applications Shiny) que +nous n'aborderons pas pendant l'atelier. ```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/rstudio-screenshot.png") ``` -The RStudio window is divided into 4 "Panes": - -- the **Source** for your scripts and documents (top-left, in the - default layout) -- your **Environment/History** (top-right), -- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and -- the R **Console** (bottom-left). - -The placement of these panes and their content can be customised (see -menu, `Tools -> Global Options -> Pane Layout`). - -One of the advantages of using RStudio is that all the information you -need to write code is available in a single window. 
Additionally, with -many shortcuts, **autocompletion**, and **highlighting** for the major -file types you use while developing in R, RStudio will make typing -easier and less error-prone. - -## Getting set up - -It is good practice to keep a set of related data, analyses, and text -self-contained in a single folder, called the **working -directory**. All of the scripts within this folder can then use -**relative paths** to files that indicate where inside the project a -file is located (as opposed to absolute paths, which point to where a -file is on a specific computer). Working this way makes it a lot -easier to move your project around on your computer and share it with -others without worrying about whether or not the underlying scripts -will still work. - -RStudio provides a helpful set of tools to do this through its "Projects" -interface, which not only creates a working directory for you, but also remembers -its location (allowing you to quickly navigate to it) and optionally preserves -custom settings and open files to make it easier to resume work after a -break. Go through the steps for creating an "R Project" for this -tutorial below. - -1. Start RStudio. -2. Under the `File` menu, click on `New project`. Choose `New directory`, then - `New project`. -3. Enter a name for this new folder (or "directory"), and choose a - convenient location for it. This will be your **working directory** - for this session (or whole course) (e.g., `bioc-intro`). -4. Click on `Create project`. -5. (Optional) Set Preferences to 'Never' save workspace in RStudio. - -RStudio's default preferences generally work well, but saving a workspace to -.RData can be cumbersome, especially if you are working with larger datasets. -To turn that off, go to Tools --> 'Global Options' and select the 'Never' option -for 'Save workspace to .RData' on exit. 
+La fenêtre RStudio est divisée en 4 "Volets" : + +- la **Source** de vos scripts et documents (en haut à gauche, dans la mise en page + par défaut) +- votre **Environnement/Historique** (en haut à droite), +- vos **Fichiers/Tracés/Packages/Aide/Visionneuse** (en bas à droite), et +- la R **Console** (en bas à gauche). + +L'emplacement de ces volets et leur contenu peuvent être personnalisés (voir le +menu `Outils -> Options globales -> Disposition des volets`). + +L'un des avantages de l'utilisation de RStudio est que toutes les informations dont vous +avez besoin pour écrire du code sont disponibles dans une seule fenêtre. De plus, avec +de nombreux raccourcis, la **complétion automatique** et la **mise en surbrillance** pour les principaux +types de fichiers que vous utilisez lors du développement dans R, RStudio rendra la saisie +plus facile et moins sujette aux erreurs. + +## Mise en place + +Il est recommandé de conserver un ensemble de données, d'analyses et de textes connexes +autonomes dans un seul dossier, appelé le +**répertoire de travail**. Tous les scripts de ce dossier peuvent alors utiliser +des **chemins relatifs** vers les fichiers, qui indiquent où dans le projet se trouve un fichier +(par opposition aux chemins absolus, qui pointent vers l'endroit +où un fichier +se trouve sur un ordinateur spécifique). Travailler de cette façon rend +beaucoup plus facile le déplacement de votre projet sur votre ordinateur et le partage avec +d'autres sans vous soucier de savoir si les scripts sous-jacents +fonctionneront toujours. + +RStudio fournit un ensemble d'outils utiles pour ce faire via son interface "Projets" +, qui non seulement crée un répertoire de travail pour vous, mais mémorise également +son emplacement (vous permettant d'y accéder rapidement) et conserve éventuellement +les paramètres personnalisés et les fichiers ouverts pour faciliter la reprise du travail après une +pause.
Suivez les étapes de création d'un "Projet R" pour ce tutoriel +ci-dessous. + +1. Démarrez RStudio. +2. Dans le menu « Fichier », cliquez sur « Nouveau projet ». Choisissez `Nouveau répertoire`, puis + `Nouveau projet`. +3. Entrez un nom pour ce nouveau dossier (ou "répertoire") et choisissez un + emplacement pratique pour celui-ci. Ce sera votre **répertoire de travail** + pour cette session (ou tout le cours) (par exemple, `bioc-intro`). +4. Cliquez sur « Créer un projet ». +5. (Facultatif) Définissez les préférences sur « Jamais » pour enregistrer l'espace de travail dans RStudio. + +Les préférences par défaut de RStudio fonctionnent généralement bien, mais enregistrer un espace de travail dans +.RData peut être fastidieux, surtout si vous travaillez avec des ensembles de données plus volumineux. +Pour désactiver cela, allez dans Outils --> « Options globales » et sélectionnez l'option « Jamais » +pour « Enregistrer l'espace de travail dans .RData » à la sortie. ```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudio-preferences.png") ``` -To avoid character encoding issues between Windows and other operating -systems, we are -going to set UTF-8 by default: +Pour éviter les problèmes d'encodage des caractères entre Windows et d'autres +systèmes d'exploitation, nous allons +définir UTF-8 par défaut : ```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/utf8.png") ``` -### Organizing your working directory - -Using a consistent folder structure across your projects will help keep things -organised, and will also make it easy to find/file things in the future. This -can be especially helpful when you have multiple projects. 
In general, you may -create directories (folders) for **scripts**, **data**, and **documents**. - -- **`data/`** Use this folder to store your raw data and intermediate - datasets you may create for the need of a particular analysis. For - the sake of transparency and - [provenance](https://en.wikipedia.org/wiki/Provenance), you should - _always_ keep a copy of your raw data accessible and do as much of - your data cleanup and preprocessing programmatically (i.e., with - scripts, rather than manually) as possible. Separating raw data - from processed data is also a good idea. For example, you could - have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept - separate from a `data/processed/tree.survey.csv` file generated by - the `scripts/01.preprocess.tree_survey.R` script. -- **`documents/`** This would be a place to keep outlines, drafts, - and other text. -- **`scripts/`** (or `src`) This would be the location to keep your R - scripts for different analyses or plotting, and potentially a - separate folder for your functions (more on that later). - -You may want additional directories or subdirectories depending on -your project needs, but these should form the backbone of your working -directory. +### Organiser votre répertoire de travail + +L'utilisation d'une structure de dossiers cohérente dans vos projets aidera à garder les choses +organisées et facilitera également la recherche/le classement des éléments à l'avenir. Ce +peut être particulièrement utile lorsque vous avez plusieurs projets. En général, vous pouvez +créer des répertoires (dossiers) pour les **scripts**, **données** et **documents**. + +- **`data/`** Utilisez ce dossier pour stocker vos données brutes et les ensembles de données intermédiaires + que vous pouvez créer pour les besoins d'une analyse particulière. 
Par + souci de transparence et de + [provenance](https://en.wikipedia.org/wiki/Provenance), vous devez + _toujours_ conserver une copie accessible de vos données brutes et effectuer autant que + possible le nettoyage et le prétraitement de vos données par programme (c'est-à-dire avec + des scripts, plutôt que manuellement). Séparer les données brutes + des données traitées est également une bonne idée. Par exemple, vous pourriez + avoir les fichiers `data/raw/tree_survey.plot1.txt` et `...plot2.txt` conservés + séparés d'un fichier `data/processed/tree.survey.csv` généré par + le script `scripts/01.preprocess.tree_survey.R`. +- **`documents/`** Ce serait un endroit pour conserver les plans, les brouillons + et d'autres textes. +- **`scripts/`** (ou `src`) Ce serait l'emplacement où conserver vos scripts R + pour différentes analyses ou traçages, et potentiellement un + dossier séparé pour vos fonctions (nous y reviendrons plus tard). + +Vous souhaiterez peut-être des répertoires ou sous-répertoires supplémentaires en fonction +des besoins de votre projet, mais ceux-ci devraient constituer l'épine dorsale de votre répertoire +de travail. ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/working-directory-structure.png") ``` -For this course, we will need a `data/` folder to store our raw data, -and we will use `data_output/` for when we learn how to export data as -CSV files, and `fig_output/` folder for the figures that we will save. +Pour ce cours, nous aurons besoin d'un dossier `data/` pour stocker nos données brutes, +et nous utiliserons `data_output/` lorsque nous apprendrons à exporter des données sous forme de +fichiers CSV, et un dossier `fig_output/` pour les figures que nous allons enregistrer.
::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: create your project directory structure +## Défi : créer la structure de répertoires de votre projet -Under the `Files` tab on the right of the screen, click on `New Folder` and -create a folder named `data` within your newly created working directory -(e.g., `~/bioc-intro/data`). (Alternatively, type `dir.create("data")` at -your R console.) Repeat these operations to create a `data_output/` and a -`fig_output` folders. +Sous l'onglet « Fichiers » à droite de l'écran, cliquez sur « Nouveau dossier » et +créez un dossier nommé `data` dans votre répertoire de travail nouvellement créé +(par exemple, `~/bioc-intro/data`). (Vous pouvez également taper `dir.create("data")` dans +votre console R.) Répétez ces opérations pour créer les dossiers `data_output/` et `fig_output`. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -We are going to keep the script in the root of our working directory -because we are only going to use one file and it will make things -easier. +Nous allons conserver le script à la racine de notre répertoire de travail +car nous n'allons utiliser qu'un seul fichier et cela rendra les choses +plus faciles. -Your working directory should now look like this: +Votre répertoire de travail devrait maintenant ressembler à ceci : ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` -**Project management** is also applicable to bioinformatics projects, -of course[^bioindatascience]. William Noble (@Noble:2009) proposes the -following directory structure: - -[^bioindatascience]: In this course, we consider bioinformatics as - data science applied to biological or bio-medical data.
- -> Directory names are in large typeface, and filenames are in smaller -> typeface. Only a subset of the files are shown here. Note that the -> dates are formatted `<year>-<month>-<day>` so that they can be -> sorted in chronological order. The source code `src/ms-analysis.c` -> is compiled to create `bin/ms-analysis` and is documented in -> `doc/ms-analysis.html`. The `README` files in the data directories -> specify who downloaded the data files from what URL on what -> date. The driver script `results/2009-01-15/runall` automatically -> generates the three subdirectories split1, split2, and split3, -> corresponding to three cross-validation splits. The -> `bin/parse-sqt.py` script is called by both of the `runall` driver -> scripts. +La **gestion de projet** s'applique également aux projets de bioinformatique, +bien sûr[^bioindatascience]. William Noble (@Noble:2009) propose la +structure de répertoires suivante : + +[^bioindatascience]: Dans ce cours, nous considérons la bioinformatique comme une + science des données appliquée aux données biologiques ou bio-médicales. + +> Les noms de répertoires sont en gros caractères et les noms de fichiers sont en caractères plus petits +> . Seul un sous-ensemble des fichiers est affiché ici. Notez que les dates +> sont formatées `<year>-<month>-<day>` afin qu'elles puissent être +> triées par ordre chronologique. Le code source `src/ms-analysis.c` +> est compilé pour créer `bin/ms-analysis` et est documenté dans +> `doc/ms-analysis.html`. Les fichiers `README` dans les répertoires de données +> précisent qui a téléchargé les fichiers de données à partir de quelle URL et à quelle date +> . Le script du pilote `results/2009-01-15/runall` +> génère automatiquement les trois sous-répertoires split1, split2 et split3, +> correspondant à trois divisions de validation croisée. Le script +> `bin/parse-sqt.py` est appelé par les deux scripts du pilote `runall` +> . 
```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} knitr::include_graphics("fig/noble-bioinfo-project.png") ``` -The most important aspect of a well defined and well documented -project directory is to enable someone unfamiliar with the -project[^futureself] to +L'aspect le plus important d'un répertoire de projet +bien défini et bien documenté est de permettre à quelqu'un qui n'est pas familier avec le projet +[^futureself] de -1. understand what the project is about, what data are available, what - analyses were run, and what results were produced and, most - importantly to +1. comprendre en quoi consiste le projet, quelles données sont disponibles, quelles + analyses ont été effectuées et quels résultats ont été produits et, plus important encore, -2. repeat the analysis over again - with new data, or changing some - analysis parameters. +2. répétez l'analyse à nouveau - avec de nouvelles données ou en modifiant certains paramètres d'analyse + . -[^futureself]: That someone could be, and very likely will be your - future self, a couple of months or years after the analyses were - run. +[^futureself]: Cette personne pourrait être, et sera très probablement votre + futur moi, quelques mois ou années après que les analyses aient été + effectuées. -### The working directory +### Le répertoire de travail -The working directory is an important concept to understand. It is the -place from where R will be looking for and saving the files. When you -write code for your project, it should refer to files in relation to -the root of your working directory and only need files within this -structure. +Le répertoire de travail est un concept important à comprendre. C'est l'endroit +à partir duquel R recherchera et enregistrera les fichiers. 
Lorsque vous +écrivez du code pour votre projet, il doit faire référence à des fichiers en relation avec +la racine de votre répertoire de travail et n'a besoin que de fichiers au sein de cette structure +. -Using RStudio projects makes this easy and ensures that your working -directory is set properly. If you need to check it, you can use +L'utilisation de projets RStudio facilite cela et garantit que votre répertoire de travail +est correctement défini. Si vous avez besoin de le vérifier, vous pouvez utiliser `getwd()`. If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser where your working directory should be, and clicking on the blue gear icon `More`, and select `Set As Working Directory`. -Alternatively you can use `setwd("/path/to/working/directory")` to -reset your working directory. However, your scripts should not include -this line because it will fail on someone else's computer. +Vous pouvez également utiliser `setwd("/path/to/working/directory")` pour +réinitialiser votre répertoire de travail. Cependant, vos scripts ne doivent pas inclure +cette ligne car elle échouera sur l'ordinateur de quelqu'un d'autre. -**Example** +**Exemple** -The schema below represents the working directory `bioc-intro` with the -`data` and `fig_output` sub-directories, and 2 files in the latter: +Le schéma ci-dessous représente le répertoire de travail `bioc-intro` avec les sous-répertoires +`data` et `fig_output`, et 2 fichiers dans ce dernier : ``` bioc-intro/data/ @@ -340,155 +339,155 @@ bioc-intro/data/ /fig_output/fig2.png ``` -If we were in the working directory, we could refer to the `fig1.pdf` -file using the relative path `bioc-intro/fig_output/fig1.pdf` or the -absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. 
- -If we were in the `data` directory, we would use the relative path -`../fig_output/fig1.pdf` or the same absolute path -`/home/user/bioc-intro/fig_output/fig1.pdf`. - -## Interacting with R - -The basis of programming is that we write down instructions for the -computer to follow, and then we tell the computer to follow those -instructions. We write, or _code_, instructions in R because it is a -common language that both the computer and we can understand. We call -the instructions _commands_ and we tell the computer to follow the -instructions by _executing_ (also called _running_) those commands. - -There are two main ways of interacting with R: by using the -**console** or by using **scripts** (plain text files that contain -your code). The console pane (in RStudio, the bottom left panel) is -the place where commands written in the R language can be typed and -executed immediately by the computer. It is also where the results -will be shown for commands that have been executed. You can type -commands directly into the console and press `Enter` to execute those -commands, but they will be forgotten when you close the session. - -Because we want our code and workflow to be reproducible, it is better -to type the commands we want in the script editor, and save the -script. This way, there is a complete record of what we did, and -anyone (including our future selves!) can easily replicate the -results on their computer. Note, however, that merely typing the commands -in the script does not automatically _run_ them - they still need to -be sent to the console for execution. - -RStudio allows you to execute commands directly from the script editor -by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will -work, too). The command on the current line in the script (indicated -by the cursor) or all of the commands in the currently selected text -will be sent to the console and executed when you press `Ctrl` + -`Enter`. 
You can find other keyboard shortcuts in this RStudio -cheatsheet about the RStudio -IDE. - -At some point in your analysis you may want to check the content of a -variable or the structure of an object, without necessarily keeping a -record of it in your script. You can type these commands and execute -them directly in the console. RStudio provides the `Ctrl` + `1` and -`Ctrl` + `2` shortcuts allow you to jump between the script and the -console panes. - -If R is ready to accept commands, the R console shows a `>` prompt. If +Si nous étions dans le répertoire de travail, nous pourrions faire référence au fichier `fig1.pdf` +en utilisant le chemin relatif `bioc-intro/fig_output/fig1.pdf` ou le chemin absolu +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +Si nous étions dans le répertoire `data`, nous utiliserions le chemin relatif +`../fig_output/fig1.pdf` ou le même chemin absolu +`/home/user/bioc-intro/fig_output/fig1.pdf`. + +## Interagir avec R + +La base de la programmation est que nous écrivons les instructions que l'ordinateur +doit suivre, puis nous disons à l'ordinateur de suivre ces instructions. +Nous écrivons, ou _codons_, des instructions dans R car c'est un +langage commun que l'ordinateur et nous pouvons comprendre. Nous appelons +les instructions _commandes_ et nous disons à l'ordinateur de suivre les instructions +en _exécutant_ (on dit aussi en _lançant_) ces commandes. + +Il existe deux manières principales d'interagir avec R : en utilisant la +**console** ou en utilisant des **scripts** (fichiers texte brut contenant +votre code). Le volet de la console (dans RStudio, le panneau inférieur gauche) est +l'endroit où les commandes écrites en langage R peuvent être saisies et +exécutées immédiatement par l'ordinateur. C'est également là que les résultats +seront affichés pour les commandes exécutées.
Vous pouvez taper des commandes +directement dans la console et appuyer sur « Entrée » pour exécuter ces commandes +, mais elles seront oubliées lorsque vous fermerez la session. + +Parce que nous voulons que notre code et notre flux de travail soient reproductibles, il est préférable +de taper les commandes souhaitées dans l'éditeur de script et d'enregistrer le script +. De cette façon, il existe un enregistrement complet de ce que nous avons fait, et +n'importe qui (y compris notre futur moi !) peuvent facilement reproduire les résultats +sur leur ordinateur. Notez cependant que le simple fait de taper les commandes +dans le script ne les _exécute_ pas automatiquement - elles doivent quand même +être envoyées à la console pour exécution. + +RStudio vous permet d'exécuter des commandes directement depuis l'éditeur de script +en utilisant le raccourci `Ctrl` + `Entrée` (sur Mac, `Cmd` + `Return` +fonctionnera également). La commande sur la ligne actuelle du script (indiquée +par le curseur) ou toutes les commandes dans le texte actuellement sélectionné +seront envoyées à la console et exécutées lorsque vous appuyez sur `Ctrl` + +`Entrer`. Vous pouvez trouver d'autres raccourcis clavier dans cette aide-mémoire RStudio +sur l'IDE RStudio +. + +À un moment donné de votre analyse, vous souhaiterez peut-être vérifier le contenu d'une variable +ou la structure d'un objet, sans nécessairement en conserver un enregistrement +dans votre script. Vous pouvez taper ces commandes et les exécuter +directement dans la console. RStudio fournit les raccourcis `Ctrl` + `1` et +`Ctrl` + `2` vous permettant de passer entre le script et les volets de la console +. + +Si R est prêt à accepter les commandes, la console R affiche une invite `>`. If it receives a command (by typing, copy-pasting or sending from the script editor using `Ctrl` + `Enter`), R will try to execute it, and when ready, will show the results and come back with a new `>` prompt to wait for new commands. 
-If R is still waiting for you to enter more data because it isn't -complete yet, the console will show a `+` prompt. It means that you -haven't finished entering a complete command. This is because you have +Si R attend toujours que vous saisissiez plus de données parce que la commande +n'est pas encore terminée, la console affichera une invite `+`. Cela signifie que vous +n'avez pas fini de saisir une commande complète. This is because you have not 'closed' a parenthesis or quotation, i.e. you don't have the same number of left-parentheses as right-parentheses, or the same number of -opening and closing quotation marks. When this happens, and you -thought you finished typing your command, click inside the console -window and press `Esc`; this will cancel the incomplete command and -return you to the `>` prompt. - -## How to learn more during and after the course? - -The material we cover during this course will give you an initial -taste of how you can use R to analyse data for your own -research. However, you will need to learn more to do advanced -operations such as cleaning your dataset, using statistical methods, -or creating beautiful graphics[^inthiscoure]. The best way to become -proficient and efficient at R, as with any other tool, is to use it to -address your actual research questions. As a beginner, it can feel -daunting to have to write a script from scratch, and given that many -people make their code available online, modifying existing code to -suit your purpose might make it easier for you to get started. - -[^inthiscoure]: We will introduce most of these (except statistics) - here, but will only manage to scratch the surface of the wealth of - what is possible to do with R. +opening and closing quotation marks. Lorsque cela se produit et que vous +pensez avoir fini de taper votre commande, cliquez dans la fenêtre +de la console et appuyez sur `Esc` ; cela annulera la commande incomplète et +vous ramènera à l'invite `>`.
+ +## Comment en savoir plus pendant et après le cours ? + +Le matériel que nous aborderons au cours de ce cours vous donnera un premier +aperçu de la façon dont vous pouvez utiliser R pour analyser des données pour votre propre +recherche. Cependant, vous devrez en apprendre davantage pour effectuer des +opérations avancées telles que nettoyer votre ensemble de données, utiliser des méthodes statistiques, +ou créer de superbes graphiques[^inthiscoure]. La meilleure façon de devenir +compétent et efficace en R, comme avec tout autre outil, est de l'utiliser pour +répondre à vos questions de recherche réelles. En tant que débutant, il peut sembler +intimidant de devoir écrire un script à partir de zéro, et étant donné que de nombreuses +personnes rendent leur code disponible en ligne, modifier le code existant pour +l'adapter à vos objectifs pourrait vous permettre de démarrer plus facilement. + +[^inthiscoure]: Nous présenterons ici la plupart d'entre eux (sauf les statistiques), + mais nous ne parviendrons qu'à effleurer la surface de la richesse de + ce qu'il est possible de faire avec R. ```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} knitr::include_graphics("fig/kitten-try-things.jpg") ``` -## Seeking help +## Chercher de l'aide -### Use the built-in RStudio help interface to search for more information on R functions +### Utilisez l'interface d'aide intégrée de RStudio pour rechercher plus d'informations sur les fonctions R ```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} knitr::include_graphics("fig/rstudiohelp.png") ``` -One of the fastest ways to get help, is to use the RStudio help -interface. This panel by default can be found at the lower right hand -panel of RStudio. As seen in the screenshot, by typing the word -"Mean", RStudio tries to also give a number of suggestions that you -might be interested in.
The description is then shown in the display -window. +L'un des moyens les plus rapides d'obtenir de l'aide consiste à utiliser l'interface d'aide RStudio. +Ce panneau se trouve par défaut dans le panneau inférieur droit +de RStudio. Comme le montre la capture d'écran, en tapant le mot +"Mean", RStudio essaie également de donner un certain nombre de suggestions qui pourraient vous intéresser. +La description s'affiche alors dans la fenêtre +d'affichage. -### I know the name of the function I want to use, but I'm not sure how to use it +### Je connais le nom de la fonction que je souhaite utiliser, mais je ne sais pas comment l'utiliser -If you need help with a specific function, let's say `barplot()`, you -can type: +Si vous avez besoin d'aide avec une fonction spécifique, disons `barplot()`, vous +pouvez taper : ```{r, eval=FALSE, purl=TRUE} ?barplot ``` -If you just need to remind yourself of the names of the arguments, you can use: +Si vous avez juste besoin de vous rappeler les noms des arguments, vous pouvez utiliser : ```{r, eval=FALSE, purl=TRUE} -args(lm) +args(lm) ``` -### I want to use a function that does X, there must be a function for it but I don't know which one... +### Je veux utiliser une fonction qui fait X, il doit y avoir une fonction pour ça mais je ne sais pas laquelle... -If you are looking for a function to do a particular task, you can use the -`help.search()` function, which is called by the double question mark `??`. -However, this only looks through the installed packages for help pages with a -match to your search request +Si vous recherchez une fonction pour effectuer une tâche particulière, vous pouvez utiliser la fonction +`help.search()`, qui est appelée par le double point d'interrogation `??`. +Cependant, cela ne recherche dans les packages installés que les pages d'aide avec une correspondance +avec votre demande de recherche.
```{r, eval=FALSE, purl=TRUE} ??kruskal ``` -If you can't find what you are looking for, you can use -the [rdocumentation.org](https://www.rdocumentation.org) website that searches -through the help files across all packages available. +Si vous ne trouvez pas ce que vous cherchez, vous pouvez utiliser +le site Web [rdocumentation.org](https://www.rdocumentation.org) qui recherche +dans les fichiers d'aide de tous les packages disponibles. -Finally, a generic Google or internet search "R \<task>" will often either send -you to the appropriate package documentation or a helpful forum where someone -else has already asked your question. +Enfin, une recherche générique sur Google ou sur Internet "R \<task>" vous enverra souvent +soit à la documentation du package appropriée, soit à un forum utile où quelqu'un +d'autre a déjà posé votre question. -### I am stuck... I get an error message that I don't understand +### Je suis coincé... Je reçois un message d'erreur que je ne comprends pas -Start by googling the error message. However, this doesn't always work very well -because often, package developers rely on the error catching provided by R. You -end up with general error messages that might not be very helpful to diagnose a -problem (e.g. "subscript out of bounds"). If the message is very generic, you -might also include the name of the function or package you're using in your -query. +Commencez par rechercher le message d'erreur sur Google. Cependant, cela ne fonctionne pas toujours très bien +car souvent, les développeurs de packages s'appuient sur la détection d'erreurs fournie par R. Vous +vous retrouvez avec des messages d'erreur généraux qui pourraient ne pas être très utiles pour diagnostiquer un +problème (par exemple "subscript out of bounds"). Si le message est très générique, vous +pouvez également inclure le nom de la fonction ou du package que vous utilisez dans votre +requête. -However, you should check Stack Overflow. Search using the `[r]` tag.
Most -questions have already been answered, but the challenge is to use the right -words in the search to find the -answers: +Cependant, vous devriez vérifier Stack Overflow. Recherchez en utilisant la balise `[r]`. La plupart des +questions ont déjà reçu une réponse, mais le défi consiste à utiliser les bons +mots dans la recherche pour trouver les +réponses : [http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) @@ -496,173 +495,173 @@ The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language. -The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical -but it is full of useful information. - -### Asking for help - -The key to receiving help from someone is for them to rapidly grasp -your problem. You should make it as easy as possible to pinpoint where -the issue might be. - -Try to use the correct words to describe your problem. For instance, a -package is not the same thing as a library. Most people will -understand what you meant, but others have really strong feelings -about the difference in meaning. The key point is that it can make -things confusing for people trying to help you. Be as precise as -possible when describing your problem. - -If possible, try to reduce what doesn't work to a simple _reproducible -example_. If you can reproduce the problem using a very small data -frame instead of your 50000 rows and 10000 columns one, provide the -small one with the description of your problem. When appropriate, try -to generalise what you are doing so even people who are not in your -field can understand the question. For instance instead of using a -subset of your real dataset, create a small (3 columns, 5 rows) -generic one. 
For more information on how to write a reproducible -example see this article by Hadley +La [FAQ R](https://cran.r-project.org/doc/FAQ/R-FAQ.html) est dense et technique +mais elle regorge d'informations utiles. + +### Demander de l'aide + +La clé pour recevoir de l’aide de quelqu’un est qu’il comprenne rapidement +votre problème. Vous devez faire en sorte qu'il soit aussi simple que possible d'identifier où +pourrait se situer le problème. + +Essayez d'utiliser les mots corrects pour décrire votre problème. Par exemple, un package +n’est pas la même chose qu’une bibliothèque. La plupart des gens +comprendront ce que vous vouliez dire, mais d'autres ont des sentiments très forts +à propos de la différence de sens. Le point clé est que cela peut rendre +les choses déroutantes pour les personnes qui essaient de vous aider. Soyez aussi précis que +possible lorsque vous décrivez votre problème. + +Si possible, essayez de réduire ce qui ne fonctionne pas à un simple \*exemple reproductible +\*. Si vous pouvez reproduire le problème en utilisant un très petit cadre de données +au lieu de celui de 50 000 lignes et 10 000 colonnes, fournissez le petit +avec la description de votre problème. Le cas échéant, essayez +de généraliser ce que vous faites afin que même les personnes qui ne font pas partie de votre domaine +puissent comprendre la question. Par exemple, au lieu d'utiliser un sous-ensemble +de votre ensemble de données réel, créez un petit (3 colonnes, 5 lignes) +générique. Pour plus d'informations sur la façon d'écrire un exemple +reproductible, voir cet article de Hadley Wickham. -To share an object with someone else, if it's relatively small, you -can use the function `dput()`. It will output R code that can be used -to recreate the exact same object as the one in memory: +Pour partager un objet avec quelqu'un d'autre, s'il est relativement petit, vous +pouvez utiliser la fonction `dput()`. 
Il produira du code R qui peut être utilisé +pour recréer exactement le même objet que celui en mémoire : ```{r, results="show", purl=TRUE} -## iris is an example data frame that comes with R and head() is a -## function that returns the first part of the data frame +## iris est un exemple de bloc de données fourni avec R et head() est une +## fonction qui renvoie la première partie du bloc de données dput(head(iris)) ``` If the object is larger, provide either the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your -issue). Alternatively, in particular if your question is not related -to a data frame, you can save any R object to a file[^export]: +issue). Alternativement, en particulier si votre question n'est pas liée +à un bloc de données, vous pouvez enregistrer n'importe quel objet R dans un fichier[^export] : ```{r, eval=FALSE, purl=FALSE} saveRDS(iris, file="/tmp/iris.rds") ``` -The content of this file is however not human readable and cannot be -posted directly on Stack Overflow. Instead, it can be sent to someone -by email who can read it with the `readRDS()` command (here it is -assumed that the downloaded file is in a `Downloads` folder in the -user's home directory): +Le contenu de ce fichier n'est cependant pas lisible par l'homme et ne peut pas être +publié directement sur Stack Overflow. 
Au lieu de cela, il peut être envoyé à quelqu'un +par email qui pourra le lire avec la commande `readRDS()` (ici, +suppose que le fichier téléchargé se trouve dans un dossier `Téléchargements` dans le +répertoire personnel de l'utilisateur) : ```{r, eval=FALSE, purl=FALSE} some_data <- readRDS(file="~/Downloads/iris.rds") ``` -Last, but certainly not least, **always include the output of `sessionInfo()`** -as it provides critical information about your platform, the versions of R and -the packages that you are using, and other information that can be very helpful -to understand your problem. +Dernier point, mais non le moindre, **incluez toujours la sortie de `sessionInfo()`** +car elle fournit des informations critiques sur votre plate-forme, les versions de R et +les packages que vous utilisez. utilisation, et d'autres informations qui peuvent être très utiles +pour comprendre votre problème. ```{r, results="show", purl=TRUE} sessionInfo() ``` -### Where to ask for help? - -- The person sitting next to you during the course. Don't hesitate to - talk to your neighbour during the workshop, compare your answers, - and ask for help. -- Your friendly colleagues: if you know someone with more experience - than you, they might be able and willing to help you. -- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if - your question hasn't been answered before and is well crafted, - chances are you will get an answer in less than 5 min. Remember to - follow their guidelines on how to ask a good +### Où demander de l'aide ? + +- La personne assise à côté de vous pendant le cours. N'hésitez pas à + parler à votre voisin pendant l'atelier, comparer vos réponses, + et demander de l'aide. +- Vos collègues amicaux : si vous connaissez quelqu'un avec plus d'expérience + que vous, il pourra et voudra peut-être vous aider. 
+- [Stack Overflow](https://stackoverflow.com/questions/tagged/r) : si + votre question n'a pas reçu de réponse auparavant et est bien conçue, + il y a de fortes chances que vous obteniez un réponse en moins de 5 minutes. N'oubliez pas de + suivre leurs directives sur comment poser une bonne question. -- The R-help mailing - list: it is read by a - lot of people (including most of the R core team), a lot of people - post to it, but the tone can be pretty dry, and it is not always - very welcoming to new users. If your question is valid, you are - likely to get an answer very fast but don't expect that it will come - with smiley faces. Also, here more than anywhere else, be sure to - use correct vocabulary (otherwise you might get an answer pointing - to the misuse of your words rather than answering your - question). You will also have more success if your question is about - a base function rather than a specific package. -- If your question is about a specific package, see if there is a - mailing list for it. Usually it's included in the DESCRIPTION file - of the package that can be accessed using - `packageDescription("name-of-package")`. You may also want to try to - email the author of the package directly, or open an issue on the - code repository (e.g., GitHub). -- There are also some topic-specific mailing lists (GIS, - phylogenetics, etc...), the complete list is - [here](https://www.r-project.org/mail.html). - -### More resources - -- The [Posting Guide](https://www.r-project.org/posting-guide.html) for - the R mailing lists. - -- How to ask for R - help - useful guidelines. - -- This blog post by Jon +- La liste de diffusion R-help +  : elle est lue par un + grand nombre de personnes (dont la plupart des l'équipe principale de R), beaucoup de gens + y publient des messages, mais le ton peut être assez sec, et il n'est pas toujours + très accueillant pour les nouveaux utilisateurs. 
Si votre question est valide, vous avez + de fortes chances d'obtenir une réponse très rapidement, mais ne vous attendez pas à ce qu'elle vienne + avec des visages souriants. Aussi, ici plus qu'ailleurs, veillez à + utiliser un vocabulaire correct (sinon vous pourriez obtenir une réponse pointant + vers une mauvaise utilisation de vos mots plutôt que de répondre à votre + question). Vous aurez également plus de succès si votre question concerne + une fonction de base plutôt qu'un package spécifique. +- Si votre question concerne un package spécifique, vérifiez s'il existe une liste de diffusion + pour celui-ci. Habituellement, elle est indiquée dans le fichier DESCRIPTION + du package accessible en utilisant + `packageDescription("name-of-package")`. Vous pouvez également essayer d'envoyer + un e-mail directement à l'auteur du package ou d'ouvrir un ticket sur le référentiel de code + (par exemple, GitHub). +- Il existe également quelques listes de diffusion thématiques (SIG, + phylogénétique, etc...), la liste complète est + [ici](https://www.r-project.org/mail.html). + +### Davantage de ressources + +- Le [Guide de publication](https://www.r-project.org/posting-guide.html) pour + les listes de diffusion R. + +- Comment demander de l'aide R + + directives utiles. + +- Ce billet de blog de Jon Skeet - has quite comprehensive advice on how to ask programming questions. + contient des conseils assez complets sur la façon de poser des questions de programmation. -- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package - is very helpful to create reproducible examples when asking for - help. The rOpenSci community call "How to ask questions so they get +- Le package [reprex](https://cran.rstudio.com/web/packages/reprex/) + est très utile pour créer des exemples reproductibles lorsque vous demandez + de l'aide.
The rOpenSci community call "How to ask questions so they get answered" (Github link and video recording) includes a presentation of the reprex package and of its philosophy. -## R packages +## Packages R -### Loading packages +### Chargement des packages -As we have seen above, R packages play a fundamental role in R. The -make use of a package's functionality, assuming it is installed, we -first need to load it to be able to use it. This is done with the -`library()` function. Below, we load `ggplot2`. +Comme nous l'avons vu plus haut, les packages R jouent un rôle fondamental dans R. Pour +utiliser les fonctionnalités d'un package, en supposant qu'il soit installé, il faut +d'abord le charger pour pouvoir l'utiliser. Cela se fait avec la fonction +`library()`. Ci-dessous, nous chargeons `ggplot2`. ```{r loadp, eval=FALSE, purl=TRUE} -library("ggplot2") +library("ggplot2") ``` -### Installing packages +### Installation des packages -The default package repository is The _Comprehensive R Archive -Network_ (CRAN), and any package that is available on CRAN can be -installed with the `install.packages()` function. Below, for example, -we install the `dplyr` package that we will learn about later. +Le référentiel de packages par défaut est le _Comprehensive R Archive +Network_ (CRAN), et tout package disponible sur CRAN peut être +installé avec la fonction `install.packages()`. Ci-dessous, par exemple, +nous installons le package `dplyr` que nous découvrirons plus tard. ```{r craninstall, eval=FALSE, purl=TRUE} install.packages("dplyr") ``` -This command will install the `dplyr` package as well as all its -dependencies, i.e. all the packages that it relies on to function. +Cette commande installera le package `dplyr` ainsi que toutes ses +dépendances, c'est-à-dire tous les packages sur lesquels il s'appuie pour fonctionner. -Another major R package repository is maintained by Bioconductor.
[Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, -namely `BiocManager`, that can be installed from CRAN with +Un autre référentiel majeur de packages R est géré par Bioconductor. Les [packages Bioconductor](https://bioconductor.org/packages/release/BiocViews.html#___Software) sont gérés et installés à l'aide d'un package dédié, +à savoir `BiocManager`, qui peut être installé à partir de CRAN avec ```{r, eval=FALSE, purl=TRUE} install.packages("BiocManager") ``` -Individual packages such as `SummarizedExperiment` (we will use it -later), `DESeq2` (for RNA-Seq analysis), and any others from either Bioconductor or CRAN can then be -installed with `BiocManager::install`. +Des packages individuels tels que `SummarizedExperiment` (nous l'utiliserons +plus tard), `DESeq2` (pour l'analyse RNA-Seq) et tout autre package de Bioconductor ou de CRAN peuvent ensuite être +installés avec `BiocManager::install`. ```{r, eval=FALSE, purl=TRUE} BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` -By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded. +Par défaut, `BiocManager::install()` vérifiera également tous vos packages installés et verra si des versions plus récentes sont disponibles. S'il y en a, il vous les montrera, vous demandera si vous souhaitez `Update all/some/none? [a/s/n]:`, puis attendra votre réponse.
Bien que vous deviez vous efforcer de disposer des versions de packages les plus à jour, en pratique, nous vous recommandons de mettre à jour les packages uniquement lors d'une nouvelle session R avant le chargement des packages. :::::::::::::::::::::::::::::::::::::::: keypoints -- Start using R and RStudio +- Commencez à utiliser R et RStudio -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: From 00636232c8b9ac679b0e7c6e8648378e8f606f00 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:23 +0900 Subject: [PATCH 196/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 895 ++++++++++++++-------------- 1 file changed, 447 insertions(+), 448 deletions(-) diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd index 6b0ca4095..d154be2fb 100644 --- a/locale/zh/episodes/20-r-rstudio.Rmd +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -1,7 +1,7 @@ --- -source: Rmd -title: R and RStudio -teaching: 30 +source: Rmd +title: R 和 RStudio +teaching: 30 exercises: 0 --- @@ -10,329 +10,329 @@ exercises: 0 ::::::::::::::::::::::::::::::::::::::: objectives -- Describe the purpose of the RStudio Script, Console, Environment, and Plots panes. -- Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory. -- Use the built-in RStudio help interface to search for more information on R functions. -- Demonstrate how to provide sufficient information for troubleshooting with the R user community. +- 描述 RStudio 脚本、控制台、环境和绘图窗格的用途。 +- 将一组分析的文件和目录组织为 R 项目,并了解工作目录的用途。 +- 使用内置的 RStudio 帮助界面搜索有关 R 函数的更多信息。 +- 演示如何向 R 用户社区提供足够的信息以进行故障排除。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What are R and RStudio? +- 什么是 R 和 RStudio?
-:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 +> 可视化_课程。 -## What is R? What is RStudio? +## 什么是 R? 什么是 RStudio? -The term [R](https://www.r-project.org/) is used to refer to the -_programming language_, the _environment for statistical computing_ -and _the software_ that interprets the scripts written using it. +术语 [R](https://www.r-project.org/) 用于指代 +_编程语言_、_统计计算环境_ +和 _解释使用该语言编写的脚本的软件_。 -[RStudio](https://rstudio.com) is currently a very popular way to not -only write your R scripts but also to interact with the R -software[^plainr]. To function correctly, RStudio needs R and -therefore both need to be installed on your computer. +[RStudio](https://rstudio.com) 目前是一种非常流行的方式,不仅可以 +编写 R 脚本,还可以与 R +软件进行交互[^plainr]。 为了正常运行,RStudio 需要 R 和 +,因此两者都需要安装在您的计算机上。 -[^plainr]: As opposed to using R directly from the command line - console. There exist other software that interface and integrate - with R, but RStudio is particularly well suited for beginners - while providing numerous very advanced features. +[^plainr]: 与直接从命令行 + 控制台使用 R 相反。 还有其他软件可以将 + 与 R 进行接口和集成,但 RStudio 特别适合初学者 + ,同时提供许多非常高级的功能。 The RStudio IDE Cheat Sheet provides much more information than will be covered here, but can be useful to learn keyboard shortcuts and discover new features. -## Why learn R? +## 为什么要学习 R? -### R does not involve lots of pointing and clicking, and that's a good thing +### R 不需要大量的指向和点击,这是一件好事 -The learning curve might be steeper than with other software, but with -R, the results of your analysis do not rely on remembering a -succession of pointing and clicking, but instead on a series of -written commands, and that's a good thing! 
So, if you want to redo -your analysis because you collected more data, you don't have to -remember which button you clicked in which order to obtain your -results; you just have to run your script again. +学习曲线可能比其他软件更陡峭,但使用 +R,您的分析结果并不依赖于记住 +连续的指向和单击,而是依赖于一系列 +书面命令,这是一件好事! 因此,如果您因为收集了更多数据而想要重新进行 +分析,您不必 +记住您以何种顺序单击了哪个按钮来获得 +结果;您只需再次运行脚本即可。 -Working with scripts makes the steps you used in your analysis clear, -and the code you write can be inspected by someone else who can give -you feedback and spot mistakes. +使用脚本可以使您在分析中使用的步骤更加清晰, +并且您编写的代码可以由其他人检查,他们可以为您提供 +反馈并发现错误。 Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use. -### R code is great for reproducibility +### R 代码具有很好的可重复性 Reproducibility means that someone else (including your future self) can obtain the same results from the same dataset when using the same analysis code. -R integrates with other tools to generate manuscripts or reports from your -code. If you collect more data, or fix a mistake in your dataset, the -figures and the statistical tests in your manuscript or report are updated -automatically. +R 与其他工具集成,从您的 +代码生成手稿或报告。 如果您收集了更多数据,或者修复了数据集中的错误,则手稿或报告中的 +图表和统计测试将自动更新 +。 -An increasing number of journals and funding agencies expect analyses -to be reproducible, so knowing R will give you an edge with these -requirements. +越来越多的期刊和资助机构希望分析 +具有可重复性,因此了解 R 将使您在满足这些 +要求方面更具优势。 -### R is interdisciplinary and extensible +### R 是跨学科且可扩展的 -With 10000+ packages[^whatarepkgs] that can be installed to extend its -capabilities, R provides a framework that allows you to combine -statistical approaches from many scientific disciplines to best suit -the analytical framework you need to analyse your data. For instance, -R has packages for image analysis, GIS, time series, population -genetics, and a lot more. 
+R 有超过 10000 个软件包[^whatarepkgs]可供安装以扩展其 +功能,它提供了一个框架,允许您结合来自许多科学学科的 +统计方法,以最适合 +分析数据所需的分析框架。 例如, +R 具有用于图像分析、GIS、时间序列、群体 +遗传学等的软件包。 -[^whatarepkgs]: i.e. add-ons that confer R with new functionality, - such as bioinformatics data analysis. +[^whatarepkgs]: 即赋予 R 新功能的附加组件, + 例如生物信息学数据分析。 ```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} -knitr::include_graphics("fig/cran.png") +knitr::include_graphics("fig/cran.png") ``` -### R works on data of all shapes and sizes +### R 可处理各种形状和大小的数据 -The skills you learn with R scale easily with the size of your -dataset. Whether your dataset has hundreds or millions of lines, it -won't make much difference to you. +您通过 R 学到的技能会随着您的 +数据集的大小而轻松扩展。 无论您的数据集有数百行还是数百万行, +对您来说都不会有太大区别。 -R is designed for data analysis. It comes with special data structures -and data types that make handling of missing data and statistical -factors convenient. +R 是为数据分析而设计的。 它带有特殊的数据结构 +和数据类型,可以方便地处理缺失数据和统计 +因子。 -R can connect to spreadsheets, databases, and many other data formats, -on your computer or on the web. +R 可以连接到您的计算机或网络上的电子表格、数据库和 +许多其他数据格式。 -### R produces high-quality graphics +### R 生成高质量图形 -The plotting functionalities in R are extensive, and allow you to adjust -any aspect of your graph to convey most effectively the message from -your data. +R 中的绘图功能非常广泛,允许您调整 +图形的任何方面,以最有效地传达来自 +数据的信息。 -### R has a large and welcoming community +### R 有一个庞大而热情的社区 -Thousands of people use R daily. Many of them are willing to help you -through mailing lists and websites such as Stack -Overflow, or on the RStudio -community. These broad user communities -extend to specialised areas such as bioinformatics. One such subset of the R community is [Bioconductor](https://bioconductor.org/), a scientific project for analysis and comprehension "of data from current and emerging biological assays."
This workshop was developed by members of the Bioconductor community; for more information on Bioconductor, please see the companion workshop ["The Bioconductor Project"](https://carpentries-incubator.github.io/bioc-project/). +每天都有成千上万的人使用 R。 他们中的许多人都愿意通过邮件列表和网站(例如 Stack +Overflow 或 RStudio +社区 为您提供帮助 +。 这些广泛的用户社区 +扩展到生物信息学等专业领域。 R 社区的一个这样的子集是 [Bioconductor](https://bioconductor.org/),这是一个用于分析和理解“来自当前和新兴生物检测的数据”的科学项目。 该研讨会由 Bioconductor 社区成员开发;有关 Bioconductor 的更多信息,请参阅配套研讨会 [“Bioconductor 项目”](https://carpentries-incubator.github.io/bioc-project/)。 -### Not only is R free, but it is also open-source and cross-platform +### R 不仅免费,而且开源且跨平台 -Anyone can inspect the source code to see how R works. Because of this -transparency, there is less chance for mistakes, and if you (or -someone else) find some, you can report and fix bugs. +任何人都可以检查源代码来了解 R 的工作原理。 由于这种 +透明度,出现错误的可能性较小,如果您(或 +其他人)发现一些错误,您可以报告并修复错误。 -## Knowing your way around RStudio +## 了解 RStudio -Let's start by learning about [RStudio](https://www.rstudio.com/), -which is an Integrated Development Environment (IDE) for working with -R. +让我们首先了解 [RStudio](https://www.rstudio.com/), +它是一个用于处理 +R 的集成开发环境 (IDE)。 -The RStudio IDE open-source product is free under the Affero General -Public License (AGPL) v3. -The RStudio IDE is also available with a commercial license and -priority email support from Posit, Inc. +RStudio IDE 开源产品在 Affero General +Public License (AGPL) v3 下免费使用。 +RStudio IDE 还提供商业许可和 Posit, Inc. 的 +优先电子邮件支持。 -We will use the RStudio IDE to write code, navigate the files on our -computer, inspect the variables we are going to create, and visualise -the plots we will generate. RStudio can also be used for other things -(e.g., version control, developing packages, writing Shiny apps) that -we will not cover during the workshop. 
+我们将使用 RStudio IDE 编写代码,浏览我们 +计算机上的文件,检查我们要创建的变量,并可视化 +我们将生成的图表。 RStudio 还可用于其他事项 +(例如版本控制、开发包、编写 Shiny 应用程序),而 +我们不会在研讨会期间介绍这些事项。 ```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/rstudio-screenshot.png") +knitr::include_graphics("fig/rstudio-screenshot.png") ``` -The RStudio window is divided into 4 "Panes": +RStudio 窗口分为 4 个“窗格”: -- the **Source** for your scripts and documents (top-left, in the - default layout) -- your **Environment/History** (top-right), -- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and -- the R **Console** (bottom-left). +- 脚本和文档的**来源**(左上角,在 + 默认布局中) +- 你的**环境/历史**(右上), +- 您的 **Files/Plots/Packages/Help/Viewer** (右下角),以及 +- R **控制台**(左下)。 -The placement of these panes and their content can be customised (see -menu, `Tools -> Global Options -> Pane Layout`). +这些窗格的位置和它们的内容可以自定义(参见 +菜单,“工具->全局选项->窗格布局”)。 One of the advantages of using RStudio is that all the information you -need to write code is available in a single window. Additionally, with -many shortcuts, **autocompletion**, and **highlighting** for the major -file types you use while developing in R, RStudio will make typing -easier and less error-prone. - -## Getting set up - -It is good practice to keep a set of related data, analyses, and text -self-contained in a single folder, called the **working -directory**. All of the scripts within this folder can then use -**relative paths** to files that indicate where inside the project a -file is located (as opposed to absolute paths, which point to where a -file is on a specific computer). Working this way makes it a lot +need to write code is available in a single window.
此外,通过 +的许多快捷方式、**自动完成**和**突出显示**,针对您在 R 中开发时使用的主要 +文件类型,RStudio 将使输入 +变得更容易且更不容易出错。 + +## 开始设置 + +将一组相关数据、分析和文本 +保存在一个文件夹中,称为 **工作 +目录**,是一种很好的做法。 然后,此文件夹中的所有脚本都可以使用 +**相对路径** 来指示 +文件在项目内部的位置(而不是绝对路径,绝对路径指向 +文件在特定计算机上的位置)。 Working this way makes it a lot easier to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work. -RStudio provides a helpful set of tools to do this through its "Projects" -interface, which not only creates a working directory for you, but also remembers -its location (allowing you to quickly navigate to it) and optionally preserves -custom settings and open files to make it easier to resume work after a -break. Go through the steps for creating an "R Project" for this -tutorial below. - -1. Start RStudio. -2. Under the `File` menu, click on `New project`. Choose `New directory`, then - `New project`. -3. Enter a name for this new folder (or "directory"), and choose a - convenient location for it. This will be your **working directory** - for this session (or whole course) (e.g., `bioc-intro`). -4. Click on `Create project`. -5. (Optional) Set Preferences to 'Never' save workspace in RStudio. - -RStudio's default preferences generally work well, but saving a workspace to -.RData can be cumbersome, especially if you are working with larger datasets. -To turn that off, go to Tools --> 'Global Options' and select the 'Never' option -for 'Save workspace to .RData' on exit. +RStudio 通过其“项目” +界面提供了一套有用的工具来执行此操作,它不仅可以为您创建工作目录,还可以记住 +它的位置(允许您快速导航到它)并可选择保留 +自定义设置和打开的文件,以便在 +休息后更容易恢复工作。 按照下面为本 +教程创建“R 项目”的步骤进行操作。 + +1. 启动 RStudio。 +2. 在“文件”菜单下,点击“新建项目”。 选择“新目录”,然后 + “新项目”。 +3. 输入这个新文件夹(或“目录”)的名称,并为其选择一个 + 方便的位置。 这将是您本次会话(或整个课程)的**工作目录** + (例如`bioc-intro`)。 +4. 点击“创建项目”。 +5. 
(可选)将首选项设置为“从不”在 RStudio 中保存工作区。 + +RStudio 的默认首选项通常运行良好,但将工作区保存到 +.RData 可能会很麻烦,特别是在处理较大的数据集时。 +要关闭该功能,请转到工具-->“全局选项”,然后将退出时 +“将工作区保存到 .RData”选项设为“从不”。 ```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics("fig/rstudio-preferences.png") +knitr::include_graphics("fig/rstudio-preferences.png") ``` -To avoid character encoding issues between Windows and other operating -systems, we are -going to set UTF-8 by default: +为了避免 Windows 与其他操作系统 +之间的字符编码问题,我们 +将默认设置 UTF-8: ```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. (Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics("fig/utf8.png") +knitr::include_graphics("fig/utf8.png") ``` -### Organizing your working directory - -Using a consistent folder structure across your projects will help keep things -organised, and will also make it easy to find/file things in the future. This -can be especially helpful when you have multiple projects. In general, you may -create directories (folders) for **scripts**, **data**, and **documents**. - -- **`data/`** Use this folder to store your raw data and intermediate - datasets you may create for the need of a particular analysis. For - the sake of transparency and - [provenance](https://en.wikipedia.org/wiki/Provenance), you should - _always_ keep a copy of your raw data accessible and do as much of - your data cleanup and preprocessing programmatically (i.e., with - scripts, rather than manually) as possible. Separating raw data - from processed data is also a good idea. For example, you could - have files `data/raw/tree_survey.plot1.txt` and `...plot2.txt` kept - separate from a `data/processed/tree.survey.csv` file generated by - the `scripts/01.preprocess.tree_survey.R` script.
-- **`documents/`** This would be a place to keep outlines, drafts, - and other text. -- **`scripts/`** (or `src`) This would be the location to keep your R - scripts for different analyses or plotting, and potentially a - separate folder for your functions (more on that later). - -You may want additional directories or subdirectories depending on -your project needs, but these should form the backbone of your working -directory. +### 组织你的工作目录 + +在您的项目中使用一致的文件夹结构将有助于保持事物 +井然有序,并且还能让您在将来轻松查找/归档事物。 当您有多个项目时,这个 +会特别有用。 一般来说,您可以 +为**脚本**、**数据**和**文档**创建目录(文件夹)。 + +- **`data/`** 使用此文件夹存储您可能为特定分析的需要而创建的原始数据和中间 + 数据集。 为了 + 透明度和 + [来源](https://en.wikipedia.org/wiki/Provenance),您应该 + _始终_ 保留原始数据的副本,并尽可能多地以编程方式 (即使用 + 脚本,而不是手动) 完成 + 数据清理和预处理。 将原始数据 + 与处理后的数据分开也是一个好主意。 例如,你可以将 + 文件 `data/raw/tree_survey.plot1.txt` 和 `...plot2.txt` 与 + 由 + `scripts/01.preprocess.tree_survey.R` 脚本生成的 `data/processed/tree.survey.csv` 文件分开保存。 +- **`documents/`** 这里用来保存大纲、草稿、 + 和其他文本。 +- **`scripts/`**(或`src`)这将是保存用于不同分析或绘图的 R + 脚本的位置,并且可能为您的函数保存一个 + 单独的文件夹(稍后会详细介绍)。 + +根据 +项目需求,您可能需要额外的目录或子目录,但这些应该构成您工作 +目录的骨干。 ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/working-directory-structure.png") +knitr::include_graphics("fig/working-directory-structure.png") ``` -For this course, we will need a `data/` folder to store our raw data, -and we will use `data_output/` for when we learn how to export data as -CSV files, and `fig_output/` folder for the figures that we will save. +对于本课程,我们将需要一个 `data/` 文件夹来存储我们的原始数据 +并且当我们学习如何将数据导出为 +CSV 文件时,我们将使用 `data_output/`,以及 `fig_output/` 文件夹来存储我们将保存的图形。 ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: create your project directory structure +## 挑战:创建项目目录结构 -Under the `Files` tab on the right of the screen, click on `New Folder` and -create a folder named `data` within your newly created working directory -(e.g., `~/bioc-intro/data`).
(Alternatively, type `dir.create("data")` at -your R console.) Repeat these operations to create a `data_output/` and a -`fig_output` folders. +在屏幕右侧的“文件”选项卡下,单击“新建文件夹”并 +在新创建的工作目录 +内创建一个名为 `data` 的文件夹(例如,`~/bioc-intro/data`)。 (或者,在 R 控制台 +中输入 `dir.create("data")`。) 重复这些操作来创建 `data_output/` 和 +`fig_output` 文件夹。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -We are going to keep the script in the root of our working directory -because we are only going to use one file and it will make things -easier. +我们将把脚本保存在工作目录 +的根目录中,因为我们只使用一个文件,这将使事情 +更容易。 -Your working directory should now look like this: +您的工作目录现在应如下所示: ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` -**Project management** is also applicable to bioinformatics projects, -of course[^bioindatascience]. William Noble (@Noble:2009) proposes the -following directory structure: +当然,**项目管理**也适用于生物信息学 +项目[^bioindatascience]。 William Noble (@Noble:2009) 建议采用 +以下目录结构: -[^bioindatascience]: In this course, we consider bioinformatics as - data science applied to biological or bio-medical data. +[^bioindatascience]: 在本课程中,我们将生物信息学视为应用于生物或生物医学数据的 + 数据科学。 -> Directory names are in large typeface, and filenames are in smaller -> typeface. Only a subset of the files are shown here. Note that the +> 目录名称采用大字体,文件名采用较小的 +> 字体。 此处仅显示一部分文件。 Note that the > dates are formatted `<year>-<month>-<day>` so that they can be -> sorted in chronological order. The source code `src/ms-analysis.c` -> is compiled to create `bin/ms-analysis` and is documented in -> `doc/ms-analysis.html`. The `README` files in the data directories -> specify who downloaded the data files from what URL on what -> date.
The driver script `results/2009-01-15/runall` automatically -> generates the three subdirectories split1, split2, and split3, -> corresponding to three cross-validation splits. The -> `bin/parse-sqt.py` script is called by both of the `runall` driver -> scripts. +> sorted in chronological order. 源代码 `src/ms-analysis.c` +> 被编译以创建 `bin/ms-analysis` 并记录在 +> `doc/ms-analysis.html` 中。 数据目录 +> 中的 `README` 文件指定了谁在哪个 +> 日期从哪个 URL 下载了数据文件。 驱动脚本 `results/2009-01-15/runall` 自动 +> 生成三个子目录 split1、split2 和 split3, +> 对应三个交叉验证分割。 +> `bin/parse-sqt.py` 脚本被两个 `runall` 驱动程序 +> 脚本调用。 ```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} -knitr::include_graphics("fig/noble-bioinfo-project.png") +knitr::include_graphics("fig/noble-bioinfo-project.png") ``` -The most important aspect of a well defined and well documented -project directory is to enable someone unfamiliar with the -project[^futureself] to +定义明确、记录良好的 +项目目录最重要的方面是,让不熟悉 +项目[^futureself] 的人能够 1. understand what the project is about, what data are available, what analyses were run, and what results were produced and, most importantly to -2. repeat the analysis over again - with new data, or changing some - analysis parameters. +2. 再次重复分析 - 使用新数据,或更改一些 + 分析参数。 [^futureself]: That someone could be, and very likely will be your future self, a couple of months or years after the analyses were run. -### The working directory +### 工作目录 -The working directory is an important concept to understand. It is the -place from where R will be looking for and saving the files. When you -write code for your project, it should refer to files in relation to -the root of your working directory and only need files within this -structure. +工作目录是一个需要理解的重要概念。 它是 R 查找和保存文件的 +位置。 当你 +为你的项目编写代码时,它应该引用与 +你的工作目录的根目录相关的文件,并且只需要这个 +结构内的文件。 -Using RStudio projects makes this easy and ensures that your working -directory is set properly. If you need to check it, you can use -`getwd()`.
If for some reason your working directory is not what it -should be, you can change it in the RStudio interface by navigating in -the file browser where your working directory should be, and clicking -on the blue gear icon `More`, and select `Set As Working Directory`. -Alternatively you can use `setwd("/path/to/working/directory")` to -reset your working directory. However, your scripts should not include -this line because it will fail on someone else's computer. +使用 RStudio 项目可以轻松实现此目的并确保您的工作 +目录设置正确。 如果您需要检查它,您可以使用 +`getwd()`。 如果由于某种原因您的工作目录不是它 +应该的样子,您可以在 RStudio 界面中更改它,方法是在 +文件浏览器中导航到您的工作目录应该所在的位置,然后单击 +蓝色齿轮图标“更多”,然后选择“设置为工作目录”。 +或者,您可以使用 `setwd("/path/to/working/directory")` 来 +重置您的工作目录。 但是,您的脚本不应该包含 +这一行,因为它会在别人的计算机上失败。 -**Example** +**例子** -The schema below represents the working directory `bioc-intro` with the -`data` and `fig_output` sub-directories, and 2 files in the latter: +下面的模式表示工作目录“bioc-intro”,其中包含 +“data”和“fig_output”子目录,以及后者中的 2 个文件: ``` bioc-intro/data/ @@ -340,314 +340,313 @@ bioc-intro/data/ /fig_output/fig2.png ``` -If we were in the working directory, we could refer to the `fig1.pdf` -file using the relative path `bioc-intro/fig_output/fig1.pdf` or the -absolute path `/home/user/bioc-intro/fig_output/fig1.pdf`. - -If we were in the `data` directory, we would use the relative path -`../fig_output/fig1.pdf` or the same absolute path -`/home/user/bioc-intro/fig_output/fig1.pdf`. - -## Interacting with R - -The basis of programming is that we write down instructions for the -computer to follow, and then we tell the computer to follow those -instructions. We write, or _code_, instructions in R because it is a -common language that both the computer and we can understand. We call -the instructions _commands_ and we tell the computer to follow the -instructions by _executing_ (also called _running_) those commands. 
- -There are two main ways of interacting with R: by using the -**console** or by using **scripts** (plain text files that contain -your code). The console pane (in RStudio, the bottom left panel) is -the place where commands written in the R language can be typed and -executed immediately by the computer. It is also where the results -will be shown for commands that have been executed. You can type -commands directly into the console and press `Enter` to execute those -commands, but they will be forgotten when you close the session. - -Because we want our code and workflow to be reproducible, it is better -to type the commands we want in the script editor, and save the -script. This way, there is a complete record of what we did, and -anyone (including our future selves!) can easily replicate the -results on their computer. Note, however, that merely typing the commands -in the script does not automatically _run_ them - they still need to -be sent to the console for execution. - -RStudio allows you to execute commands directly from the script editor -by using the `Ctrl` + `Enter` shortcut (on Macs, `Cmd` + `Return` will -work, too). The command on the current line in the script (indicated -by the cursor) or all of the commands in the currently selected text -will be sent to the console and executed when you press `Ctrl` + -`Enter`. You can find other keyboard shortcuts in this RStudio -cheatsheet about the RStudio -IDE. - -At some point in your analysis you may want to check the content of a -variable or the structure of an object, without necessarily keeping a -record of it in your script. You can type these commands and execute -them directly in the console. RStudio provides the `Ctrl` + `1` and -`Ctrl` + `2` shortcuts allow you to jump between the script and the -console panes. - -If R is ready to accept commands, the R console shows a `>` prompt. 
If -it receives a command (by typing, copy-pasting or sending from the script -editor using `Ctrl` + `Enter`), R will try to execute it, and when -ready, will show the results and come back with a new `>` prompt to -wait for new commands. +如果我们在工作目录中,我们可以使用相对路径“bioc-intro/fig_output/fig1.pdf”或 +绝对路径“/home/user/bioc-intro/fig_output/fig1.pdf”引用“fig1.pdf” +文件。 + +如果我们在“数据”目录中,我们将使用相对路径 +“../fig_output/fig1.pdf”或相同的绝对路径 +“/home/user/bioc-intro/fig_output/fig1.pdf”。 + +## 与 R 交互 + +编程的基础是我们写下 +计算机要遵循的指令,然后我们告诉计算机遵循那些 +指令。 我们用 R 编写或_编码_指令,因为它是一种 +通用语言,计算机和我们都能理解。 我们将 +指令称为 _命令_,并告诉计算机通过 _执行_(也称为 _运行_)这些命令来遵循 +指令。 + +与 R 交互的主要方式有两种:使用 +**控制台** 或使用 **脚本**(包含 +代码的纯文本文件)。 控制台窗格(在 RStudio 中,左下方面板)是 +用 R 语言编写的命令可以被计算机输入并 +立即执行的地方。 它还将显示已执行命令的结果 +。 您可以直接在控制台中输入 +命令,然后按 `Enter` 来执行那些 +命令,但是当您关闭会话时它们会被遗忘。 + +因为我们希望我们的代码和工作流程具有可重现性,所以最好 +在脚本编辑器中输入我们想要的命令,然后保存 +脚本。 这样,我们所做的事情就有了完整的记录,而且 +任何人(包括我们未来的自己!) 可以轻松地在他们的计算机上复制 +的结果。 但是请注意,仅在脚本中输入命令 +并不能自动_运行_它们 - 它们仍然需要通过 +发送到控制台进行执行。 + +RStudio 允许您使用 `Ctrl` + `Enter` 快捷键直接从脚本编辑器 +执行命令(在 Mac 上,`Cmd` + `Return` 也可以 +起作用)。 当您按下 `Ctrl` + +`Enter` 时,脚本中当前行的命令(光标指示为 +)或当前选定的文本 +中的所有命令将被发送到控制台并执行。 您可以在此 RStudio +有关 RStudio +IDE 的备忘单 中找到其他键盘快捷键。 + +在分析的某个阶段,您可能想要检查 +变量的内容或对象的结构,而不一定在脚本中保留它的 +记录。 You can type these commands and execute +them directly in the console. RStudio 提供了 `Ctrl` + `1` 和 +`Ctrl` + `2` 快捷键,允许您在脚本和 +控制台窗格之间跳转。 + +如果 R 准备好接受命令,R 控制台将显示 `>` 提示。 如果 +收到一个命令(通过键入、复制粘贴或使用 `Ctrl` + `Enter` 从脚本 +编辑器发送),R 将尝试执行它,并且当 +准备就绪时,将显示结果并返回一个新的 `>` 提示符以 +等待新命令。 If R is still waiting for you to enter more data because it isn't -complete yet, the console will show a `+` prompt. It means that you -haven't finished entering a complete command. This is because you have +complete yet, the console will show a `+` prompt. 这意味着你 +还没有输入完整的命令。 This is because you have not 'closed' a parenthesis or quotation, i.e. you don't have the same number of left-parentheses as right-parentheses, or the same number of -opening and closing quotation marks. 
When this happens, and you -thought you finished typing your command, click inside the console -window and press `Esc`; this will cancel the incomplete command and -return you to the `>` prompt. - -## How to learn more during and after the course? - -The material we cover during this course will give you an initial -taste of how you can use R to analyse data for your own -research. However, you will need to learn more to do advanced -operations such as cleaning your dataset, using statistical methods, -or creating beautiful graphics[^inthiscoure]. The best way to become +opening and closing quotation marks. 当这种情况发生时,如果你 +认为你已经完成了命令输入,请单击控制台 +窗口内并按 `Esc`;这将取消不完整的命令并且 +返回到 `>` 提示符。 + +## 如何在课程中和课程结束后学习更多知识? + +我们在本课程中涵盖的材料将为您提供初步的 +体验如何使用 R 分析数据以进行您自己的 +研究。 但是,您需要学习更多知识才能执行高级 +操作,例如清理数据集、使用统计方法、 +或创建漂亮的图形[^inthiscoure]。 The best way to become proficient and efficient at R, as with any other tool, is to use it to -address your actual research questions. As a beginner, it can feel -daunting to have to write a script from scratch, and given that many -people make their code available online, modifying existing code to -suit your purpose might make it easier for you to get started. +address your actual research questions. 对于初学者来说,从头开始编写脚本可能会让人感到 +畏惧,而考虑到许多 +人将他们的代码发布到网上,修改现有代码以 +满足你的目的可能会让你更容易上手。 -[^inthiscoure]: We will introduce most of these (except statistics) - here, but will only manage to scratch the surface of the wealth of - what is possible to do with R. 
+[^inthiscoure]: 我们将在这里介绍其中的大部分内容(统计学除外) + ,但只能触及使用 R 可以实现的 + 丰富内容的表面。 ```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} -knitr::include_graphics("fig/kitten-try-things.jpg") +knitr::include_graphics("fig/kitten-try-things.jpg") ``` -## Seeking help +## 寻求帮助 -### Use the built-in RStudio help interface to search for more information on R functions +### 使用内置的 RStudio 帮助界面搜索有关 R 函数的更多信息 ```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics("fig/rstudiohelp.png") +knitr::include_graphics("fig/rstudiohelp.png") ``` -One of the fastest ways to get help, is to use the RStudio help -interface. This panel by default can be found at the lower right hand -panel of RStudio. As seen in the screenshot, by typing the word -"Mean", RStudio tries to also give a number of suggestions that you -might be interested in. The description is then shown in the display -window. +获得帮助的最快方法之一是使用 RStudio 帮助 +界面。 默认情况下,该面板位于 RStudio 右下角 +面板。 如屏幕截图所示,通过输入单词 +“Mean”,RStudio 还会尝试提供一些您 +可能感兴趣的建议。 然后描述就会显示在显示 +窗口中。 -### I know the name of the function I want to use, but I'm not sure how to use it +### 我知道我想使用的函数的名称,但我不知道如何使用它 -If you need help with a specific function, let's say `barplot()`, you -can type: +如果您需要有关特定函数的帮助,比如说 `barplot()`,您 +可以输入: ```{r, eval=FALSE, purl=TRUE} -?barplot +?barplot ``` -If you just need to remind yourself of the names of the arguments, you can use: +如果您只需要提醒自己参数的名称,您可以使用: ```{r, eval=FALSE, purl=TRUE} -args(lm) +args(lm) ``` -### I want to use a function that does X, there must be a function for it but I don't know which one... +### 我想使用一个执行 X 的函数,一定有一个函数可以执行该操作,但我不知道是哪一个...... -If you are looking for a function to do a particular task, you can use the -`help.search()` function, which is called by the double question mark `??`.
+如果您正在寻找一个函数来执行特定任务,您可以使用 +`help.search()` 函数,它可以用双问号 `??` 来调用。 However, this only looks through the installed packages for help pages with a match to your search request ```{r, eval=FALSE, purl=TRUE} -??kruskal +??kruskal ``` -If you can't find what you are looking for, you can use -the [rdocumentation.org](https://www.rdocumentation.org) website that searches -through the help files across all packages available. +如果您找不到所需内容,您可以使用 +[rdocumentation.org](https://www.rdocumentation.org) 网站,该网站会搜索 +所有可用软件包中的帮助文件。 Finally, a generic Google or internet search "R \<task>" will often either send you to the appropriate package documentation or a helpful forum where someone else has already asked your question. -### I am stuck... I get an error message that I don't understand +### 我被困住了…… 我收到一条我无法理解的错误消息 -Start by googling the error message. However, this doesn't always work very well -because often, package developers rely on the error catching provided by R. You -end up with general error messages that might not be very helpful to diagnose a -problem (e.g. "subscript out of bounds"). If the message is very generic, you -might also include the name of the function or package you're using in your -query. +首先通过谷歌搜索错误信息。 但是,这种方法并不总是能很好地发挥作用 +因为通常包开发人员依赖于 R 提供的错误捕获功能。您 +最终会得到一般错误消息,而这些消息可能对诊断 +问题没有多大帮助(例如 "subscript out of bounds")。 如果消息非常通用,您 +可能还会在 +查询中包含您正在使用的函数或包的名称。 -However, you should check Stack Overflow. Search using the `[r]` tag. Most -questions have already been answered, but the challenge is to use the right -words in the search to find the -answers: +但是,您应该检查一下 Stack Overflow。 使用 `[r]` 标签搜索。 大多数 +问题已经得到解答,但挑战在于在搜索中使用正确的 +词来找到 +答案: [http://stackoverflow.com/questions/tagged/r](https://stackoverflow.com/questions/tagged/r) -The [Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.pdf) can -also be dense for people with little programming experience but it is a good -place to understand the underpinnings of the R language.
+[R 简介](https://cran.r-project.org/doc/manuals/R-intro.pdf) 对于编程经验较少的人来说可能 +比较难懂,但是它是 +了解 R 语言基础知识的好地方。 -The [R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical -but it is full of useful information. +[R FAQ](https://cran.r-project.org/doc/FAQ/R-FAQ.html) 内容密集且技术性很强 +但其中包含大量有用信息。 -### Asking for help +### 寻求帮助 -The key to receiving help from someone is for them to rapidly grasp -your problem. You should make it as easy as possible to pinpoint where +获得他人帮助的关键是让他们迅速掌握 +你的问题。 You should make it as easy as possible to pinpoint where the issue might be. -Try to use the correct words to describe your problem. For instance, a -package is not the same thing as a library. Most people will -understand what you meant, but others have really strong feelings -about the difference in meaning. The key point is that it can make -things confusing for people trying to help you. Be as precise as -possible when describing your problem. - -If possible, try to reduce what doesn't work to a simple _reproducible -example_. If you can reproduce the problem using a very small data -frame instead of your 50000 rows and 10000 columns one, provide the -small one with the description of your problem. When appropriate, try -to generalise what you are doing so even people who are not in your -field can understand the question. For instance instead of using a -subset of your real dataset, create a small (3 columns, 5 rows) -generic one. For more information on how to write a reproducible -example see this article by Hadley -Wickham. - -To share an object with someone else, if it's relatively small, you -can use the function `dput()`. It will output R code that can be used -to recreate the exact same object as the one in memory: +尝试使用正确的词语来描述你的问题。 例如, +包与库不同。 大多数人会 +理解你的意思,但其他人对含义的差异有很强烈的感受 +。 The key point is that it can make +things confusing for people trying to help you. 
描述问题时请尽可能 +精确。 + +如果可能的话,尝试将不起作用的部分简化为一个简单的_可重现的 +示例_。 如果您可以使用非常小的数据 +框而不是 50000 行和 10000 列的数据框重现该问题,请提供 +小数据框并描述您的问题。 在适当的时候,尝试 +来概括你正在做的事情,这样即使不在你的 +领域的人也能理解这个问题。 例如,不要使用真实数据集的 +子集,而是创建一个小的(3 列,5 行) +通用数据集。 有关如何编写可重现的 +示例的更多信息,请参阅 Hadley +Wickham 的这篇文章。 + +要与他人共享一个对象,如果它相对较小,您 +可以使用函数 `dput()`。 它将输出可用于 +重新创建与内存中完全相同的对象的 R 代码: ```{r, results="show", purl=TRUE} -## iris is an example data frame that comes with R and head() is a -## function that returns the first part of the data frame +## iris 是 R 附带的一个示例数据框,head() 是一个 +## 函数,返回数据框的第一部分 dput(head(iris)) ``` -If the object is larger, provide either the raw file (i.e., your CSV -file) with your script up to the point of the error (and after -removing everything that is not relevant to your -issue). Alternatively, in particular if your question is not related +如果对象较大,请提供原始文件(即您的 CSV +文件)以及您的脚本直到出现错误的位置(并且在 +之后删除与您的 +问题无关的所有内容)。 Alternatively, in particular if your question is not related to a data frame, you can save any R object to a file[^export]: ```{r, eval=FALSE, purl=FALSE} -saveRDS(iris, file="/tmp/iris.rds") +saveRDS(iris, file="/tmp/iris.rds") ``` -The content of this file is however not human readable and cannot be -posted directly on Stack Overflow. Instead, it can be sent to someone -by email who can read it with the `readRDS()` command (here it is -assumed that the downloaded file is in a `Downloads` folder in the -user's home directory): +但是,该文件的内容不是人类可读的,并且无法 +直接发布在 Stack Overflow 上。 相反,它可以通过电子邮件发送给某个人 +,该人可以使用 `readRDS()` 命令阅读它(这里 +假设下载的文件位于 +用户主目录中的 `Downloads` 文件夹中): ```{r, eval=FALSE, purl=FALSE} -some_data <- readRDS(file="~/Downloads/iris.rds") +some_data <- readRDS(file="~/Downloads/iris.rds") ``` -Last, but certainly not least, **always include the output of `sessionInfo()`** -as it provides critical information about your platform, the versions of R and -the packages that you are using, and other information that can be very helpful -to understand your problem.
+最后,但同样重要的一点是,**始终包含 `sessionInfo()`** +的输出,因为它提供了有关您的平台、R 版本和 +您正在使用的软件包的重要信息,以及其他对 +理解您的问题非常有帮助的信息。 ```{r, results="show", purl=TRUE} -sessionInfo() +sessionInfo() ``` -### Where to ask for help? +### 去哪里寻求帮助? -- The person sitting next to you during the course. Don't hesitate to - talk to your neighbour during the workshop, compare your answers, - and ask for help. -- Your friendly colleagues: if you know someone with more experience - than you, they might be able and willing to help you. -- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): if - your question hasn't been answered before and is well crafted, - chances are you will get an answer in less than 5 min. Remember to +- 课程期间坐在您旁边的人。 不要犹豫, + 在研讨会期间与你的邻居交谈,比较你的答案, + 并寻求帮助。 +- 友好的同事:如果您认识某个比您更有经验 + 的人,他们也许能够并且愿意帮助您。 +- [Stack Overflow](https://stackoverflow.com/questions/tagged/r): 如果 + 你的问题之前没有被回答过并且表述得当,那么 + 你很有可能在 5 分钟内得到答案。 Remember to
You may also want to try to - email the author of the package directly, or open an issue on the - code repository (e.g., GitHub). -- There are also some topic-specific mailing lists (GIS, - phylogenetics, etc...), the complete list is - [here](https://www.r-project.org/mail.html). - -### More resources - -- The [Posting Guide](https://www.r-project.org/posting-guide.html) for - the R mailing lists. - -- How to ask for R - help - useful guidelines. - -- This blog post by Jon - Skeet - has quite comprehensive advice on how to ask programming questions. - -- The [reprex](https://cran.rstudio.com/web/packages/reprex/) package - is very helpful to create reproducible examples when asking for - help. The rOpenSci community call "How to ask questions so they get - answered" (Github - link and video - recording) includes a presentation of - the reprex package and of its philosophy. - -## R packages - -### Loading packages - -As we have seen above, R packages play a fundamental role in R. The -make use of a package's functionality, assuming it is installed, we -first need to load it to be able to use it. This is done with the -`library()` function. Below, we load `ggplot2`. + with smiley faces. 
此外,在这里比在其他地方更重要的是,一定要确保 + 使用正确的词汇(否则您可能会得到一个指向 + 用词不当的答案,而不是回答您的 + 问题)。 如果您的问题是关于 + 基本函数而不是特定的包,您也会获得更多的成功。 +- 如果您的问题是关于特定软件包的,请查看是否有该软件包的 + 邮件列表。 通常它包含在包的 DESCRIPTION 文件 + 中,可以使用 + `packageDescription("name-of-package")` 来访问。 您可能还想尝试 + 直接给软件包的作者发送电子邮件,或者在 + 代码存储库(例如 GitHub)上提交一个 issue。 +- 还有一些特定主题的邮件列表(GIS、 + 系统发育学等...),完整列表在 + [这里](https://www.r-project.org/mail.html)。 + +### 更多资源 + +- R 邮件列表的 [发帖指南](https://www.r-project.org/posting-guide.html)。 + +- 如何寻求 R + 帮助 + 有用的指南。 + +- Jon + Skeet 的这篇博客文章 + 对如何提出编程问题提供了相当全面的建议。 + +- [reprex](https://cran.rstudio.com/web/packages/reprex/) 包 + 在寻求 + 帮助时对于创建可重现的示例非常有帮助。 rOpenSci 社区电话会议“如何提出问题以便得到 + 答案”(Github + 链接 和 视频 + 录像)包括对 + reprex 包及其理念的介绍。 + +## R 包 + +### 加载包 + +正如我们上面看到的,R 包在 R 中起着基础性的作用。 +要利用包的功能,假设它已安装,我们 +首先需要加载它才能使用它。 这是通过 +`library()` 函数完成的。 下面,我们加载 `ggplot2`。 ```{r loadp, eval=FALSE, purl=TRUE} -library("ggplot2") +library("ggplot2") ``` -### Installing packages +### 安装软件包 The default package repository is The _Comprehensive R Archive Network_ (CRAN), and any package that is available on CRAN can be -installed with the `install.packages()` function. Below, for example, -we install the `dplyr` package that we will learn about later. +installed with the `install.packages()` function. 下面,例如, +我们安装稍后将了解的 `dplyr` 包。 ```{r craninstall, eval=FALSE, purl=TRUE} -install.packages("dplyr") +install.packages("dplyr") ``` -This command will install the `dplyr` package as well as all its -dependencies, i.e. all the packages that it relies on to function. +此命令将安装 `dplyr` 包及其所有 +依赖项,即它所依赖的所有包。 -Another major R package repository is maintained by Bioconductor.
[Bioconductor packages](https://bioconductor.org/packages/release/BiocViews.html#___Software) are managed and installed using a dedicated package, -namely `BiocManager`, that can be installed from CRAN with +另一个主要的 R 包存储库由 Bioconductor 维护。 [Bioconductor 软件包](https://bioconductor.org/packages/release/BiocViews.html#___Software) 使用专用软件包 +即 `BiocManager` 进行管理和安装,可以使用以下命令从 CRAN 安装: ```{r, eval=FALSE, purl=TRUE} -install.packages("BiocManager") +install.packages("BiocManager") ``` Individual packages such as `SummarizedExperiment` (we will use it @@ -659,10 +658,10 @@ BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` -By default, `BiocManager::install()` will also check all your installed packages and see if there are newer versions available. If there are, it will show them to you and ask you if you want to `Update all/some/none? [a/s/n]:` and then wait for your answer. While you should strive to have the most up-to-date package versions, in practice we recommend only updating packages in a fresh R session before any packages are loaded.
+默认情况下,`BiocManager::install()` 还将检查所有已安装的软件包,查看是否有可用的新版本。 如果有,它会向您显示并询问您是否要 `Update all/some/none? [a/s/n]:`,然后等待您的答复。 虽然您应该努力获得最新的软件包版本,但实际上我们建议仅在加载任何包之前在新的 R 会话中更新包。 :::::::::::::::::::::::::::::::::::::::: keypoints -- Start using R and RStudio +- 开始使用 R 和 RStudio -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 575e762adeeee64456bfcd19f00f11d2c485d28a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:25 +0900 Subject: [PATCH 197/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 920 +++++++++++----------- 1 file changed, 460 insertions(+), 460 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index 410e507fd..039310451 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Introduction to R +title: Introduction à R teaching: 60 exercises: 60 --- @@ -8,374 +8,374 @@ exercises: 60 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +::::::::::::::::::::::::::::::::::::::: objectives -- Define the following terms as they relate to R: object, assign, call, function, arguments, options. -- Assign values to objects in R. -- Learn how to _name_ objects -- Use comments to inform script. -- Solve simple arithmetic operations in R.
+- Appelez des fonctions et utilisez des arguments pour modifier leurs options par défaut. +- Inspectez le contenu des vecteurs et manipulez leur contenu. +- Sous-ensembler et extraire des valeurs à partir de vecteurs. +- Analysez les vecteurs avec des données manquantes. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: questions -- First commands in R +- Premières commandes dans R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Data Analysis and +> Visualisation in R for Ecologists_ de Data Carpentries. -## Creating objects in R +## Créer des objets dans R -You can get output from R simply by typing math in the console: +Vous pouvez obtenir un résultat de R simplement en tapant des opérations mathématiques dans la console : ```{r, purl=TRUE} 3 + 5 12 / 7 ``` -However, to do useful and interesting things, we need to assign _values_ to -_objects_. To create an object, we need to give it a name followed by the -assignment operator `<-`, and the value we want to give it: +Cependant, pour faire des choses utiles et intéressantes, nous devons attribuer des _valeurs_ à +des _objets_. Pour créer un objet, nous devons lui donner un nom suivi de l'opérateur d'affectation +`<-`, et de la valeur que nous voulons lui donner : ```{r, purl=TRUE} -weight_kg <- 55 +weight_kg <- 55 ``` -`<-` is the assignment operator. It assigns values on the right to -objects on the left. So, after executing `x <- 3`, the value of `x` is -`3`. The arrow can be read as 3 **goes into** `x`. For historical -reasons, you can also use `=` for assignments, but not in every -context.
Because of the -[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) -in syntax, it is good practice to always use `<-` for assignments. +`<-` est l'opérateur d'affectation. Il attribue des valeurs à droite aux objets +à gauche. Ainsi, après avoir exécuté `x <- 3`, la valeur de `x` est +`3`. La flèche peut être lue comme 3 **entre dans** `x`. Pour des raisons historiques +, vous pouvez également utiliser `=` pour les affectations, mais pas dans tous les contextes +. En raison du +[légères différences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) +dans la syntaxe, il est une bonne pratique de toujours utiliser `<-` pour les affectations. In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd> at the same time as the <kbd>\-</kbd> key) will write `<-` in a single keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the same in a Mac. -### Naming variables - -Objects can be given any name such as `x`, `current_temperature`, or -`subject_id`. You want your object names to be explicit and not too -long. They cannot start with a number (`2x` is not valid, but `x2` -is). R is case sensitive (e.g., `weight_kg` is different from -`Weight_kg`). There are some names that cannot be used because they -are the names of fundamental functions in R (e.g., `if`, `else`, -`for`, see -[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) -for a complete list). In general, even if it's allowed, it's best to -not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`, -`weights`). If in doubt, check the help to see if the name is already -in use. It's also best to avoid dots (`.`) within an object name as in +### Nommer les variables + +Les objets peuvent recevoir n'importe quel nom tel que « x », « current_temperature » ou +« subject_id ». 
Vous voulez que les noms de vos objets soient explicites et pas trop +longs. Ils ne peuvent pas commencer par un nombre (`2x` n'est pas valide, mais `x2` +l'est). R est sensible à la casse (par exemple, `weight_kg` est différent de +`Weight_kg`). Certains noms ne peuvent pas être utilisés car ils +sont les noms de fonctions fondamentales dans R (par exemple, `if`, `else`, +`for`, voir +[ ici](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html) +pour une liste complète). En général, même si c'est autorisé, il est préférable de +de ne pas utiliser d'autres noms de fonctions (par exemple, `c`, `T`, `mean`, `data`, `df`, +` poids`). En cas de doute, consultez l'aide pour voir si le nom est déjà +utilisé. Il est également préférable d'éviter les points (`.`) dans un nom d'objet comme dans `my.dataset`. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it's best to avoid -them. It is also recommended to use nouns for object names, and verbs -for function names. It's important to be consistent in the styling of -your code (where you put spaces, how you name objects, etc.). Using a -consistent coding style makes your code clearer to read for your -future self and your collaborators. In R, some popular style guides -are [Google's](https://google.github.io/styleguide/Rguide.xml), the -[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor +them. Il est également recommandé d'utiliser des noms pour les noms d'objets et des verbes +pour les noms de fonctions. Il est important d'être cohérent dans le style de +votre code (où vous placez les espaces, comment vous nommez les objets, etc.). L'utilisation d'un style de codage +cohérent rend votre code plus clair à lire pour votre +futur moi et vos collaborateurs. 
Dans R, certains guides de style populaires +sont [de Google](https://google.github.io/styleguide/Rguide.xml), le +[tidyverse](https://style. Tidyverse.org/) et le Bioconductor style -guide. The -tidyverse's is very comprehensive and may seem overwhelming at -first. You can install the -[**`lintr`**](https://github.com/jimhester/lintr) package to -automatically check for issues in the styling of your code. +guide. Le +Tidyverse est très complet et peut sembler écrasant au début +. Vous pouvez installer le package +[**`lintr`**](https://github.com/jimhester/lintr) pour +vérifier automatiquement les problèmes dans le style de votre code. -> **Objects vs. variables**: What are known as `objects` in `R` are -> known as `variables` in many other programming languages. Depending -> on the context, `object` and `variable` can have drastically -> different meanings. However, in this lesson, the two words are used -> synonymously. For more information -> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) +> **Objets et variables** : ce que l'on appelle des « objets » dans « R » sont +> connus sous le nom de « variables » dans de nombreux autres langages de programmation. Selon +> le contexte, « objet » et « variable » peuvent avoir des significations radicalement +> différentes. Cependant, dans cette leçon, les deux mots sont utilisés +> de manière synonyme. Pour plus d'informations +> [voir ici.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) -When assigning a value to an object, R does not print anything. You -can force R to print the value by using parentheses or by typing the -object name: +Lors de l'attribution d'une valeur à un objet, R n'imprime rien. 
Vous +pouvez forcer R à imprimer la valeur en utilisant des parenthèses ou en tapant le +nom de l'objet : ```{r, purl=TRUE} -weight_kg <- 55 # doesn't print anything -(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` -weight_kg # and so does typing the name of the object +weight_kg <- 55 # n'imprime rien +(weight_kg <- 55) # mais mettre des parenthèses autour de l'appel imprime la valeur de `weight_kg` +weight_kg # tout comme taper le nom de l'objet ``` -Now that R has `weight_kg` in memory, we can do arithmetic with it. For -instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): +Maintenant que R a `weight_kg` en mémoire, nous pouvons faire de l'arithmétique avec. Par +exemple, nous pouvons vouloir convertir ce poids en livres (le poids en livres est 2,2 fois le poids en kg) : ```{r, purl=TRUE} -2.2 * weight_kg +2.2 * weight_kg ``` -We can also change an object's value by assigning it a new one: +On peut également changer la valeur d'un objet en lui attribuant une nouvelle valeur : ```{r, purl=TRUE} -weight_kg <- 57.5 -2.2 * weight_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` -This means that assigning a value to one object does not change the values of -other objects For example, let's store the animal's weight in pounds in a new -object, `weight_lb`: +Cela signifie que l'attribution d'une valeur à un objet ne modifie pas les valeurs des +autres objets. Par exemple, stockons le poids de l'animal en livres dans un nouvel +objet, `weight_lb` : ```{r, purl=TRUE} -weight_lb <- 2.2 * weight_kg +weight_lb <- 2.2 * weight_kg ``` -and then change `weight_kg` to 100. +puis remplacez la valeur de `weight_kg` par 100. ```{r} -weight_kg <- 100 +weight_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Défi : -What do you think is the current content of the object `weight_lb`? -126\.5 or 220?
+Selon vous, quel est le contenu actuel de l'objet `weight_lb` ? +126\.5 ou 220 ? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Comments +## Commentaires -The comment character in R is `#`, anything to the right of a `#` in a -script will be ignored by R. It is useful to leave notes, and -explanations in your scripts. +Le caractère de commentaire dans R est `#`, tout ce qui se trouve à droite d'un `#` dans un script +sera ignoré par R. Il est utile de laisser des notes et +des explications dans vos scripts. -RStudio makes it easy to comment or uncomment a paragraph: after -selecting the lines you want to comment, press at the same time on -your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If +RStudio permet de commenter ou décommenter facilement un paragraphe : après +avoir sélectionné les lignes que vous souhaitez commenter, appuyez en même temps sur +votre clavier <kbd>Ctrl</kbd> + <kbd>Maj</kbd> + <kbd>C</kbd>. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -What are the values after each statement in the following? +Quelles sont les valeurs après chacune des instructions suivantes ? ```{r, purl=TRUE} -mass <- 47.5 # mass? -age <- 122 # age? -mass <- mass * 2.0 # mass? -age <- age - 20 # age? -mass_index <- mass/age # mass_index? +mass <- 47.5 # mass ? +age <- 122 # age ? +mass <- mass * 2.0 # mass ? +age <- age - 20 # age ? +mass_index <- mass/age # mass_index ?
``` -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Functions and their arguments +## Fonctions et leurs arguments -Functions are "canned scripts" that automate more complicated sets of commands -including operations assignments, etc. Many functions are predefined, or can be -made available by importing R _packages_ (more on that later). A function -usually gets one or more inputs called _arguments_. Functions often (but not -always) return a _value_. A typical example would be the function `sqrt()`. The -input (the argument) must be a number, and the return value (in fact, the -output) is the square root of that number. Executing a function ('running it') -is called _calling_ the function. An example of a function call is: +Les fonctions sont des "scripts prédéfinis" qui automatisent des ensembles de commandes plus complexes +, y compris les affectations d'opérations, etc. De nombreuses fonctions sont prédéfinies ou peuvent être +rendues disponibles en important des _packages_ R (nous en parlerons plus tard). Une fonction +obtient généralement une ou plusieurs entrées appelées _arguments_. Les fonctions renvoient souvent (mais pas +toujours) une _valeur_. Un exemple typique serait la fonction `sqrt()`. L'entrée +(l'argument) doit être un nombre et la valeur de retour (en fait, la sortie +) est la racine carrée de ce nombre. Exécuter une fonction (« l'exécuter ») +est appelé _appeler_ la fonction. Un exemple d'appel de fonction est : ```{r, eval=FALSE, purl=FALSE} b <- sqrt(a) ``` -Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function -calculates the square root, and returns the value which is then assigned to -the object `b`. This function is very simple, because it takes just one argument. +Ici, la valeur de `a` est donnée à la fonction `sqrt()`, la fonction `sqrt()` +calcule la racine carrée, et renvoie la valeur qui est ensuite attribuée à +l'objet 'b'. 
Cette fonction est très simple car elle ne prend qu’un seul argument. The return 'value' of a function need not be numerical (like that of `sqrt()`), and it also does not need to be a single item: it can be a set of things, or -even a dataset. We'll see that when we read data files into R. +even a dataset. Nous le verrons lorsque nous lirons des fichiers de données dans R. -Arguments can be anything, not only numbers or filenames, but also other -objects. Exactly what each argument means differs per function, and must be -looked up in the documentation (see below). Some functions take arguments which -may either be specified by the user, or, if left out, take on a _default_ value: -these are called _options_. Options are typically used to alter the way the -function operates, such as whether it ignores 'bad values', or what symbol to -use in a plot. However, if you want something specific, you can specify a value -of your choice which will be used instead of the default. +Les arguments peuvent être n'importe quoi, non seulement des nombres ou des noms de fichiers, mais aussi d'autres objets +. La signification exacte de chaque argument diffère selon la fonction et doit être +recherchée dans la documentation (voir ci-dessous). Certaines fonctions prennent des arguments qui +peuvent soit être spécifiés par l'utilisateur, soit, s'ils sont laissés de côté, prendre une valeur _par défaut_ : +ceux-ci sont appelés _options_. Les options sont généralement utilisées pour modifier le fonctionnement de la fonction +, par exemple si elle ignore les « mauvaises valeurs » ou quel symbole +utiliser dans un tracé. Cependant, si vous souhaitez quelque chose de spécifique, vous pouvez spécifier une valeur +de votre choix qui sera utilisée à la place de la valeur par défaut. -Let's try a function that can take multiple arguments: `round()`. +Essayons une fonction qui peut prendre plusieurs arguments : `round()`. 
```{r, results="show", purl=TRUE} -round(3.14159) +round(3.14159) ``` -Here, we've called `round()` with just one argument, `3.14159`, and it has -returned the value `3`. That's because the default is to round to the nearest -whole number. If we want more digits we can see how to do that by getting -information about the `round` function. We can use `args(round)` or look at the -help for this function using `?round`. +Ici, nous avons appelé `round()` avec un seul argument, `3.14159`, et il a +renvoyé la valeur `3`. En effet, la valeur par défaut est d'arrondir au nombre entier +le plus proche. Si nous voulons plus de chiffres, nous pouvons voir comment procéder en obtenant +des informations sur la fonction `round`. Nous pouvons utiliser `args(round)` ou consulter l'aide +pour cette fonction en utilisant `?round`. ```{r, results="show", purl=TRUE} -args(round) +args(round) ``` ```{r, eval=FALSE, purl=TRUE} -?round +?round ``` -We see that if we want a different number of digits, we can -type `digits=2` or however many we want. +Nous voyons que si nous voulons un nombre différent de chiffres, nous pouvons +taper `digits=2` ou autant que nous le voulons. ```{r, results="show", purl=TRUE} -round(3.14159, digits = 2) +round(3.14159, digits = 2) ``` -If you provide the arguments in the exact same order as they are defined you -don't have to name them: +Si vous fournissez les arguments exactement dans le même ordre que celui dans lequel ils sont définis, vous +n'avez pas besoin de les nommer : ```{r, results="show", purl=TRUE} -round(3.14159, 2) +round(3.14159, 2) ``` -And if you do name the arguments, you can switch their order: +Et si vous nommez les arguments, vous pouvez changer leur ordre : ```{r, results="show", purl=TRUE} -round(digits = 2, x = 3.14159) +round(digits = 2, x = 3.14159) ``` -It's good practice to put the non-optional arguments (like the number you're -rounding) first in your function call, and to specify the names of all optional -arguments.
If you don't, someone reading your code might have to look up the -definition of a function with unfamiliar arguments to understand what you're -doing. By specifying the name of the arguments you are also safeguarding -against possible future changes in the function interface, which may -potentially add new arguments in between the existing ones. +Il est recommandé de placer les arguments non facultatifs (comme le nombre que vous arrondissez +) en premier dans votre appel de fonction et de spécifier les noms de tous les arguments +facultatifs. Si vous ne le faites pas, quelqu'un qui lit votre code devra peut-être rechercher la définition +d'une fonction avec des arguments inconnus pour comprendre ce que vous faites +. En spécifiant le nom des arguments, vous protégez également +contre d'éventuelles modifications futures dans l'interface de la fonction, qui peuvent +potentiellement ajouter de nouveaux arguments entre ceux existants. -## Vectors and data types +## Vecteurs et types de données -A vector is the most common and basic data type in R, and is pretty much -the workhorse of R. A vector is composed by a series of values, such as -numbers or characters. We can assign a series of values to a vector using -the `c()` function. For example we can create a vector of animal weights and assign -it to a new object `weight_g`: +Un vecteur est le type de données le plus courant et le plus basique dans R, et est à peu près +le cheval de bataille de R. Un vecteur est composé d'une série de valeurs, telles que +nombres ou caractères. Nous pouvons attribuer une série de valeurs à un vecteur en utilisant +la fonction `c()`. 
Par exemple, nous pouvons créer un vecteur de poids d'animaux et l'attribuer +à un nouvel objet `weight_g` : ```{r, purl=TRUE} -weight_g <- c(50, 60, 65, 82) -weight_g +weight_g <- c(50, 60, 65, 82) +weight_g ``` -A vector can also contain characters: +Un vecteur peut également contenir des caractères : ```{r, purl=TRUE} -molecules <- c("dna", "rna", "protein") -molecules +molecules <- c("dna", "rna", "protein") +molecules ``` -The quotes around "dna", "rna", etc. are essential here. Without the -quotes R will assume there are objects called `dna`, `rna` and -`protein`. As these objects don't exist in R's memory, there will be -an error message. +Les guillemets autour de "dna", "rna", etc. sont ici essentiels. Sans les +guillemets, R supposera qu'il existe des objets appelés `dna`, `rna` et +`protein`. Comme ces objets n'existent pas dans la mémoire de R, il y aura +un message d'erreur. -There are many functions that allow you to inspect the content of a -vector. `length()` tells you how many elements are in a particular vector: +Il existe de nombreuses fonctions qui vous permettent d'inspecter le contenu d'un +vecteur. `length()` vous indique combien d'éléments se trouvent dans un vecteur particulier : ```{r, purl=TRUE} -length(weight_g) -length(molecules) +length(weight_g) +length(molecules) ``` -An important feature of a vector, is that all of the elements are the -same type of data. The function `class()` indicates the class (the -type of element) of an object: +Une caractéristique importante d'un vecteur est que tous les éléments sont du +même type de données. La fonction `class()` indique la classe (le type +d'élément) d'un objet : ```{r, purl=TRUE} -class(weight_g) -class(molecules) +class(weight_g) +class(molecules) ``` -The function `str()` provides an overview of the structure of an -object and its elements.
It is a useful function when working with -large and complex objects: +La fonction `str()` fournit un aperçu de la structure d'un objet +et de ses éléments. C'est une fonction utile lorsque vous travaillez avec +des objets volumineux et complexes : ```{r, purl=TRUE} -str(weight_g) -str(molecules) +str(poids_g) +str(molécules) ``` -You can use the `c()` function to add other elements to your vector: +Vous pouvez utiliser la fonction `c()` pour ajouter d'autres éléments à votre vecteur : ```{r} -weight_g <- c(weight_g, 90) # add to the end of the vector -weight_g <- c(30, weight_g) # add to the beginning of the vector -weight_g +poids_g <- c(poids_g, 90) # ajouter à la fin du vecteur +poids_g <- c(30, poids_g) # ajouter au début du vecteur +poids_g ``` -In the first line, we take the original vector `weight_g`, add the -value `90` to the end of it, and save the result back into -`weight_g`. Then we add the value `30` to the beginning, again saving -the result back into `weight_g`. +Dans la première ligne, nous prenons le vecteur d'origine `weight_g`, ajoutons la valeur +`90` à la fin de celui-ci et enregistrons le résultat dans +`weight_g`. Ensuite, nous ajoutons la valeur « 30 » au début, en enregistrant à nouveau +le résultat dans « weight_g ». -We can do this over and over again to grow a vector, or assemble a -dataset. As we program, this may be useful to add results that we are -collecting or calculating. +Nous pouvons faire cela encore et encore pour développer un vecteur ou assembler un ensemble de données +. Au fur et à mesure que nous programmons, cela peut être utile pour ajouter les résultats que nous +collectons ou calculons. -An **atomic vector** is the simplest R **data type** and is a linear -vector of a single type. Above, we saw 2 of the 6 main **atomic -vector** types that R uses: `"character"` and `"numeric"` (or -`"double"`). These are the basic building blocks that all R objects -are built from. 
The other 4 **atomic vector** types are: +Un **vecteur atomique** est le **type de données** R le plus simple et est un vecteur linéaire +d'un seul type. Ci-dessus, nous avons vu 2 des 6 principaux types de vecteurs \*\*atomiques +\*\* que R utilise : `"caractère"` et `"numérique"` (ou +`"double"`). Ce sont les éléments de base à partir desquels tous les objets R +sont construits. Les 4 autres types de **vecteurs atomiques** sont : -- `"logical"` for `TRUE` and `FALSE` (the boolean data type) -- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R - that it's an integer) -- `"complex"` to represent complex numbers with real and imaginary - parts (e.g., `1 + 4i`) and that's all we're going to say about them -- `"raw"` for bitstreams that we won't discuss further +- `"logique"` pour `TRUE` et `FALSE` (le type de données booléen) +- `"integer"` pour les nombres entiers (par exemple, `2L`, le `L` indique à R + que c'est un entier) +- `"complexe"` pour représenter des nombres complexes avec des parties réelles et imaginaires + (par exemple, `1 + 4i`) et c'est tout ce que nous allons dire à leur sujet +- `"raw"` pour les bitstreams dont nous ne parlerons pas davantage -You can check the type of your vector using the `typeof()` function -and inputting your vector as the argument. +Vous pouvez vérifier le type de votre vecteur en utilisant la fonction `typeof()` +et en saisissant votre vecteur comme argument. -Vectors are one of the many **data structures** that R uses. Other -important ones are lists (`list`), matrices (`matrix`), data frames -(`data.frame`), factors (`factor`) and arrays (`array`). +Les vecteurs sont l'une des nombreuses **structures de données** utilisées par R. Les autres +importants sont les listes (`list`), les matrices (`matrix`), les trames de données +(`data.frame`), les facteurs (`factor`) et les tableaux (`array` ). 

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-We've seen that atomic vectors can be of type character, numeric (or
-double), integer, and logical. But what happens if we try to mix
-these types in a single vector?
+Nous avons vu que les vecteurs atomiques peuvent être de type caractère, numérique (ou
+double), entier et logique. Mais que se passe-t-il si nous essayons de mélanger
+ces types dans un seul vecteur ?

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

-R implicitly converts them to all be the same type
+R les convertit implicitement pour qu'ils soient tous du même type

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-What will happen in each of these examples? (hint: use `class()` to
-check the data type of your objects and type in their names to see what happens):
+Que se passera-t-il dans chacun de ces exemples ? (indice : utilisez `class()` pour
+vérifier le type de données de vos objets et tapez leurs noms pour voir ce qui se passe) :

```{r, eval=TRUE}
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE, FALSE)
-char_logical <- c("a", "b", "c", TRUE)
-tricky <- c(1, 2, 3, "4")
+char_logical <- c("a", " b", "c", VRAI)
+délicat <- c(1, 2, 3, "4")
```

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

@@ -392,149 +392,149 @@ tricky

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-Why do you think it happens?
+Pourquoi pensez-vous que cela arrive ?

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

-Vectors can be of only one data type.
R tries to convert (coerce)
-the content of this vector to find a _common denominator_ that
-doesn't lose any information.
+Les vecteurs ne peuvent appartenir qu’à un seul type de données. R essaie de convertir (contraindre)
+le contenu de ce vecteur pour trouver un _dénominateur commun_ qui
+ne perd aucune information.

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-How many values in `combined_logical` are `"TRUE"` (as a character)
-in the following example:
+Combien de valeurs dans `combined_logical` sont `"TRUE"` (sous forme de caractère)
+dans l'exemple suivant :

```{r, eval=TRUE}
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
-combined_logical <- c(num_logical, char_logical)
+combiné_logique <- c(num_logical, char_logical )
```

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

-Only one. There is no memory of past data types, and the coercion
-happens the first time the vector is evaluated. Therefore, the `TRUE`
-in `num_logical` gets converted into a `1` before it gets converted
-into `"1"` in `combined_logical`.
+Seulement un. Il n'y a pas de mémoire des types de données passés et la coercition
+se produit la première fois que le vecteur est évalué. Par conséquent, le `TRUE`
+dans `num_logical` est converti en `1` avant d'être converti
+en `"1"` dans `combined_logical`.

```{r}
-combined_logical
+combiné_logique
```

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-In R, we call converting objects from one class into another class
-_coercion_.
These conversions happen according to a hierarchy,
-whereby some types get preferentially coerced into other types. Can
-you draw a diagram that represents the hierarchy of how these data
-types are coerced?
+Dans R, nous appelons la conversion d'objets d'une classe vers une autre classe
+_coercition_. Ces conversions se produisent selon une hiérarchie,
+selon laquelle certains types sont préférentiellement contraints vers d'autres types. Pouvez-vous
+dessiner un diagramme qui représente la hiérarchie de la façon dont ces types de données
+sont forcés ?

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

-logical → numeric → character ← logical
+logique → numérique → caractère ← logique

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

```{r, echo=FALSE, eval=FALSE, purl=TRUE}
-## We've seen that atomic vectors can be of type character, numeric, integer, and
-## logical. But what happens if we try to mix these types in a single
-## vector?
+## Nous avons vu que les vecteurs atomiques peuvent être de type caractère, numérique, entier et
+## logique. Mais que se passe-t-il si nous essayons de mélanger ces types dans un seul vecteur
+## ?

-## What will happen in each of these examples? (hint: use `class()` to
-## check the data type of your object)
+## Que va-t-il se passer dans chacun de ces exemples ? (indice : utilisez `class()` pour
+## vérifier le type de données de votre objet)
num_char <- c(1, 2, 3, "a")

num_logical <- c(1, 2, 3, TRUE)

char_logical <- c("a", "b", "c", TRUE)

-tricky <- c(1, 2, 3, "4")
+délicat <- c(1, 2 , 3, "4")

-## Why do you think it happens?
+## Pourquoi pensez-vous que cela arrive ?

-## You've probably noticed that objects of different types get
-## converted into a single, shared type within a vector. In R, we call
-## converting objects from one class into another class
-## _coercion_.
These conversions happen according to a hierarchy,
-## whereby some types get preferentially coerced into other types. Can
-## you draw a diagram that represents the hierarchy of how these data
-## types are coerced?
+## Vous avez probablement remarqué que des objets de types différents sont
+## convertis en un seul type partagé au sein d'un vecteur. Dans R, nous appelons
+## convertir des objets d'une classe en une autre classe
+## _coercion_. Ces conversions se produisent selon une hiérarchie,
+## selon laquelle certains types sont préférentiellement contraints vers d'autres types. Pouvez-vous
+## dessiner un diagramme qui représente la hiérarchie de la façon dont ces types de données
+## sont forcés ?
```

-## Subsetting vectors
+## Vecteurs de sous-ensemble

-If we want to extract one or several values from a vector, we must
-provide one or several indices in square brackets. For instance:
+Si l'on veut extraire une ou plusieurs valeurs d'un vecteur, il faut
+fournir un ou plusieurs indices entre crochets. Par exemple:

```{r, results="show", purl=TRUE}
-molecules <- c("dna", "rna", "peptide", "protein")
-molecules[2]
-molecules[c(3, 2)]
+molécules <- c("ADN", "arn", "peptide", "protéine")
+molécules[2]
+molécules[c(3, 2)]
```

-We can also repeat the indices to create an object with more elements
-than the original one:
+On peut également répéter les indices pour créer un objet avec plus d'éléments
+que celui d'origine :

```{r, results="show", purl=TRUE}
-more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
+more_molecules <- molécules[c(1, 2, 3, 2, 1, 4)]
more_molecules
```

-R indices start at 1. Programming languages like Fortran, MATLAB,
-Julia, and R start counting at 1, because that's what human beings
-typically do. Languages in the C family (including C++, Java, Perl,
-and Python) count from 0 because that's simpler for computers to do.
+Les indices R commencent à 1.
Les langages de programmation comme Fortran, MATLAB,
+Julia et R commencent à compter à 1, car c'est ce que font généralement les êtres humains
+. Les langages de la famille C (y compris C++, Java, Perl,
+et Python) comptent à partir de 0 car c'est plus simple à faire pour les ordinateurs.

-Finally, it is also possible to get all the elements of a vector
-except some specified elements using negative indices:
+Enfin, il est également possible d'obtenir tous les éléments d'un vecteur
+sauf certains éléments spécifiés en utilisant des indices négatifs :

```{r}
-molecules ## all molecules
-molecules[-1] ## all but the first one
-molecules[-c(1, 3)] ## all but 1st/3rd ones
-molecules[c(-1, -3)] ## all but 1st/3rd ones
+molécules ## toutes les molécules
+molécules[-1] ## toutes sauf la première
+molécules[-c(1, 3)] ## toutes sauf les 1ère/3ème
+molécules[c(-1, -3)] ## toutes sauf les 1ère/3ème
```

-## Conditional subsetting
+## Sous-ensemble conditionnel

-Another common way of subsetting is by using a logical vector. `TRUE` will
-select the element with the same index, while `FALSE` will not:
+Une autre méthode courante de sous-ensemble consiste à utiliser un vecteur logique. `TRUE`
+sélectionnera l'élément avec le même index, tandis que `FALSE` ne le fera pas :

```{r, purl=TRUE}
-weight_g <- c(21, 34, 39, 54, 55)
-weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
+poids_g <- c(21, 34, 39, 54, 55)
+poids_g[c(VRAI, FAUX, VRAI, VRAI, FAUX)]
```

-Typically, these logical vectors are not typed by hand, but are the
-output of other functions or logical tests. For instance, if you
-wanted to select only the values above 50:
+Généralement, ces vecteurs logiques ne sont pas tapés à la main, mais sont la sortie
+d'autres fonctions ou tests logiques.
Par exemple, si vous
+souhaitez sélectionner uniquement les valeurs supérieures à 50 :

```{r, purl=TRUE}
## will return logicals with TRUE for the indices that meet
@@ -544,24 +544,24 @@ weight_g > 50
weight_g[weight_g > 50]
```

-You can combine multiple tests using `&` (both conditions are true,
-AND) or `|` (at least one of the conditions is true, OR):
+Vous pouvez combiner plusieurs tests en utilisant `&` (les deux conditions sont vraies,
+AND) ou `|` (au moins une des conditions est vraie, OR) :

```{r, results="show", purl=TRUE}
-weight_g[weight_g < 30 | weight_g > 50]
-weight_g[weight_g >= 30 & weight_g == 21]
+poids_g[poids_g < 30 | poids_g > 50]
+poids_g[poids_g >= 30 & poids_g == 21]
```

-Here, `<` stands for "less than", `>` for "greater than", `>=` for
-"greater than or equal to", and `==` for "equal to". The double equal
+Ici, `<` signifie "inférieur à", `>` pour "supérieur à", `>=` pour
+"supérieur ou égal à" et `==` pour "égal à". The double equal
sign `==` is a test for numerical equality between the left and right
hand sides, and should not be confused with the single `=` sign, which
performs variable assignment (similar to `<-`).

-A common task is to search for certain strings in a vector. One could
-use the "or" operator `|` to test for equality to multiple values, but
-this can quickly become tedious. The function `%in%` allows you to
-test if any of the elements of a search vector are found:
+Une tâche courante consiste à rechercher certaines chaînes dans un vecteur. On pourrait
+utiliser l'opérateur "ou" `|` pour tester l'égalité de plusieurs valeurs, mais
+cela peut rapidement devenir fastidieux.
La fonction `%in%` permet de
+tester si l'un des éléments d'un vecteur de recherche est trouvé :

```{r, purl=TRUE}
molecules <- c("dna", "rna", "protein", "peptide")
@@ -570,60 +570,60 @@ molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")
molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]
```

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-Can you figure out why `"four" > "five"` returns `TRUE`?
+Pouvez-vous comprendre pourquoi « quatre » > « cinq » renvoie « VRAI » ?

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

```{r}
-"four" > "five"
+"quatre" > "cinq"
```

-When using `>` or `<` on strings, R compares their alphabetical order.
-Here `"four"` comes after `"five"`, and therefore is _greater than_
-it.
+Lorsque vous utilisez `>` ou `<` sur des chaînes, R compare leur ordre alphabétique.
+Ici, `"quatre"` vient après `"cinq"`, et est donc _supérieur à_
+.

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-## Names
+## Des noms

-It is possible to name each element of a vector. The code chunk below
-shows an initial vector without any names, how names are set, and
-retrieved.
+Il est possible de nommer chaque élément d'un vecteur. Le morceau de code ci-dessous
+montre un vecteur initial sans aucun nom, comment les noms sont définis et
+récupérés.

```{r}
x <- c(1, 5, 3, 5, 10)
-names(x) ## no names
-names(x) <- c("A", "B", "C", "D", "E")
-names(x) ## now we have names
+noms(x) ## pas de noms
+noms(x) <- c("A", "B", " C", "D", "E")
+noms(x) ## maintenant nous avons des noms
```

-When a vector has names, it is possible to access elements by their
-name, in addition to their index.
+Lorsqu'un vecteur possède des noms, il est possible d'accéder aux éléments par leur nom
+, en plus de leur index.

```{r}
x[c(1, 3)]
x[c("A", "C")]
```

-## Missing data
+## Données manquantes

-As R was designed to analyze datasets, it includes the concept of
-missing data (which is uncommon in other programming
-languages). Missing data are represented in vectors as `NA`.
+Comme R a été conçu pour analyser des ensembles de données, il inclut le concept de
+données manquantes (ce qui est rare dans d'autres langages de programmation
+). Les données manquantes sont représentées dans les vecteurs par « NA ».

-When doing operations on numbers, most functions will return `NA` if
-the data you are working with include missing values. This feature
-makes it harder to overlook the cases where you are dealing with
-missing data. You can add the argument `na.rm = TRUE` to calculate
-the result while ignoring the missing values.
+Lorsque vous effectuez des opérations sur des nombres, la plupart des fonctions renverront « NA » si
+les données avec lesquelles vous travaillez incluent des valeurs manquantes. Cette fonctionnalité
+rend plus difficile l'ignorance des cas où vous avez affaire à
+données manquantes. Vous pouvez ajouter l'argument `na.rm = TRUE` pour calculer
+le résultat en ignorant les valeurs manquantes.

```{r}
heights <- c(2, 4, 4, NA, 6)
@@ -633,292 +633,292 @@ mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
```

-If your data include missing values, you may want to become familiar
-with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See
-below for examples.
+Si vos données incluent des valeurs manquantes, vous souhaiterez peut-être vous familiariser
+avec les fonctions `is.na()`, `na.omit()` et `complete.cases()`. Voir
+ci-dessous pour des exemples.

```{r}
-## Extract those elements which are not missing values.
+## Extrayez les éléments pour lesquels il ne manque pas de valeurs.
heights[!is.na(heights)]

-## Returns the object with incomplete cases removed.
-## The returned object is an atomic vector of type `"numeric"`
-## (or `"double"`).
+## Renvoie l'objet avec les cas incomplets supprimés.
+## L'objet retourné est un vecteur atomique de type `"numeric"`
+## (ou `"double"`).
na.omit(heights)

-## Extract those elements which are complete cases.
-## The returned object is an atomic vector of type `"numeric"`
-## (or `"double"`).
-heights[complete.cases(heights)]
+## Extrayez les éléments qui sont des cas complets.
+## L'objet retourné est un vecteur atomique de type `"numeric"`
+## (ou `"double"`).
+hauteurs[complete.cases(heights)]
```

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-1. Using this vector of heights in inches, create a new vector with the NAs removed.
+1. En utilisant ce vecteur de hauteurs en pouces, créez un nouveau vecteur en supprimant les NA.

```{r}
-heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
+hauteurs <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
```

-2. Use the function `median()` to calculate the median of the `heights` vector.
-3. Use R to figure out how many people in the set are taller than 67 inches.
+2. Utilisez la fonction `median()` pour calculer la médiane du vecteur `heights`.
+3. Utilisez R pour déterminer combien de personnes dans l’ensemble mesurent plus de 67 pouces.

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

```{r, purl=TRUE}
heights_no_na <- heights[!is.na(heights)]
-## or
+## ou
heights_no_na <- na.omit(heights)
```

```{r, purl=TRUE}
-median(heights, na.rm = TRUE)
+médiane (hauteurs, na.rm = TRUE)
```

```{r, purl=TRUE}
-heights_above_67 <- heights_no_na[heights_no_na > 67]
-length(heights_above_67)
+hauteurs_above_67 <- heights_no_na[heights_no_na > 67]
+longueur(heights_above_67)
```

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-## Generating vectors {#sec:genvec}
+## Génération de vecteurs {#sec:genvec}

```{r, echo=FALSE}
set.seed(1)
```

-### Constructors
+### Constructeurs

-There exists some functions to generate vectors of different type. To
-generate a vector of numerics, one can use the `numeric()`
-constructor, providing the length of the output vector as
-parameter. The values will be initialised with 0.
+Il existe quelques fonctions pour générer des vecteurs de différents types. Pour
+générer un vecteur de valeurs numériques, on peut utiliser le constructeur `numeric()`
+, fournissant la longueur du vecteur de sortie comme paramètre
+. Les valeurs seront initialisées à 0.

```{r, purl=TRUE}
-numeric(3)
-numeric(10)
+numérique(3)
+numérique(10)
```

-Note that if we ask for a vector of numerics of length 0, we obtain
-exactly that:
+Notez que si l'on demande un vecteur de numériques de longueur 0, on obtient
+exactement cela :

```{r, purl=TRUE}
-numeric(0)
+numérique(0)
```

-There are similar constructors for characters and logicals, named
-`character()` and `logical()` respectively.
+Il existe des constructeurs similaires pour les caractères et les logiques, nommés respectivement
+`character()` et `logical()`.

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-What are the defaults for character and logical vectors?
+Quelles sont les valeurs par défaut pour les caractères et les vecteurs logiques ?

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

```{r, purl=TRUE}
-character(2) ## the empty character
-logical(2) ## FALSE
+caractère(2) ## le caractère vide
+logique(2) ## FALSE
```

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-### Replicate elements
+### Répliquer des éléments

-The `rep` function allow to repeat a value a certain number of
-times. If we want to initiate a vector of numerics of length 5 with
-the value -1, for example, we could do the following:
+La fonction `rep` permet de répéter une valeur un certain nombre de
+fois. Si nous voulons initier un vecteur de numériques de longueur 5 avec
+la valeur -1, par exemple, nous pourrions faire ce qui suit :

```{r, purl=TRUE}
-rep(-1, 5)
+représentant(-1, 5)
```

-Similarly, to generate a vector populated with missing values, which
-is often a good way to start, without setting assumptions on the data
-to be collected:
+De même, pour générer un vecteur rempli de valeurs manquantes, ce qui
+est souvent une bonne façon de commencer, sans poser d'hypothèses sur les données
+à collecter :

```{r, purl=TRUE}
-rep(NA, 5)
+représentant(NA, 5)
```

-`rep` can take vectors of any length as input (above, we used vectors
-of length 1) and any type. For example, if we want to repeat the
-values 1, 2 and 3 five times, we would do the following:
+`rep` peut prendre en entrée des vecteurs de n'importe quelle longueur (ci-dessus, nous avons utilisé des vecteurs
+de longueur 1) et de n'importe quel type.
Par exemple, si nous voulons répéter cinq fois les valeurs
+1, 2 et 3, nous procéderions comme suit :

```{r, purl=TRUE}
-rep(c(1, 2, 3), 5)
+représentant(c(1, 2, 3), 5)
```

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-What if we wanted to repeat the values 1, 2 and 3 five times, but
-obtain five 1s, five 2s and five 3s in that order? There are two
-possibilities - see `?rep` or `?sort` for help.
+Et si nous voulions répéter les valeurs 1, 2 et 3 cinq fois, mais que
+obtenait cinq 1, cinq 2 et cinq 3 dans cet ordre ? Il existe deux
+possibilités - voir `?rep` ou `?sort` pour obtenir de l'aide.

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

```{r, purl=TRUE}
-rep(c(1, 2, 3), each = 5)
+rep(c(1, 2, 3), chacun = 5)
sort(rep(c(1, 2, 3), 5))
```

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-### Sequence generation
+### Génération de séquence

-Another very useful function is `seq`, to generate a sequence of
-numbers. For example, to generate a sequence of integers from 1 to 20
-by steps of 2, one would use:
+Une autre fonction très utile est `seq`, pour générer une séquence de
+nombres.
Par exemple, pour générer une séquence d'entiers de 1 à 20
+par pas de 2, on utiliserait :

```{r, purl=TRUE}
-seq(from = 1, to = 20, by = 2)
+séq(de = 1, à = 20, par = 2)
```

-The default value of `by` is 1 and, given that the generation of a
-sequence of one value to another with steps of 1 is frequently used,
-there's a shortcut:
+La valeur par défaut de `by` est 1 et, étant donné que la génération d'une séquence
+d'une valeur à une autre avec des pas de 1 est fréquemment utilisée,
+il existe un raccourci :

```{r, purl=TRUE}
seq(1, 5, 1)
-seq(1, 5) ## default by
+seq(1, 5) ## par défaut par
1:5
```

-To generate a sequence of numbers from 1 to 20 of final length of 3,
-one would use:
+Pour générer une séquence de nombres de 1 à 20 de longueur finale de 3,
+on utiliserait :

```{r, purl=TRUE}
-seq(from = 1, to = 20, length.out = 3)
+seq (de = 1, à = 20, longueur.out = 3)
```

-### Random samples and permutations
+### Échantillons aléatoires et permutations

-A last group of useful functions are those that generate random
-data. The first one, `sample`, generates a random permutation of
-another vector. For example, to draw a random order to 10 students
-oral exam, I first assign each student a number from 1 to ten (for
-instance based on the alphabetic order of their name) and then:
+Un dernier groupe de fonctions utiles sont celles qui génèrent des données aléatoires
+. Le premier, `sample`, génère une permutation aléatoire de
+un autre vecteur. Par exemple, pour tirer au sort un ordre aléatoire de 10 étudiants
+à l'examen oral, j'attribue d'abord à chaque étudiant un numéro de 1 à dix (par exemple
+en fonction de l'ordre alphabétique de son nom) puis :

```{r, purl=TRUE}
-sample(1:10)
+échantillon (1:10)
```

-Without further arguments, `sample` will return a permutation of all
-elements of the vector. If I want a random sample of a certain size, I
-would set this value as the second argument.
Below, I sample 5 random
-letters from the alphabet contained in the pre-defined `letters` vector:
+Sans autres arguments, `sample` renverra une permutation de tous les
+éléments du vecteur. Si je veux un échantillon aléatoire d'une certaine taille, je
+définirais cette valeur comme deuxième argument. Ci-dessous, j'échantillonne 5
+lettres aléatoires de l'alphabet contenu dans le vecteur `letters` prédéfini :

```{r, purl=TRUE}
-sample(letters, 5)
+échantillon(lettres, 5)
```

-If I wanted an output larger than the input vector, or being able to
-draw some elements multiple times, I would need to set the `replace`
-argument to `TRUE`:
+Si je voulais une sortie plus grande que le vecteur d'entrée, ou pouvoir
+dessiner certains éléments plusieurs fois, je devrais définir l'argument `replace`
+sur `TRUE` :

```{r, purl=TRUE}
-sample(1:5, 10, replace = TRUE)
+échantillon (1:5, 10, remplacer = VRAI)
```

-::::::::::::::::::::::::::::::::::::::: challenge
+::::::::::::::::::::::::::::::::::::::: défi

-## Challenge:
+## Défi:

-When trying the functions above out, you will have realised that the
-samples are indeed random and that one doesn't get the same
-permutation twice. To be able to reproduce these random draws, one can
-set the random number generation seed manually with `set.seed()`
-before drawing the random sample.
+En essayant les fonctions ci-dessus, vous aurez réalisé que les échantillons
+sont effectivement aléatoires et qu'on n'obtient pas deux fois la même permutation
+. Pour pouvoir reproduire ces tirages aléatoires, on peut
+définir manuellement la graine de génération de nombres aléatoires avec `set.seed()`
+avant de tirer l'échantillon aléatoire.

-Test this feature with your neighbour. First draw two random
-permutations of `1:10` independently and observe that you get
-different results.
+Testez cette fonctionnalité avec votre voisin.
Dessinez d'abord deux permutations aléatoires
+de « 1:10 » indépendamment et observez que vous obtenez
+résultats différents.

-Now set the seed with, for example, `set.seed(123)` and repeat the
-random draw. Observe that you now get the same random draws.
+Définissez maintenant la graine avec, par exemple, `set.seed(123)` et répétez le tirage au sort
+. Observez que vous obtenez désormais les mêmes tirages au sort.

-Repeat by setting a different seed.
+Répétez en définissant une graine différente.

-::::::::::::::: solution
+::::::::::::::: solution

## Solution

-Different permutations
+Différentes permutations

```{r, purl=TRUE}
-sample(1:10)
-sample(1:10)
+échantillon (1:10)
+échantillon (1:10)
```

-Same permutations with seed 123
+Mêmes permutations avec la graine 123

```{r, purl=TRUE}
set.seed(123)
-sample(1:10)
+échantillon(1:10)
set.seed(123)
-sample(1:10)
+échantillon(1:10)
```

-A different seed
+Une graine différente

```{r, purl=TRUE}
set.seed(1)
-sample(1:10)
+échantillon(1:10)
set.seed(1)
-sample(1:10)
+échantillon(1:10)
```

:::::::::::::::::::::::::

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

-### Drawing samples from a normal distribution
+### Extraire des échantillons à partir d'une distribution normale

-The last function we are going to see is `rnorm`, that draws a random
-sample from a normal distribution. Two normal distributions of means 0
-and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and
-_N(100, 5)_, are shown below.
+La dernière fonction que nous allons voir est `rnorm`, qui tire un échantillon aléatoire
+à partir d'une distribution normale. Deux distributions normales de moyennes 0
+et 100 et d'écarts types 1 et 5, notées _N(0, 1)_ et
+_N(100, 5)_, sont présentées ci-dessous.

```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."}
par(mfrow = c(1, 2))
plot(density(rnorm(1000)), main = "", sub = "N(0, 1)")
-plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)")
+plot(densité (rnorm(1000, 100, 5)), principal = "", sous = "N(100, 5)")
```

-The three arguments, `n`, `mean` and `sd`, define the size of the
-sample, and the parameters of the normal distribution, i.e the mean
-and its standard deviation. The defaults of the latter are 0 and 1.
+Les trois arguments, `n`, `mean` et `sd`, définissent la taille de l'échantillon
+, et les paramètres de la distribution normale, c'est-à-dire la moyenne
+et son écart type. Les valeurs par défaut de ce dernier sont 0 et 1.

```{r, purl=TRUE}
-rnorm(5)
-rnorm(5, 2, 2)
-rnorm(5, 100, 5)
+rnorme(5)
+rnorme(5, 2, 2)
+rnorme(5, 100, 5)
```

-Now that we have learned how to write scripts, and the basics of R's
-data structures, we are ready to start working with larger data, and
-learn about data frames.
+Maintenant que nous avons appris à écrire des scripts et les bases des structures de données
+de R, nous sommes prêts à commencer à travailler avec des données plus volumineuses et à
+en apprendre davantage sur les trames de données.

-:::::::::::::::::::::::::::::::::::::::: keypoints
+:::::::::::::::::::::::::::::::::::::::: points clés

-- How to interact with R
+- Comment interagir avec R

-::::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::

From 5b4cbb73a3cefdd96931217b8d7f018ed0f543d3 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Thu, 1 Aug 2024 00:07:29 +0900
Subject: [PATCH 198/334] New translations 23-starting-with-r.md (Chinese Simplified)

---
 locale/zh/episodes/23-starting-with-r.Rmd | 937 +++++++++++-----------
 1 file changed, 469 insertions(+), 468 deletions(-)

diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd
index 410e507fd..a043e2203 100644
--- a/locale/zh/episodes/23-starting-with-r.Rmd
+++ b/locale/zh/episodes/23-starting-with-r.Rmd
@@ -1,6 +1,6 @@
---
-source: Rmd
-title: Introduction to R
+source: 放射科
+title: R 简介
teaching: 60
exercises: 60
---
@@ -8,365 +8,366 @@ exercises: 60
```{r, include=FALSE}
```

-::::::::::::::::::::::::::::::::::::::: objectives
+:::::::::::::::::::::::::::::::::::::::::: 目标

-- Define the following terms as they relate to R: object, assign, call, function, arguments, options.
-- Assign values to objects in R.
-- Learn how to _name_ objects
-- Use comments to inform script.
-- Solve simple arithmetic operations in R.
-- Call functions and use arguments to change their default options.
-- Inspect the content of vectors and manipulate their content.
-- Subset and extract values from vectors.
-- Analyze vectors with missing data.
+- 定义与 R 相关的以下术语:对象、分配、调用、函数、参数、选项。
+- 为 R 中的对象分配值。
+- 学习如何命名物体
+- 使用注释来告知脚本。
+- 解决 R 中的简单算术运算。
+- 调用函数并使用参数来改变其默认选项。
+- 检查向量的内容并操作其内容。
+- 从向量中取子集并提取值。
+- 分析缺失数据的向量。

-::::::::::::::::::::::::::::::::::::::::::::::::::
+:::::::::::::::::::::::::::::::::::::::::::::::::::::

-:::::::::::::::::::::::::::::::::::::::: questions
+:::::::::::::::::::::::::::::::::::::::: 问题

-- First commands in R
+- R 中的第一个命令

-::::::::::::::::::::::::::::::::::::::::::::::::::
+:::::::::::::::::::::::::::::::::::::::::::::::::::::

-> This episode is based on the Data Carpentries's _Data Analysis and
-> Visualisation in R for Ecologists_ lesson.
+> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和
+> 可视化_课程。

-## Creating objects in R
+## 在 R 中创建对象

-You can get output from R simply by typing math in the console:
+只需在控制台中输入数学即可获得 R 的输出:

```{r, purl=TRUE}
3 + 5
12 / 7
```

-However, to do useful and interesting things, we need to assign _values_ to
-_objects_. To create an object, we need to give it a name followed by the
-assignment operator `<-`, and the value we want to give it:
+然而,为了做一些有用和有趣的事情,我们需要为
+_对象_分配\*值。 要创建一个对象,我们需要给它一个名字,后跟
+赋值运算符`<-`,以及我们想要赋予它的值:

```{r, purl=TRUE}
-weight_kg <- 55
+体重_kg <- 55
```

-`<-` is the assignment operator. It assigns values on the right to
-objects on the left. So, after executing `x <- 3`, the value of `x` is
-`3`. The arrow can be read as 3 **goes into** `x`. For historical
-reasons, you can also use `=` for assignments, but not in every
-context. Because of the
-[slight differences](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html)
-in syntax, it is good practice to always use `<-` for assignments.
+`<-` 是赋值运算符。 它将右侧的值分配给左侧的
+对象。 因此,执行 `x <- 3` 后,`x` 的值为
+`3`。 箭头可以读作 3 **进入** `x`。 由于历史
+原因,您也可以使用 `=` 进行赋值,但并非在每个
+上下文中都是如此。 由于语法上存在
+[细微差别](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html)
+,因此在赋值时始终使用 `<-` 是一种很好的做法。

-In RStudio, typing <kbd>Alt</kbd> + <kbd>\-</kbd> (push <kbd>Alt</kbd>
-at the same time as the <kbd>\-</kbd> key) will write `<-` in a single
-keystroke in a PC, while typing <kbd>Option</kbd> + <kbd>\-</kbd> (push <kbd>Option</kbd> at the same time as the <kbd>\-</kbd> key) does the
-same in a Mac.
+在 RStudio 中,输入 <kbd>Alt</kbd> + <kbd>\-</kbd> (同时按下 <kbd>Alt</kbd>
+和 <kbd>\-</kbd> 键)将在 PC 上的单个
+按键中写入 `<-`,而输入 <kbd>Option</kbd> + <kbd>\-</kbd> (同时按下 <kbd>Option</kbd> 和 <kbd>\-</kbd> 键)则不会在单个
+按键中写入 `<-`。
+在 Mac 上也一样。

-### Naming variables
+### 命名变量

-Objects can be given any name such as `x`, `current_temperature`, or
-`subject_id`. You want your object names to be explicit and not too
-long. They cannot start with a number (`2x` is not valid, but `x2`
-is). R is case sensitive (e.g., `weight_kg` is different from
-`Weight_kg`). There are some names that cannot be used because they
-are the names of fundamental functions in R (e.g., `if`, `else`,
-`for`, see
-[here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
-for a complete list). In general, even if it's allowed, it's best to
+对象可以被赋予任何名称,例如“x”、“current_temperature”或
+“subject_id”。 You want your object names to be explicit and not too
+long. 它们不能以数字开头(`2x` 无效,但 `x2`
+有效)。 R 区分大小写(例如,`weight_kg` 与
+`Weight_kg` 不同)。 有些名称不能使用,因为它们
+是 R 中基本函数的名称(例如,`if`、`else`、
+`for`,请参阅
+[此处](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
+了解完整列表)。 In general, even if it's allowed, it's best to
not use other function names (e.g., `c`, `T`, `mean`, `data`, `df`,
-`weights`). If in doubt, check the help to see if the name is already
-in use. It's also best to avoid dots (`.`) within an object name as in
-`my.dataset`.
There are many functions in R with dots in their names -for historical reasons, but because dots have a special meaning in R -(for methods) and other programming languages, it's best to avoid -them. It is also recommended to use nouns for object names, and verbs -for function names. It's important to be consistent in the styling of -your code (where you put spaces, how you name objects, etc.). Using a -consistent coding style makes your code clearer to read for your -future self and your collaborators. In R, some popular style guides -are [Google's](https://google.github.io/styleguide/Rguide.xml), the -[tidyverse's](https://style.tidyverse.org/) style and the Bioconductor -style -guide. The +`weights`). 如果有疑问,请检查帮助以查看该名称是否已被 +使用。 最好避免在对象名称中使用点(“。”),如 +“my.dataset”。 由于历史原因,R 中有许多函数名称中带有点 +,但由于点在 R +(用于方法)和其他编程语言中具有特殊含义,因此最好避免使用 +它们。 还建议使用名词作为对象名称,使用动词 +作为函数名称。 保持代码 +样式的一致性(空格的位置、对象命名方式等)非常重要。 使用 +一致的编码风格可以让您的代码更清晰地供您 +未来的自己和您的合作者阅读。 在 R 中,一些流行的风格指南 +是 [Google 的](https://google.github.io/styleguide/Rguide.xml)、 +[tidyverse 的](https://style.tidyverse.org/) 风格和 Bioconductor +风格 +指南。 The tidyverse's is very comprehensive and may seem overwhelming at -first. You can install the -[**`lintr`**](https://github.com/jimhester/lintr) package to -automatically check for issues in the styling of your code. +first. 您可以安装 +[**`lintr`**](https://github.com/jimhester/lintr) 包来 +自动检查代码样式中的问题。 > **Objects vs. variables**: What are known as `objects` in `R` are > known as `variables` in many other programming languages. Depending > on the context, `object` and `variable` can have drastically > different meanings. However, in this lesson, the two words are used -> synonymously. For more information -> [see here.](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) +> synonymously. 有关更多信息 +> [请参阅此处](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects) -When assigning a value to an object, R does not print anything. 
You -can force R to print the value by using parentheses or by typing the -object name: +当为一个对象分配值时,R 不会打印任何内容。 您 +可以使用括号或键入 +对象名称来强制 R 打印该值: ```{r, purl=TRUE} -weight_kg <- 55 # doesn't print anything -(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` -weight_kg # and so does typing the name of the object +weight_kg <- 55 # 不打印任何内容 +(weight_kg <- 55) # 但是在调用周围加上括号会打印 `weight_kg` 的值 +weight_kg # 输入对象的名称也会打印该值 ``` -Now that R has `weight_kg` in memory, we can do arithmetic with it. For -instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg): +现在 R 的内存中有了 `weight_kg`,我们可以用它进行算术运算。 例如, +我们可能希望将这个重量转换为磅(以磅为单位的重量是以公斤为单位的重量的 2.2 倍): ```{r, purl=TRUE} -2.2 * weight_kg +2.2 * weight_kg ``` -We can also change an object's value by assigning it a new one: +我们还可以通过分配新值来更改对象的值: ```{r, purl=TRUE} -weight_kg <- 57.5 -2.2 * weight_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` -This means that assigning a value to one object does not change the values of -other objects For example, let's store the animal's weight in pounds in a new -object, `weight_lb`: +这意味着为一个对象分配一个值不会改变 +其他对象的值。 例如,让我们将动物的体重(以磅为单位)存储在一个新的 +对象 `weight_lb` 中: ```{r, purl=TRUE} -weight_lb <- 2.2 * weight_kg +weight_lb <- 2.2 * weight_kg ``` -and then change `weight_kg` to 100. +然后将 `weight_kg` 改为 100。 ```{r} -weight_kg <- 100 +weight_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -What do you think is the current content of the object `weight_lb`? -126\.5 or 220? +您认为对象 `weight_lb` 的当前内容是什么? +126\.5 还是 220? -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Comments +## 注释 -The comment character in R is `#`, anything to the right of a `#` in a -script will be ignored by R. It is useful to leave notes, and -explanations in your scripts. 
+R 中的注释字符是 `#`, +脚本中 `#` 右边的任何内容都将被 R 忽略。在脚本中留下注释和 +解释很有用。 -RStudio makes it easy to comment or uncomment a paragraph: after -selecting the lines you want to comment, press at the same time on -your keyboard <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. If -you only want to comment out one line, you can put the cursor at any -location of that line (i.e. no need to select the whole line), then -press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. +RStudio 可以轻松注释或取消注释一个段落:在 +选择要注释的行后,同时按下键盘上 +上的 <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>。 如果 +你只想注释掉一行,你可以将光标放在该行的任意 +位置(即不需要选择整行),然后 +按 <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -What are the values after each statement in the following? +以下每个语句后面的值是什么? ```{r, purl=TRUE} -mass <- 47.5 # mass? -age <- 122 # age? -mass <- mass * 2.0 # mass? -age <- age - 20 # age? -mass_index <- mass/age # mass_index? +mass <- 47.5 # 质量? +age <- 122 # 年龄? +mass <- mass * 2.0 # 质量? +age <- age - 20 # 年龄? +mass_index <- mass/age # 质量指数? ``` -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Functions and their arguments +## 函数及其参数 -Functions are "canned scripts" that automate more complicated sets of commands -including operations assignments, etc. Many functions are predefined, or can be -made available by importing R _packages_ (more on that later). A function -usually gets one or more inputs called _arguments_. Functions often (but not -always) return a _value_. A typical example would be the function `sqrt()`. The -input (the argument) must be a number, and the return value (in fact, the -output) is the square root of that number. Executing a function ('running it') -is called _calling_ the function. 
An example of a function call is: +函数是“固定脚本”,可以自动执行更复杂的命令集 +,包括操作分配等。 Many functions are predefined, or can be +made available by importing R _packages_ (more on that later). 函数 +通常会获得一个或多个称为 _参数_ 的输入。 函数通常(但并非总是 +)返回一个 _值_。 一个典型的例子是函数“sqrt()”。 +输入(参数)必须是一个数字,返回值(实际上是 +输出)是该数字的平方根。 执行一个函数(“运行它”) +被称为_调用_该函数。 函数调用的一个示例是: ```{r, eval=FALSE, purl=FALSE} -b <- sqrt(a) +b <- sqrt (a) ``` -Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function -calculates the square root, and returns the value which is then assigned to -the object `b`. This function is very simple, because it takes just one argument. +这里,将 `a` 的值传递给 `sqrt()` 函数,`sqrt()` 函数 +计算平方根,并返回该值,然后将该值赋值给 +对象 `b`。 这个函数非常简单,因为它只接受一个参数。 -The return 'value' of a function need not be numerical (like that of `sqrt()`), -and it also does not need to be a single item: it can be a set of things, or -even a dataset. We'll see that when we read data files into R. +函数的返回“值”不必是数字(如 `sqrt()` 的值), +并且它也不必是单个项:它可以是一组事物,或者 +甚至是一个数据集。 当我们将数据文件读入 R 时,我们就会看到这一点。 -Arguments can be anything, not only numbers or filenames, but also other -objects. Exactly what each argument means differs per function, and must be -looked up in the documentation (see below). Some functions take arguments which -may either be specified by the user, or, if left out, take on a _default_ value: -these are called _options_. Options are typically used to alter the way the -function operates, such as whether it ignores 'bad values', or what symbol to -use in a plot. However, if you want something specific, you can specify a value -of your choice which will be used instead of the default. +参数可以是任何东西,不仅是数字或文件名,还可以是其他 +对象。 每个参数的具体含义因函数而异,必须在文档中查找 +(见下文)。 一些函数接受的参数 +可以由用户指定,或者,如果省略,则采用_默认_值: +,这些被称为_选项_。 选项通常用于改变 +函数的运行方式,例如是否忽略“坏值”,或者 +在图中使用什么符号。 但是,如果您想要一些特定的东西,您可以指定一个您选择的值 +来代替默认值。 -Let's try a function that can take multiple arguments: `round()`. 
+让我们尝试一个可以接受多个参数的函数:“round()”。 ```{r, results="show", purl=TRUE} -round(3.14159) +圆形(3.14159) ``` -Here, we've called `round()` with just one argument, `3.14159`, and it has -returned the value `3`. That's because the default is to round to the nearest -whole number. If we want more digits we can see how to do that by getting -information about the `round` function. We can use `args(round)` or look at the -help for this function using `?round`. +在这里,我们仅用一个参数“3.14159”调用了“round()”,并且它 +返回了值“3”。 这是因为默认设置是四舍五入到最接近的 +整数。 如果我们想要更多的数字,我们可以通过获取有关“round”函数的 +信息来了解如何做到这一点。 我们可以使用“args(round)”或者使用“?round”查看此函数的 +帮助。 ```{r, results="show", purl=TRUE} -args(round) +参数(圆形) ``` ```{r, eval=FALSE, purl=TRUE} -?round +?圆形的 ``` -We see that if we want a different number of digits, we can -type `digits=2` or however many we want. +我们看到,如果我们想要不同数量的数字,我们可以 +输入“digits=2”或任意我们想要的数字。 ```{r, results="show", purl=TRUE} -round(3.14159, digits = 2) +四舍五入(3.14159,数字 = 2) ``` -If you provide the arguments in the exact same order as they are defined you -don't have to name them: +如果您按照定义参数的完全相同的顺序提供参数,则 +不必命名它们: ```{r, results="show", purl=TRUE} -round(3.14159, 2) +圆形(3.14159,2) ``` -And if you do name the arguments, you can switch their order: +如果你确实命名了参数,你可以切换它们的顺序: ```{r, results="show", purl=TRUE} -round(digits = 2, x = 3.14159) +四舍五入(数字 = 2,x = 3.14159) ``` -It's good practice to put the non-optional arguments (like the number you're -rounding) first in your function call, and to specify the names of all optional -arguments. If you don't, someone reading your code might have to look up the -definition of a function with unfamiliar arguments to understand what you're -doing. By specifying the name of the arguments you are also safeguarding -against possible future changes in the function interface, which may -potentially add new arguments in between the existing ones. 
+很好的做法是,在函数调用中将非可选参数(比如 +四舍五入的数字)放在第一位,并指定所有可选 +参数的名称。 如果你不这样做,阅读你代码的人可能必须查找具有不熟悉参数的函数的 +定义才能理解你在 +做什么。 通过指定参数的名称,您还可以保护 +免受函数接口将来可能发生的变化的影响,这些变化可能会 +在现有参数之间添加新参数。 -## Vectors and data types +## 向量和数据类型 -A vector is the most common and basic data type in R, and is pretty much -the workhorse of R. A vector is composed by a series of values, such as -numbers or characters. We can assign a series of values to a vector using -the `c()` function. For example we can create a vector of animal weights and assign -it to a new object `weight_g`: +向量是 R 中最常见、最基本的数据类型,基本上是 +R 的主力。向量由一系列值组成,例如 +数字或字符。 我们可以使用 +`c()` 函数为向量分配一系列值。 例如,我们可以创建一个动物体重向量,并将 +分配给一个新的对象“weight_g”: ```{r, purl=TRUE} -weight_g <- c(50, 60, 65, 82) -weight_g +权重_g <- c(50, 60, 65, 82) +权重_g ``` -A vector can also contain characters: +向量也可以包含字符: ```{r, purl=TRUE} -molecules <- c("dna", "rna", "protein") -molecules +分子 <- c("dna", "rna", "蛋白质") +分子 ``` -The quotes around "dna", "rna", etc. are essential here. Without the -quotes R will assume there are objects called `dna`, `rna` and -`protein`. As these objects don't exist in R's memory, there will be -an error message. +这里“dna”、“rna”等周围的引号至关重要。 如果没有 +引号,R 将假定存在名为 `dna`、`rna` 和 +`protein` 的对象。 由于这些对象在 R 的内存中不存在,因此会出现 +错误消息。 -There are many functions that allow you to inspect the content of a -vector. `length()` tells you how many elements are in a particular vector: +有许多函数可让您检查 +向量的内容。 `length()` 告诉你特定向量中有多少个元素: ```{r, purl=TRUE} -length(weight_g) -length(molecules) +长度(重量_g) +长度(分子) ``` -An important feature of a vector, is that all of the elements are the -same type of data. The function `class()` indicates the class (the -type of element) of an object: +向量的一个重要特征是,所有元素都是 +相同类型的数据。 函数 `class()` 表示对象的类(元素的 +类型): ```{r, purl=TRUE} -class(weight_g) -class(molecules) +类别(权重_g) +类别(分子) ``` -The function `str()` provides an overview of the structure of an -object and its elements. 
It is a useful function when working with -large and complex objects: +函数“str()”概述了 +对象及其元素的结构。 在处理 +大型复杂对象时,它是一个很有用的函数: ```{r, purl=TRUE} -str(weight_g) -str(molecules) +str(重量_g) +str(分子) ``` -You can use the `c()` function to add other elements to your vector: +您可以使用 `c()` 函数将其他元素添加到向量中: ```{r} -weight_g <- c(weight_g, 90) # add to the end of the vector -weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g <- c(weight_g, 90) # 添加到向量末尾 +weight_g <- c(30, weight_g) # 添加到向量开头 weight_g ``` -In the first line, we take the original vector `weight_g`, add the -value `90` to the end of it, and save the result back into -`weight_g`. Then we add the value `30` to the beginning, again saving +在第一行中,我们取原始向量“weight_g”,将 +值“90”添加到其末尾,然后将结果保存回 +“weight_g”。 Then we add the value `30` to the beginning, again saving the result back into `weight_g`. -We can do this over and over again to grow a vector, or assemble a -dataset. As we program, this may be useful to add results that we are -collecting or calculating. +我们可以反复这样做来增加一个向量,或者组装一个 +数据集。 在我们编程时,这可能有助于添加我们正在 +收集或计算的结果。 -An **atomic vector** is the simplest R **data type** and is a linear -vector of a single type. Above, we saw 2 of the 6 main **atomic -vector** types that R uses: `"character"` and `"numeric"` (or -`"double"`). These are the basic building blocks that all R objects -are built from. 
The other 4 **atomic vector** types are: +**原子向量**是最简单的 R **数据类型**,是单一类型的线性 +向量。 上面,我们看到了 R 使用的 6 个主要**原子 +向量**类型中的 2 个:“字符”和“数字”(或 +“双精度”)。 这些是构建所有 R 对象 +的基本构建块。 其他 4 种 **原子向量** 类型是: -- `"logical"` for `TRUE` and `FALSE` (the boolean data type) -- `"integer"` for integer numbers (e.g., `2L`, the `L` indicates to R - that it's an integer) -- `"complex"` to represent complex numbers with real and imaginary - parts (e.g., `1 + 4i`) and that's all we're going to say about them -- `"raw"` for bitstreams that we won't discuss further +- `“逻辑”` 表示 `TRUE` 和 `FALSE`(布尔数据类型) +- `“integer”` 表示整数(例如 `2L`,`L` 向 R + 表示它是一个整数) +- `“complex”` 表示具有实部和虚部 + 的复数(例如 `1 + 4i`),这就是我们要说的 +- “raw” 表示比特流,我们不会进一步讨论 -You can check the type of your vector using the `typeof()` function -and inputting your vector as the argument. +您可以使用 `typeof()` 函数 +并输入您的向量作为参数来检查您的向量的类型。 -Vectors are one of the many **data structures** that R uses. Other -important ones are lists (`list`), matrices (`matrix`), data frames -(`data.frame`), factors (`factor`) and arrays (`array`). +向量是 R 使用的众多**数据结构**之一。 其他 +重要的是列表(`list`)、矩阵(`matrix`)、数据框 +(`data.frame`)、因子(`factor`)和数组(`array`)。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -We've seen that atomic vectors can be of type character, numeric (or -double), integer, and logical. But what happens if we try to mix -these types in a single vector? +我们已经看到,原子向量可以是字符类型、数字类型(或 +双精度型)、整数类型和逻辑类型。 但是如果我们尝试在一个向量中混合 +这些类型会发生什么? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -R implicitly converts them to all be the same type +R 隐式地将它们全部转换为同一类型 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -What will happen in each of these examples? 
(hint: use `class()` to -check the data type of your objects and type in their names to see what happens): +每个例子中会发生什么? (提示:使用 `class()` 来 +检查对象的数据类型并输入其名称以查看会发生什么): ```{r, eval=TRUE} num_char <- c(1, 2, 3, "a") @@ -375,49 +376,49 @@ char_logical <- c("a", "b", "c", TRUE) tricky <- c(1, 2, 3, "4") ``` -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, purl=TRUE} -class(num_char) +类(num_char) num_char -class(num_logical) +类(num_logical) num_logical -class(char_logical) +类(char_logical) char_logical -class(tricky) +类(tricky) tricky ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Why do you think it happens? +您认为为什么会发生这种情况? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Vectors can be of only one data type. R tries to convert (coerce) -the content of this vector to find a _common denominator_ that -doesn't lose any information. +向量只能是一种数据类型。 R 尝试转换(强制) +该向量的内容以找到 +不会丢失任何信息的 _共同点_。 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -How many values in `combined_logical` are `"TRUE"` (as a character) -in the following example: +以下示例中,`combined_logical` 中有多少个值为 `“TRUE”`(作为字符) +: ```{r, eval=TRUE} num_logical <- c(1, 2, 3, TRUE) @@ -425,42 +426,42 @@ char_logical <- c("a", "b", "c", TRUE) combined_logical <- c(num_logical, char_logical) ``` -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Only one. There is no memory of past data types, and the coercion -happens the first time the vector is evaluated. 
Therefore, the `TRUE` -in `num_logical` gets converted into a `1` before it gets converted -into `"1"` in `combined_logical`. +只有一个。 没有过去数据类型的记忆,并且强制 +发生在第一次评估向量时。 因此,`num_logical` 中的 `TRUE` +在 `combined_logical` 中的 +转换为 `"1"` 之前,会先转换为 `1`。 ```{r} -combined_logical +组合逻辑 ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -In R, we call converting objects from one class into another class -_coercion_. These conversions happen according to a hierarchy, -whereby some types get preferentially coerced into other types. Can +在 R 中,我们将对象从一个类转换为另一个类称为 +_强制_。 这些转换根据层次结构进行, +,某些类型优先被强制转换为其他类型。 Can you draw a diagram that represents the hierarchy of how these data types are coerced? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -logical → numeric → character ← logical +逻辑 → 数字 → 字符 ← 逻辑 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} ## We've seen that atomic vectors can be of type character, numeric, integer, and @@ -488,290 +489,290 @@ tricky <- c(1, 2, 3, "4") ## types are coerced? ``` -## Subsetting vectors +## 向量子集 -If we want to extract one or several values from a vector, we must -provide one or several indices in square brackets. 
For instance: +如果我们想从一个向量中提取一个或多个值,我们必须 +在方括号中提供一个或多个索引。 例如: ```{r, results="show", purl=TRUE} -molecules <- c("dna", "rna", "peptide", "protein") -molecules[2] -molecules[c(3, 2)] +分子 <- c("dna", "rna", "肽", "蛋白质") +分子[2] +分子[c(3, 2)] ``` -We can also repeat the indices to create an object with more elements -than the original one: +我们还可以重复索引来创建一个比原始对象具有更多元素 +的对象: ```{r, results="show", purl=TRUE} -more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] -more_molecules +更多分子 <- 分子[c(1, 2, 3, 2, 1, 4)] +更多分子 ``` -R indices start at 1. Programming languages like Fortran, MATLAB, -Julia, and R start counting at 1, because that's what human beings -typically do. Languages in the C family (including C++, Java, Perl, -and Python) count from 0 because that's simpler for computers to do. +R 索引从 1 开始。 Fortran、MATLAB、 +Julia 和 R 等编程语言从 1 开始计数,因为这是人类 +通常所做的。 C 系列语言(包括 C++、Java、Perl、 +和 Python)从 0 开始计数,因为这对于计算机来说更简单。 -Finally, it is also possible to get all the elements of a vector -except some specified elements using negative indices: +最后,还可以使用负索引获取向量 +的所有元素,除了一些指定元素: ```{r} -molecules ## all molecules -molecules[-1] ## all but the first one -molecules[-c(1, 3)] ## all but 1st/3rd ones -molecules[c(-1, -3)] ## all but 1st/3rd ones +分子 ## 所有分子 +分子[-1] ## 除第一个之外的所有分子 +分子[-c(1, 3)] ## 除第 1/3 个之外的所有分子 +分子[c(-1, -3)] ## 除第 1/3 个之外的所有分子 ``` -## Conditional subsetting +## 条件子集 -Another common way of subsetting is by using a logical vector. `TRUE` will -select the element with the same index, while `FALSE` will not: +另一种常见的子集方法是使用逻辑向量。 `TRUE` 将 +选择具有相同索引的元素,而 `FALSE` 则不会: ```{r, purl=TRUE} weight_g <- c(21, 34, 39, 54, 55) weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] ``` -Typically, these logical vectors are not typed by hand, but are the -output of other functions or logical tests. 
For instance, if you -wanted to select only the values above 50: +通常,这些逻辑向量不是手工输入的,而是其他函数或逻辑测试的 +输出。 例如,如果您 +只想选择 50 以上的值: ```{r, purl=TRUE} -## will return logicals with TRUE for the indices that meet -## the condition +## 将返回满足 +条件的索引的逻辑值为 TRUE ## 条件 weight_g > 50 -## so we can use this to select only the values above 50 +## 因此我们可以使用它来仅选择高于 50 的值 weight_g[weight_g > 50] ``` -You can combine multiple tests using `&` (both conditions are true, -AND) or `|` (at least one of the conditions is true, OR): +您可以使用 `&`(两个条件都为真, +AND)或 `|`(至少有一个条件为真,OR)组合多个测试: ```{r, results="show", purl=TRUE} -weight_g[weight_g < 30 | weight_g > 50] -weight_g[weight_g >= 30 & weight_g == 21] +权重_g[权重_g < 30 | 权重_g > 50] +权重_g[权重_g >= 30 & 权重_g == 21] ``` -Here, `<` stands for "less than", `>` for "greater than", `>=` for -"greater than or equal to", and `==` for "equal to". The double equal -sign `==` is a test for numerical equality between the left and right -hand sides, and should not be confused with the single `=` sign, which -performs variable assignment (similar to `<-`). +这里,`<` 代表“小于”,`>` 代表“大于”,`>=` 代表 +“大于或等于”,`==` 代表“等于”。 双等号 +符号 `==` 用于测试左右两边 +之间是否数值相等,不要与单个 `=` 符号混淆,后者 +执行变量赋值(类似于 `<-`)。 -A common task is to search for certain strings in a vector. One could -use the "or" operator `|` to test for equality to multiple values, but -this can quickly become tedious. 
The function `%in%` allows you to -test if any of the elements of a search vector are found: +一个常见的任务是在向量中搜索某些字符串。 人们可以 +使用“或”运算符 `|` 来测试多个值是否相等,但是 +这很快就会变得乏味。 函数 `%in%` 允许您 +测试是否找到搜索向量的任何元素: ```{r, purl=TRUE} molecules <- c("dna", "rna", "protein", "peptide") -molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna +molecules[molecules == "rna" |molecules == "dna"] # 返回 rna 和 dna molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Can you figure out why `"four" > "five"` returns `TRUE`? +你能弄清楚为什么 `"four" > "five"` 返回 `TRUE` 吗? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r} -"four" > "five" +“四” > “五” ``` -When using `>` or `<` on strings, R compares their alphabetical order. -Here `"four"` comes after `"five"`, and therefore is _greater than_ -it. +在字符串上使用 `>` 或 `<` 时,R 会比较它们的字母顺序。 +这里的“four”位于“five”之后,因此大于 +它。 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Names +## 名字 -It is possible to name each element of a vector. The code chunk below -shows an initial vector without any names, how names are set, and -retrieved. +可以命名向量的每个元素。 下面的代码块 +显示了没有任何名称的初始向量,如何设置名称,以及如何检索 +。 ```{r} x <- c(1, 5, 3, 5, 10) -names(x) ## no names +names(x) ## 没有名字 names(x) <- c("A", "B", "C", "D", "E") -names(x) ## now we have names +names(x) ## 现在我们有名字了 ``` -When a vector has names, it is possible to access elements by their -name, in addition to their index. +当向量具有名称时,除了索引之外,还可以通过其 +名称来访问元素。 ```{r} x[c(1, 3)] x[c("A", "C")] ``` -## Missing data +## 缺失数据 -As R was designed to analyze datasets, it includes the concept of -missing data (which is uncommon in other programming -languages). Missing data are represented in vectors as `NA`. 
+由于 R 旨在分析数据集,因此它包含 +缺失数据的概念(这在其他编程 +语言中并不常见)。 缺失数据在向量中表示为“NA”。 When doing operations on numbers, most functions will return `NA` if -the data you are working with include missing values. This feature -makes it harder to overlook the cases where you are dealing with -missing data. You can add the argument `na.rm = TRUE` to calculate -the result while ignoring the missing values. +the data you are working with include missing values. 此功能 +使您更难忽视处理 +缺失数据的情况。 您可以添加参数“na.rm = TRUE”来计算 +结果,同时忽略缺失值。 ```{r} -heights <- c(2, 4, 4, NA, 6) -mean(heights) -max(heights) -mean(heights, na.rm = TRUE) -max(heights, na.rm = TRUE) +高度 <- c(2, 4, 4, NA, 6) +平均值(高度) +最大值(高度) +平均值(高度, na.rm = TRUE) +最大值(高度, na.rm = TRUE) ``` If your data include missing values, you may want to become familiar -with the functions `is.na()`, `na.omit()`, and `complete.cases()`. See -below for examples. +with the functions `is.na()`, `na.omit()`, and `complete.cases()`. 请参阅下文 +中的示例。 ```{r} -## Extract those elements which are not missing values. +## 提取那些不是缺失值的元素。 heights[!is.na(heights)] -## Returns the object with incomplete cases removed. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). +## 返回删除了不完整案例的对象。 +## 返回的对象是类型为 `"numeric"` 的原子向量 +## (或 `"double"`)。 na.omit(heights) -## Extract those elements which are complete cases. -## The returned object is an atomic vector of type `"numeric"` -## (or `"double"`). +## 提取那些完整案例的元素。 +## 返回的对象是类型为 `"numeric"` 的原子向量 +## (或 `"double"`)。 heights[complete.cases(heights)] ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -1. Using this vector of heights in inches, create a new vector with the NAs removed. +1. 使用这个以英寸为单位的高度向量,创建一个删除了 NA 的新向量。 ```{r} -heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +高度 <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) ``` -2. 
Use the function `median()` to calculate the median of the `heights` vector. -3. Use R to figure out how many people in the set are taller than 67 inches. +2. 使用函数 `median()` 计算 `heights` 向量的中位数。 +3. 使用 R 找出集合中有多少人的身高超过 67 英寸。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=TRUE} heights_no_na <- heights[!is.na(heights)] -## or +## 或 heights_no_na <- na.omit(heights) ``` ```{r, purl=TRUE} -median(heights, na.rm = TRUE) +median(heights, na.rm = TRUE) ``` ```{r, purl=TRUE} -heights_above_67 <- heights_no_na[heights_no_na > 67] -length(heights_above_67) +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Generating vectors {#sec:genvec} +## 生成向量 {#sec:genvec} ```{r, echo=FALSE} -set.seed(1) +set.seed(1) ``` -### Constructors +### 构造函数 -There exists some functions to generate vectors of different type. To -generate a vector of numerics, one can use the `numeric()` -constructor, providing the length of the output vector as -parameter. The values will be initialised with 0. +有一些函数可以生成不同类型的向量。 要 +生成一个数字向量,可以使用 `numeric()` +构造函数,并将输出向量的长度作为 +参数。 这些值将被初始化为 0。 ```{r, purl=TRUE} -numeric(3) -numeric(10) +numeric(3) +numeric(10) ``` -Note that if we ask for a vector of numerics of length 0, we obtain -exactly that: +请注意,如果我们请求一个长度为 0 的数字向量,得到的 +正是这样一个向量: ```{r, purl=TRUE} -numeric(0) +numeric(0) ``` -There are similar constructors for characters and logicals, named -`character()` and `logical()` respectively. +字符和逻辑值有类似的构造函数,分别名为 +`character()` 和 `logical()`。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -What are the defaults for character and logical vectors? +字符和逻辑向量的默认值是什么? 
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=TRUE} -character(2) ## the empty character -logical(2) ## FALSE +character(2) ## 空字符串 +logical(2) ## FALSE ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Replicate elements +### 复制元素 -The `rep` function allow to repeat a value a certain number of -times. If we want to initiate a vector of numerics of length 5 with -the value -1, for example, we could do the following: +`rep` 函数可以将一个值重复一定的 +次数。 例如,如果我们想要用 +值 -1 来初始化一个长度为 5 的数字向量,我们可以执行以下操作: ```{r, purl=TRUE} -rep(-1, 5) +rep(-1, 5) ``` -Similarly, to generate a vector populated with missing values, which -is often a good way to start, without setting assumptions on the data -to be collected: +类似地,要生成一个由缺失值填充的向量(这 +通常是一个很好的起点,因为无需对将要收集的数据 +做任何假设): ```{r, purl=TRUE} -rep(NA, 5) +rep(NA, 5) ``` -`rep` can take vectors of any length as input (above, we used vectors -of length 1) and any type. For example, if we want to repeat the -values 1, 2 and 3 five times, we would do the following: +`rep` 可以接受任意长度(上面我们使用的是长度为 1 的向量)、 +任意类型的向量作为输入。 例如,如果我们想将 +值 1、2 和 3 重复五次,可以执行以下操作: ```{r, purl=TRUE} -rep(c(1, 2, 3), 5) +rep(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: -What if we wanted to repeat the values 1, 2 and 3 five times, but -obtain five 1s, five 2s and five 3s in that order? There are two -possibilities - see `?rep` or `?sort` for help. +如果我们想将值 1、2 和 3 重复五次,但 +要依次得到五个 1、五个 2 和五个 3,该怎么办? 
有两种 +可能性 - 可查看 `?rep` 或 `?sort` 获取帮助。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=TRUE} rep(c(1, 2, 3), each = 5) @@ -780,132 +781,132 @@ sort(rep(c(1, 2, 3), 5)) ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Sequence generation +### 序列生成 -Another very useful function is `seq`, to generate a sequence of -numbers. For example, to generate a sequence of integers from 1 to 20 -by steps of 2, one would use: +另一个非常有用的函数是 `seq`,用于生成一个 +数字序列。 例如,为了以 2 为步长生成从 1 到 20 +的整数序列,可以使用: ```{r, purl=TRUE} -seq(from = 1, to = 20, by = 2) +seq(from = 1, to = 20, by = 2) ``` -The default value of `by` is 1 and, given that the generation of a -sequence of one value to another with steps of 1 is frequently used, -there's a shortcut: +`by` 的默认值为 1;鉴于以 1 为步长生成从一个值到另一个值的 +序列非常常用, +因此有一个快捷方式: ```{r, purl=TRUE} seq(1, 5, 1) -seq(1, 5) ## default by +seq(1, 5) ## 使用默认的 by 1:5 ``` -To generate a sequence of numbers from 1 to 20 of final length of 3, -one would use: +要生成从 1 到 20、最终长度为 3 的数字序列, +可以使用: ```{r, purl=TRUE} -seq(from = 1, to = 20, length.out = 3) +seq(from = 1, to = 20, length.out = 3) ``` -### Random samples and permutations +### 随机样本和排列 -A last group of useful functions are those that generate random -data. The first one, `sample`, generates a random permutation of -another vector. For example, to draw a random order to 10 students -oral exam, I first assign each student a number from 1 to ten (for -instance based on the alphabetic order of their name) and then: +最后一组有用的函数是那些生成随机 +数据的函数。 第一个是 `sample`,它生成另一个向量的 +随机排列。 例如,为了给 10 名学生的 +口试随机排序,我首先为每个学生分配一个从 1 到 10 的数字(例如 +根据他们姓名的字母顺序),然后: ```{r, purl=TRUE} -sample(1:10) +sample(1:10) ``` -Without further arguments, `sample` will return a permutation of all -elements of the vector. If I want a random sample of a certain size, I -would set this value as the second argument. 
Below, I sample 5 random -letters from the alphabet contained in the pre-defined `letters` vector: +如果没有其他参数,`sample` 将返回向量中所有 +元素的一个排列。 如果我想要一个特定大小的随机样本,我 +会将该大小设置为第二个参数。 下面,我从预定义的 `letters` 向量所包含的字母表中随机抽取 5 +个字母: ```{r, purl=TRUE} -sample(letters, 5) +sample(letters, 5) ``` -If I wanted an output larger than the input vector, or being able to -draw some elements multiple times, I would need to set the `replace` -argument to `TRUE`: +如果我想要一个大于输入向量的输出,或者能够 +多次抽取某些元素,我需要将 `replace` +参数设置为 `TRUE`: ```{r, purl=TRUE} -sample(1:5, 10, replace = TRUE) +sample(1:5, 10, replace = TRUE) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -When trying the functions above out, you will have realised that the -samples are indeed random and that one doesn't get the same -permutation twice. To be able to reproduce these random draws, one can +在尝试上述函数时,您会发现 +样本确实是随机的,并且不会两次得到相同的 +排列。 To be able to reproduce these random draws, one can set the random number generation seed manually with `set.seed()` before drawing the random sample. -Test this feature with your neighbour. First draw two random -permutations of `1:10` independently and observe that you get -different results. +和你的邻座一起测试此功能。 首先各自独立地抽取 `1:10` 的两个随机 +排列,并观察你们得到的是 +不同的结果。 -Now set the seed with, for example, `set.seed(123)` and repeat the -random draw. Observe that you now get the same random draws. +现在设置种子,例如 `set.seed(123)`,并重复 +随机抽取。 可以观察到,现在得到了相同的随机抽取结果。 -Repeat by setting a different seed. 
+通过设置不同的种子来重复。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Different permutations +不同的排列 ```{r, purl=TRUE} -sample(1:10) -sample(1:10) +sample(1:10) +sample(1:10) ``` -Same permutations with seed 123 +使用种子 123 得到相同的排列 ```{r, purl=TRUE} -set.seed(123) -sample(1:10) -set.seed(123) -sample(1:10) +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) ``` -A different seed +不同的种子 ```{r, purl=TRUE} -set.seed(1) -sample(1:10) -set.seed(1) -sample(1:10) +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Drawing samples from a normal distribution +### 从正态分布中抽取样本 -The last function we are going to see is `rnorm`, that draws a random -sample from a normal distribution. Two normal distributions of means 0 -and 100 and standard deviations 1 and 5, noted _N(0, 1)_ and -_N(100, 5)_, are shown below. +我们将要看到的最后一个函数是`rnorm`,它从正态分布中抽取一个随机的 +样本。 下面显示了两个均值 0 +和 100、标准差 1 和 5 的正态分布,分别记为 _N(0, 1)_ 和 +_N(100, 5)_。 ```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} par(mfrow = c(1, 2)) -plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") -plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") ``` -The three arguments, `n`, `mean` and `sd`, define the size of the -sample, and the parameters of the normal distribution, i.e the mean -and its standard deviation. The defaults of the latter are 0 and 1. +三个参数 `n`、`mean` 和 `sd` 定义了 +样本的大小,以及正态分布的参数,即平均值 +及其标准差。 后者的默认值为0和1。 ```{r, purl=TRUE} rnorm(5) @@ -913,12 +914,12 @@ rnorm(5, 2, 2) rnorm(5, 100, 5) ``` -Now that we have learned how to write scripts, and the basics of R's -data structures, we are ready to start working with larger data, and -learn about data frames.
+现在我们已经学习了如何编写脚本,以及 R 的 +数据结构的基础知识,我们已经准备好开始处理更大的数据,并且 +了解数据框。 -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: 关键点 -- How to interact with R +- 如何与 R 交互 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 91506136bd1e3e47e87f26fc1bc12cca10fc551b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:31 +0900 Subject: [PATCH 199/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 804 +++++++++---------- 1 file changed, 402 insertions(+), 402 deletions(-) diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd index 8506d99ee..411f2d942 100644 --- a/locale/fr/episodes/25-starting-with-data.Rmd +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Starting with data +title: Commencer par les données teaching: 30 exercises: 30 --- @@ -8,115 +8,115 @@ exercises: 30 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +::::::::::::::::::::::::::::::::::::::: objectifs -- Describe what a `data.frame` is. -- Load external data from a .csv file into a data frame. -- Summarize the contents of a data frame. -- Describe what a factor is. -- Convert between strings and factors. -- Reorder and rename factors. -- Format dates. -- Export and save data. +- Décrivez ce qu'est un « data.frame ». +- Chargez des données externes à partir d'un fichier .csv dans un bloc de données. +- Résumer le contenu d'un bloc de données. +- Décrivez ce qu'est un facteur. +- Convertissez entre les chaînes et les facteurs. +- Réorganisez et renommez les facteurs. +- Formater les dates. +- Exportez et enregistrez les données. 
:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- First data analysis in R +- Première analyse de données dans R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Analyse des données et +> Visualisation dans R pour les écologistes_ de Data Carpentries. -## Presentation of the gene expression data +## Présentation des données d'expression des gènes -We are going to use part of the data published by Blackmore , _The -effect of upper-respiratory infection on transcriptomic changes in the -CNS_. The goal of the study was to determine the effect of an -upper-respiratory infection on changes in RNA transcription occurring -in the cerebellum and spinal cord post infection. Gender matched eight +Nous allons utiliser une partie des données publiées par Blackmore , _L'effet +de l'infection des voies respiratoires supérieures sur les modifications transcriptomiques du +SNC_. Le but de l'étude était de déterminer l'effet d'une infection des voies respiratoires supérieures +sur les modifications de la transcription de l'ARN se produisant +dans le cervelet et la moelle épinière après l'infection. Gender matched eight week old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8. -The dataset is stored as a comma-separated values (CSV) file. 
Each row -holds information for a single RNA expression measurement, and the first eleven -columns represent: - -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | -| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | -| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | - -We are going to use the R function `download.file()` to download the -CSV file that contains the gene expression data, and we will use -`read.csv()` to load into memory the content of the CSV file as an -object of class `data.frame`. Inside the `download.file` command, the -first entry is a character string with the source URL. This source URL -downloads a CSV file from a GitHub repository. The text after the -comma (`"data/rnaseq.csv"`) is the destination of the file on your -local machine. You'll need to have a folder on your machine called -`"data"` where you'll download the file. So this command downloads the -remote file, names it `"rnaseq.csv"` and adds it to a preexisting -folder named `"data"`. +L'ensemble de données est stocké sous forme de fichier CSV (valeurs séparées par des virgules). 
Chaque ligne +contient des informations pour une seule mesure d'expression d'ARN, et les onze premières colonnes +représentent : + +| Colonne | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------ | +| gène | Le nom du gène qui a été mesuré | +| échantillon | Le nom de l’échantillon dans lequel l’expression du gène a été mesurée | +| expression | La valeur de l'expression des gènes | +| organisme | L'organisme/l'espèce - ici toutes les données proviennent de souris | +| âge | L'âge de la souris (toutes les souris avaient 8 semaines ici) | +| sexe | Le sexe de la souris | +| infection | L'état d'infection de la souris, c'est-à-dire infectée par la grippe A ou non infectée. | +| souche | La souche grippale A. | +| temps | La durée de l'infection (en jours). | +| tissu | Le tissu utilisé pour l'expérience d'expression génique, c'est-à-dire le cervelet ou la moelle épinière. | +| souris | L'identifiant unique de la souris. | + +Nous allons utiliser la fonction R `download.file()` pour télécharger le fichier CSV +qui contient les données d'expression génique, et nous utiliserons +`read.csv()` pour charger en mémoire le contenu du fichier CSV en tant qu'objet +de classe `data.frame`. Dans la commande `download.file`, la première entrée +est une chaîne de caractères avec l'URL source. Cette URL source +télécharge un fichier CSV à partir d'un référentiel GitHub. Le texte après la virgule +("data/rnaseq.csv"`) est la destination du fichier sur votre machine locale +. Vous aurez besoin d'un dossier sur votre ordinateur appelé +`"data"`dans lequel vous téléchargerez le fichier. Cette commande télécharge donc le fichier distant +, le nomme`"rnaseq.csv"`et l'ajoute à un dossier +préexistant nommé`"data"\`. 
```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") + destfile = "data/rnaseq.csv") ``` -You are now ready to load the data: +Vous êtes maintenant prêt à charger les données : ```{r, eval=TRUE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -This statement doesn't produce any output because, as you might -recall, assignments don't display anything. If we want to check that -our data has been loaded, we can see the contents of the data frame by -typing its name: +Cette instruction ne produit aucune sortie car, comme vous vous en souvenez peut-être, +les affectations n'affichent rien. Si nous voulons vérifier que +nos données ont été chargées, nous pouvons voir le contenu du bloc de données +en tapant son nom : ```{r, eval=FALSE} -rna +rna ``` -Wow... that was a lot of output. At least it means the data loaded -properly. Let's check the top (the first 6 lines) of this data frame -using the function `head()`: +Ouah... cela fait beaucoup de sortie. Au moins, cela signifie que les données ont été chargées +correctement. Vérifions le haut (les 6 premières lignes) de ce bloc de données +en utilisant la fonction `head()` : ```{r, purl=TRUE} head(rna) -## Try also +## Essayez aussi ## View(rna) ``` **Note** -`read.csv()` assumes that fields are delineated by commas, however, in -several countries, the comma is used as a decimal separator and the -semicolon (;) is used as a field delineator. If you want to read in -this type of files in R, you can use the `read.csv2()` function. It -behaves exactly like `read.csv()` but uses different parameters for -the decimal and the field separators. If you are working with another -format, they can be both specified by the user. Check out the help for -`read.csv()` by typing `?read.csv` to learn more. There is also the -`read.delim()` function for reading tab separated data files.
It is important to -note that all of these functions are actually wrapper functions for -the main `read.table()` function with different arguments. As such, -the data above could have also been loaded by using `read.table()` -with the separation argument as `,`. The code is as follows: +`read.csv()` suppose que les champs sont délimités par des virgules, cependant, dans +plusieurs pays, la virgule est utilisée comme séparateur décimal et le +point-virgule (;) est utilisé comme champ délinéateur. Si vous souhaitez lire en +ce type de fichiers dans R, vous pouvez utiliser la fonction `read.csv2()`. Il +se comporte exactement comme `read.csv()` mais utilise des paramètres différents pour +la décimale et les séparateurs de champ. Si vous travaillez avec un autre format +, ils peuvent tous deux être spécifiés par l'utilisateur. Consultez l'aide pour +`read.csv()` en tapant `?read.csv` pour en savoir plus. Il existe également la fonction +`read.delim()` pour lire des fichiers de données séparés par des tabulations. Il est important de +noter que toutes ces fonctions sont en fait des fonctions wrapper pour +la fonction principale `read.table()` avec différents arguments. En tant que tel, +les données ci-dessus auraient également pu être chargées en utilisant `read.table()` +avec l'argument de séparation comme `,`. Le code est comme suit: ```{r, eval=TRUE, purl=TRUE} rna <- read.table(file = "data/rnaseq.csv", @@ -124,130 +124,130 @@ rna <- read.table(file = "data/rnaseq.csv", header = TRUE) ``` -The header argument has to be set to TRUE to be able to read the -headers as by default `read.table()` has the header argument set to +L'argument d'en-tête doit être défini sur TRUE pour pouvoir lire les en-têtes +car par défaut `read.table()` a l'argument d'en-tête défini sur FALSE. -## What are data frames? +## Que sont les trames de données ? -Data frames are the _de facto_ data structure for most tabular data, -and what we use for statistics and plotting. 
+Les trames de données sont la structure de données _de facto_ pour la plupart des données tabulaires, +et ce que nous utilisons pour les statistiques et le traçage. -A data frame can be created by hand, but most commonly they are -generated by the functions `read.csv()` or `read.table()`; in other -words, when importing spreadsheets from your hard drive (or the web). +Un bloc de données peut être créé à la main, mais le plus souvent ils sont +générés par les fonctions `read.csv()` ou `read.table()` ; en d'autres termes +, lors de l'importation de feuilles de calcul depuis votre disque dur (ou le Web). -A data frame is the representation of data in the format of a table -where the columns are vectors that all have the same length. Because -columns are vectors, each column must contain a single type of data -(e.g., characters, integers, factors). For example, here is a figure -depicting a data frame comprising a numeric, a character, and a -logical vector. +Une trame de données est la représentation de données sous le format d'un tableau +où les colonnes sont des vecteurs qui ont tous la même longueur. Étant donné que les colonnes +sont des vecteurs, chaque colonne doit contenir un seul type de données +(par exemple, des caractères, des entiers, des facteurs). Par exemple, voici une figure +représentant une trame de données comprenant un chiffre, un caractère et un vecteur logique +. ![](./fig/data-frame.svg) -We can see this when inspecting the <b>str</b>ucture of a data frame -with the function `str()`: +Nous pouvons le voir lors de l'inspection de la <b>str</b>ucture d'une trame de données +avec la fonction `str()` : ```{r} -str(rna) +str(arn) ``` -## Inspecting `data.frame` Objects +## Inspection des objets `data.frame` -We already saw how the functions `head()` and `str()` can be useful to -check the content and the structure of a data frame. Here is a -non-exhaustive list of functions to get a sense of the -content/structure of the data. Let's try them out! 
+Nous avons déjà vu comment les fonctions `head()` et `str()` peuvent être utiles pour +vérifier le contenu et la structure d'une trame de données. Voici une +liste non exhaustive de fonctions pour avoir une idée du +contenu/structure des données. Essayons-les ! -**Size**: +**Taille**: -- `dim(rna)` - returns a vector with the number of rows as the first - element, and the number of columns as the second element (the - **dim**ensions of the object). -- `nrow(rna)` - returns the number of rows. -- `ncol(rna)` - returns the number of columns. +- `dim(rna)` - renvoie un vecteur avec le nombre de lignes comme premier élément + et le nombre de colonnes comme deuxième élément (les + **dim**ensions de l'objet ). +- `nrow(rna)` - renvoie le nombre de lignes. +- `ncol(rna)` - renvoie le nombre de colonnes. -**Content**: +**Contenu**: -- `head(rna)` - shows the first 6 rows. -- `tail(rna)` - shows the last 6 rows. +- `head(rna)` - affiche les 6 premières lignes. +- `tail(rna)` - affiche les 6 dernières lignes. -**Names**: +**Des noms**: -- `names(rna)` - returns the column names (synonym of `colnames()` for - `data.frame` objects). -- `rownames(rna)` - returns the row names. +- `names(rna)` - renvoie les noms de colonnes (synonyme de `colnames()` pour les objets + `data.frame`). +- `rownames(rna)` - renvoie les noms de lignes. -**Summary**: +**Résumé**: -- `str(rna)` - structure of the object and information about the - class, length and content of each column. -- `summary(rna)` - summary statistics for each column. +- `str(rna)` - structure de l'objet et informations sur la classe + , longueur et contenu de chaque colonne. +- `summary(rna)` - statistiques récapitulatives pour chaque colonne. -Note: most of these functions are "generic", they can be used on other types of -objects besides `data.frame`. +Remarque : la plupart de ces fonctions sont "génériques", elles peuvent être utilisées sur d'autres types d'objets +en plus de `data.frame`. 
-::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge: +## Défi: -Based on the output of `str(rna)`, can you answer the following -questions? +Sur la base du résultat de `str(rna)`, pouvez-vous répondre aux +questions suivantes ? -- What is the class of the object `rna`? -- How many rows and how many columns are in this object? +- Quelle est la classe de l’objet « rna » ? +- Combien de lignes et combien de colonnes y a-t-il dans cet objet ? -::::::::::::::: solution +::::::::::::::: solution ## Solution -- class: data frame -- how many rows: 66465, how many columns: 11 +- classe : trame de données +- combien de lignes : 66465, combien de colonnes : 11 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Indexing and subsetting data frames +## Indexation et sous-ensemble de trames de données -Our `rna` data frame has rows and columns (it has 2 dimensions); if we -want to extract some specific data from it, we need to specify the -"coordinates" we want. Row numbers come first, followed by -column numbers. However, note that different ways of specifying these -coordinates lead to results with different classes. +Notre bloc de données « rna » comporte des lignes et des colonnes (il a 2 dimensions) ; si nous +voulons en extraire des données spécifiques, nous devons spécifier les +"coordonnées" que nous voulons. Les numéros de ligne viennent en premier, suivis des numéros de colonne +. Cependant, notez que différentes manières de spécifier ces coordonnées +conduisent à des résultats avec des classes différentes. 
```{r, eval=FALSE, purl=TRUE} -# first element in the first column of the data frame (as a vector) +# premier élément de la première colonne du bloc de données (sous forme de vecteur) rna[1, 1] -# first element in the 6th column (as a vector) -rna[1, 6] -# first column of the data frame (as a vector) +# premier élément de la 6ème colonne (sous forme de vecteur) +rna [1, 6] +# première colonne du bloc de données (sous forme de vecteur) rna[, 1] -# first column of the data frame (as a data.frame) +# première colonne du bloc de données (sous forme de data.frame ) rna[1] -# first three elements in the 7th column (as a vector) +# les trois premiers éléments de la 7ème colonne (en tant que vecteur) rna[1:3, 7] -# the 3rd row of the data frame (as a data.frame) +# la 3ème ligne de la trame de données (en tant que data.frame) rna[3, ] -# equivalent to head_rna <- head(rna) +# équivalent à head_rna <- head(rna) head_rna <- rna[1:6, ] head_rna ``` -`:` is a special function that creates numeric vectors of integers in -increasing or decreasing order, test `1:10` and `10:1` for -instance. See section @ref(sec:genvec) for details. +`:` est une fonction spéciale qui crée des vecteurs numériques d'entiers dans +ordre croissant ou décroissant, testez `1:10` et `10:1` pour l'instance +. Voir la section @ref(sec:genvec) pour plus de détails. 
-You can also exclude certain indices of a data frame using the "`-`" sign: +Vous pouvez également exclure certains indices d'un bloc de données à l'aide du signe "`-`" : ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## The whole data frame, except the first column -rna[-c(7:66465), ] ## Equivalent to head(rna) +rna[, -1] ## La trame de données entière, sauf la première colonne +rna[-c(7:66465), ] ## Équivalent à head(rna) ``` -Data frames can be subsetted by calling indices (as shown previously), -but also by calling their column names directly: +Les trames de données peuvent être sous-ensembles en appelant des indices (comme indiqué précédemment), +mais aussi en appelant directement leurs noms de colonnes : ```{r, eval=FALSE, purl=TRUE} rna["gene"] # Result is a data.frame @@ -256,37 +256,37 @@ rna[["gene"]] # Result is a vector rna$gene # Result is a vector ``` -In RStudio, you can use the autocompletion feature to get the full and -correct names of the columns. +Dans RStudio, vous pouvez utiliser la fonction de saisie semi-automatique pour obtenir les noms complets et +corrects des colonnes. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge +## Défi -1. Create a `data.frame` (`rna_200`) containing only the data in - row 200 of the `rna` dataset. +1. Créez un `data.frame` (`rna_200`) contenant uniquement les données de + ligne 200 de l'ensemble de données `rna`. -2. Notice how `nrow()` gave you the number of rows in a `data.frame`? +2. Remarquez comment `nrow()` vous a donné le nombre de lignes dans un `data.frame` ? -- Use that number to pull out just that last row in the initial - `rna` data frame. +- Utilisez ce numéro pour extraire uniquement la dernière ligne de la trame de données initiale + `rna`. -- Compare that with what you see as the last row using `tail()` to - make sure it's meeting expectations. 
+- Comparez cela avec ce que vous voyez comme la dernière ligne en utilisant `tail()` pour + vous assurer qu'elle répond aux attentes. -- Pull out that last row using `nrow()` instead of the row number. +- Extrayez cette dernière ligne en utilisant `nrow()` au lieu du numéro de ligne. -- Create a new data frame (`rna_last`) from that last row. +- Créez un nouveau bloc de données (`rna_last`) à partir de cette dernière ligne. -3. Use `nrow()` to extract the row that is in the middle of the - `rna` dataframe. Store the content of this row in an object - named `rna_middle`. +3. Utilisez `nrow()` pour extraire la ligne qui se trouve au milieu du dataframe + `rna`. Stockez le contenu de cette ligne dans un objet + nommé `rna_middle`. -4. Combine `nrow()` with the `-` notation above to reproduce the - behavior of `head(rna)`, keeping just the first through 6th - rows of the rna dataset. +4. Combinez `nrow()` avec la notation `-` ci-dessus pour reproduire le comportement + de `head(rna)`, en ne conservant que la première à la 6ème + lignes de l'ensemble de données rna. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -294,81 +294,81 @@ correct names of the columns. ## 1. rna_200 <- rna[200, ] ## 2. -## Saving `n_rows` to improve readability and reduce duplication -n_rows <- nrow(rna) +## Sauvegarde de `n_rows` pour améliorer la lisibilité et réduire la duplication +n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. rna_middle <- rna[n_rows / 2, ] -## 4. +## 4. rna_head <- rna[-(7:n_rows), ] ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Factors +## Facteurs -Factors represent **categorical data**. They are stored as integers -associated with labels and they can be ordered or unordered. While -factors look (and often behave) like character vectors, they are -actually treated as integer vectors by R. So you need to be very -careful when treating them as strings.
+Les facteurs représentent des **données catégorielles**. Ils sont stockés sous forme d'entiers +associés aux étiquettes et ils peuvent être ordonnés ou non. Alors que les facteurs +ressemblent (et se comportent souvent) à des vecteurs de caractères, ils sont +en fait traités comme des vecteurs entiers par R. Vous devez donc être très +prudent lorsque vous les traitez comme des chaînes. -Once created, factors can only contain a pre-defined set of values, -known as _levels_. By default, R always sorts levels in alphabetical -order. For instance, if you have a factor with 2 levels: +Une fois créés, les facteurs ne peuvent contenir qu'un ensemble prédéfini de valeurs, +appelées _niveaux_. Par défaut, R trie toujours les niveaux par ordre alphabétique. +Par exemple, si vous avez un facteur à 2 niveaux : ```{r, purl=TRUE} -sex <- factor(c("male", "female", "female", "male", "female")) +sex <- factor(c("male", "female", "female", "male", "female")) ``` R will assign `1` to the level `"female"` and `2` to the level `"male"` (because `f` comes before `m`, even though the first element -in this vector is `"male"`). You can see this by using the function -`levels()` and you can find the number of levels using `nlevels()`: +in this vector is `"male"`). Vous pouvez le voir en utilisant la fonction +`levels()` et vous pouvez trouver le nombre de niveaux en utilisant `nlevels()` : ```{r, purl=TRUE} -levels(sex) -nlevels(sex) +levels(sex) +nlevels(sex) ``` -Sometimes, the order of the factors does not matter, other times you -might want to specify the order because it is meaningful (e.g., "low", -"medium", "high"), it improves your visualization, or it is required -by a particular type of analysis.
Here, one way to reorder our levels -in the `sex` vector would be: +Parfois, l'ordre des facteurs n'a pas d'importance, d'autres fois vous +souhaiterez peut-être spécifier l'ordre car il est significatif (par exemple, "faible", +"moyen", "élevé"), il améliore votre visualisation, ou il est requis +par un type particulier d'analyse. Ici, une façon de réorganiser nos niveaux +dans le vecteur « sexe » serait : ```{r, purl=TRUE} -sex ## current order -sex <- factor(sex, levels = c("male", "female")) -sex ## after re-ordering +sex ## commande actuelle +sex <- factor(sex,levels = c("male", "female")) +sex ## après la nouvelle commande ``` -In R's memory, these factors are represented by integers (1, 2, 3), -but are more informative than integers because factors are self -describing: `"female"`, `"male"` is more descriptive than `1`, -`2`. Which one is "male"? You wouldn't be able to tell just from the -integer data. Factors, on the other hand, have this information built-in. -It is particularly helpful when there are many levels (like the -gene biotype in our example dataset). +Dans la mémoire de R, ces facteurs sont représentés par des entiers (1, 2, 3), +mais sont plus informatifs que les entiers car les facteurs sont auto-descriptifs + : `"femelle"`, `"mâle" ` est plus descriptif que `1`, +`2`. Lequel est « mâle » ? Vous ne seriez pas en mesure de le savoir uniquement à partir des données entières +. Les facteurs, en revanche, intègrent cette information. +Ceci est particulièrement utile lorsqu'il existe de nombreux niveaux (comme le biotype du gène +dans notre exemple d'ensemble de données). -When your data is stored as a factor, you can use the `plot()` -function to get a quick glance at the number of observations -represented by each factor level. Let's look at the number of males -and females in our data. 
+Lorsque vos données sont stockées sous forme de facteur, vous pouvez utiliser la fonction `plot()` +pour avoir un aperçu rapide du nombre d'observations +représenté par chaque niveau de facteur. Regardons le nombre de mâles +et de femelles dans nos données. ```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -plot(sex) +plot(sex) ``` -### Converting to character +### Conversion en caractères -If you need to convert a factor to a character vector, you use +Si vous devez convertir un facteur en vecteur de caractères, vous utilisez `as.character(x)`. ```{r, purl=TRUE} -as.character(sex) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -409,45 +409,45 @@ as.character(sex) <!-- vector `year_fct` inside the square brackets --> -### Renaming factors +### Renommer des facteurs -If we want to rename these factor, it is sufficient to change its -levels: +Si l'on veut renommer ces facteurs, il suffit de changer leurs niveaux +: ```{r, purl=TRUE} -levels(sex) -levels(sex) <- c("M", "F") -sex -plot(sex) +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) ``` -:::::::::::::::::::::::::::::::::::::: challenge +:::::::::::::::::::::::::::::::::::::: défi -## Challenge: +## Défi: -- Rename "F" and "M" to "Female" and "Male" respectively. +- Renommez « F » et « M » respectivement en « Female » et « Male ». -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, eval=TRUE, purl=TRUE} -levels(sex) -levels(sex) <- c("Male", "Female") +levels(sex) +levels(sex) <- c("Male", "Female") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge: +## Défi: -We have seen how data frames are created when using `read.csv()`, but -they can also be created by hand with the `data.frame()` function.
-There are a few mistakes in this hand-crafted `data.frame`. Can you -spot and fix them? Don't hesitate to experiment! +Nous avons vu comment les trames de données sont créées lors de l'utilisation de `read.csv()`, mais +elles peuvent également être créées à la main avec la fonction `data.frame()`. +Il y a quelques erreurs dans ce « data.frame » fabriqué à la main. Pouvez-vous +les repérer et les réparer ? N'hésitez pas à expérimenter ! ```{r, eval=FALSE} animal_data <- data.frame( @@ -456,54 +456,54 @@ animal_data <- data.frame( weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: solution +::::::::::::::: solution ## Solution -- missing quotations around the names of the animals -- missing one entry in the "feel" column (probably for one of the furry animals) -- missing one comma in the weight column +- guillemets manquants autour des noms des animaux +- il manque une entrée dans la colonne "sensation" (probablement pour l'un des animaux à fourrure) +- il manque une virgule dans la colonne poids ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge: +## Défi: -Can you predict the class for each of the columns in the following -example? +Pouvez-vous prédire la classe de chacune des colonnes dans l'exemple +suivant ? -Check your guesses using `str(country_climate)`: +Vérifiez vos suppositions en utilisant `str(country_climate)` : -- Are they what you expected? Why? Why not? +- Sont-ils ce à quoi vous vous attendiez ? Pourquoi? Pourquoi pas? -- Try again by adding `stringsAsFactors = TRUE` after the last - variable when creating the data frame. What is happening now? - `stringsAsFactors` can also be set when reading text-based - spreadsheets into R using `read.csv()`. +- Réessayez en ajoutant `stringsAsFactors = TRUE` après la dernière variable + lors de la création du bloc de données. 
Que se passe-t-il maintenant ? + `stringsAsFactors` peut également être défini lors de la lecture de feuilles de calcul + basées sur du texte dans R à l'aide de `read.csv()`. ```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), - northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + country = c("Canada", "Panama", "Afrique du Sud", "Australie"), + climate = c("froid", "chaud", "tempéré", "chaud/tempéré"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) - ) +) ``` -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), - northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + country = c("Canada", "Panama", "Afrique du Sud", "Australie"), + climate = c("froid", "chaud", "tempéré", "chaud/tempéré"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) str(country_climate) @@ -511,260 +511,260 @@ str(country_climate) ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: + +La conversion automatique du type de données est parfois une bénédiction, parfois un +désagrément. Sachez qu'elle existe, apprenez les règles et vérifiez que les données +que vous importez dans R sont du type correct dans votre bloc de données. Sinon, utilisez-la +à votre avantage pour détecter les erreurs qui auraient pu être introduites lors de la saisie des données +(une lettre dans une colonne qui ne doit contenir que des chiffres par exemple).
-The automatic conversion of data type is sometimes a blessing, sometimes an -annoyance. Be aware that it exists, learn the rules, and double check that data -you import in R are of the correct type within your data frame. If not, use it -to your advantage to detect mistakes that might have been introduced during data -entry (a letter in a column that should only contain numbers for instance). +Apprenez-en plus dans ce tutoriel RStudio -Learn more in this RStudio -tutorial ## Matrices -Before proceeding, now that we have learnt about data frames, let's -recap package installation and learn about a new data type, namely the -`matrix`. Like a `data.frame`, a matrix has two dimensions, rows and -columns. But the major difference is that all cells in a `matrix` must -be of the same type: `numeric`, `character`, `logical`, ... In that -respect, matrices are closer to a `vector` than a `data.frame`. +Avant de continuer, maintenant que nous avons découvert les trames de données, +récapitulons l'installation des packages et découvrons un nouveau type de données, à savoir la +matrice (`matrix`). Comme un `data.frame`, une matrice a deux dimensions, des lignes et +des colonnes. Mais la différence majeure est que toutes les cellules d'une `matrix` doivent +être du même type : `numeric`, `character`, `logical`, ... À cet égard, +les matrices sont plus proches d'un `vector` que d'un `data.frame`. -The default constructor for a matrix is `matrix`. It takes a vector of -values to populate the matrix and the number of row and/or -columns[^ncol]. The values are sorted along the columns, as illustrated -below. +Le constructeur par défaut d'une matrice est `matrix`. Il prend un vecteur de +valeurs pour remplir la matrice et le nombre de lignes et/ou de +colonnes[^ncol]. Les valeurs sont rangées le long des colonnes, comme illustré +ci-dessous.
```{r mat1, purl=TRUE} -m <- matrix(1:9, ncol = 3, nrow = 3) +m <- matrix(1:9, ncol = 3, nrow = 3) m ``` -[^ncol]: Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. +[^ncol]: Le nombre de lignes ou le nombre de colonnes suffit, l'autre pouvant être déduit de la longueur des valeurs. Essayez de voir ce qui se passe si les valeurs et le nombre de lignes/colonnes ne concordent pas. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Défi: -Using the function `installed.packages()`, create a `character` matrix -containing the information about all packages currently installed on -your computer. Explore it. +À l'aide de la fonction `installed.packages()`, créez une matrice de type `character` +contenant les informations sur tous les packages actuellement installés sur +votre ordinateur. Explorez-la. -::::::::::::::: solution +::::::::::::::: solution ## Solution: ```{r pkg_sln, eval=FALSE, purl=TRUE} -## create the matrix -ip <- installed.packages() +## créer la matrice +ip <- installed.packages() head(ip) -## try also View(ip) -## number of package +## essayez aussi View(ip) +## nombre de packages nrow(ip) -## names of all installed packages +## noms de tous les packages installés rownames(ip) -## type of information we have about each package -colnames(ip) +## type d'informations dont nous disposons sur chaque package +colnames(ip) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -It is often useful to create large random data matrices as test -data. The exercise below asks you to create such a matrix with random -data drawn from a normal distribution of mean 0 and standard deviation -1, which can be done with the `rnorm()` function.
+Il est souvent utile de créer de grandes matrices de données aléatoires comme données de test. +L'exercice ci-dessous vous demande de créer une telle matrice avec des données aléatoires +tirées d'une distribution normale de moyenne 0 et d'écart type +1, ce qui peut être fait avec la fonction `rnorm()`. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Défi: -Construct a matrix of dimension 1000 by 3 of normally distributed data -(mean 0, standard deviation 1) +Construisez une matrice de dimension 1000 par 3 de données normalement distribuées +(moyenne 0, écart type 1) -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r rnormmat_sln, purl=TRUE} set.seed(123) -m <- matrix(rnorm(3000), ncol = 3) +m <- matrix(rnorm(3000), ncol = 3) dim(m) -head(m) +head(m) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Formatting Dates +## Formatage des dates -One of the most common issues that new (and experienced!) R users have -is converting date and time information into a variable that is -appropriate and usable during analyses. +L'un des problèmes les plus courants rencontrés par les utilisateurs de R, nouveaux (et expérimentés !), est +la conversion des informations de date et d'heure en une variable +appropriée et utilisable lors des analyses. -### Note on dates in spreadsheet programs +### Remarque sur les dates dans les tableurs -Dates in spreadsheets are generally stored in a single column. While -this seems the most natural way to record dates, it actually is not -best practice. A spreadsheet application will display the dates in a -seemingly correct way (to a human observer) but how it actually -handles and stores the dates may be problematic. It is often much -safer to store dates with YEAR, MONTH and DAY in separate columns or -as YEAR and DAY-OF-YEAR in separate columns.
+Les dates dans les feuilles de calcul sont généralement stockées dans une seule colonne. Bien que +cela semble la manière la plus naturelle d'enregistrer des dates, ce n'est en réalité pas +la meilleure pratique. Un tableur affichera les dates d'une manière +apparemment correcte (pour un observateur humain), mais la façon dont elle +gère et stocke réellement les dates peut être problématique. Il est souvent +plus sûr de stocker les dates avec ANNÉE, MOIS et JOUR dans des colonnes séparées ou +comme ANNÉE et JOUR DE L'ANNÉE dans des colonnes séparées. -Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, -Gnumeric, ... have different (and often incompatible) ways of encoding -dates (even for the same program between versions and operating -systems). Additionally, Excel can turn things that aren't dates into -dates -(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, -OCT4. So if you're avoiding the date format overall, it's easier to -identify these issues. +Tableurs tels que LibreOffice, Microsoft Excel, OpenOffice, +Gnumeric, ... ont des manières différentes (et souvent incompatibles) d'encoder les dates +(même pour le même programme entre les versions et les systèmes d'exploitation +). De plus, Excel peut [transformer des éléments qui ne sont pas des dates en dates +](https://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not -learned/) +(@Zeeberg:2004), par exemple des noms ou des identifiants comme MAR1, DEC1, +OCT4. Donc, si vous évitez globalement le format de date, il est plus facile d' +identifier ces problèmes. -The Dates as +La section Dates as data -section of the Data Carpentry lesson provides additional insights -about pitfalls of dates with spreadsheets. +de la leçon Data Carpentry fournit des informations supplémentaires +sur les pièges des dates avec des feuilles de calcul. 
-We are going to use the `ymd()` function from the package -**`lubridate`** (which belongs to the **`tidyverse`**; learn more -[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed -as part of the **`tidyverse`** installation. When you load the -**`tidyverse`** (`library(tidyverse)`), the core packages (the -packages used in most data analyses) get loaded. **`lubridate`** -however does not belong to the core tidyverse, so you have to load it -explicitly with `library(lubridate)`. +Nous allons utiliser la fonction `ymd()` du package +**`lubridate`** (qui appartient au **`tidyverse`** ; en savoir plus +[ici](https://www.tidyverse.org/)). **`lubridate`** est installé +dans le cadre de l'installation de **`tidyverse`**. Lorsque vous chargez le +**`tidyverse`** (`library(tidyverse)`), les packages de base (les packages +utilisés dans la plupart des analyses de données) sont chargés. **`lubridate`** +n'appartient cependant pas au noyau du tidyverse, vous devez donc le charger +explicitement avec `library(lubridate)`. -Start by loading the required package: +Commencez par charger le package requis : ```{r loadlibridate, message=FALSE, purl=TRUE} -library("lubridate") +library("lubridate") ``` -`ymd()` takes a vector representing year, month, and day, and converts -it to a `Date` vector. `Date` is a class of data recognized by R as -being a date and can be manipulated as such. The argument that the -function requires is flexible, but, as a best practice, is a character -vector formatted as "YYYY-MM-DD". +`ymd()` prend un vecteur représentant l'année, le mois et le jour, et le convertit +en un vecteur `Date`. `Date` est une classe de données reconnue par R comme +étant une date et peut être manipulée comme telle. L'argument requis par la fonction +est flexible, mais, à titre de bonne pratique, il s'agit d'un vecteur de caractères +au format "AAAA-MM-JJ".
-Let's create a date object and inspect the structure: +Créons un objet date et inspectons la structure : ```{r, purl=TRUE} -my_date <- ymd("2015-01-01") -str(my_date) +my_date <- ymd("2015-01-01") +str(my_date) ``` -Now let's paste the year, month, and day separately - we get the same result: +Collons maintenant l'année, le mois et le jour séparément - nous obtenons le même résultat : ```{r, purl=TRUE} -# sep indicates the character to use to separate each component +# sep indique le caractère à utiliser pour séparer chaque composant my_date <- ymd(paste("2015", "1", "1", sep = "-")) -str(my_date) +str(my_date) ``` -Let's now familiarise ourselves with a typical date manipulation -pipeline. The small data below has stored dates in different `year`, -`month` and `day` columns. +Familiarisons-nous maintenant avec un pipeline typique de manipulation de dates. +Le petit jeu de données ci-dessous stocke les dates dans des colonnes distinctes `year`, +`month` et `day`. ```{r, purl=TRUE} -x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), - day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` -Now we apply this function to the `x` dataset. We first create a -character vector from the `year`, `month`, and `day` columns of `x` -using `paste()`: +Nous appliquons maintenant cette fonction à l'ensemble de données `x`.
Nous créons d'abord un vecteur de caractères +à partir des colonnes `year`, `month` et `day` de `x` +en utilisant `paste()` : ```{r, purl=TRUE} -paste(x$year, x$month, x$day, sep = "-") +paste(x$year, x$month, x$day, sep = "-") ``` -This character vector can be used as the argument for `ymd()`: +Ce vecteur de caractères peut être utilisé comme argument pour `ymd()` : ```{r, purl=TRUE} -ymd(paste(x$year, x$month, x$day, sep = "-")) +ymd(paste(x$year, x$month, x$day, sep = "-")) ``` -The resulting `Date` vector can be added to `x` as a new column called `date`: +Le vecteur `Date` résultant peut être ajouté à `x` en tant que nouvelle colonne appelée `date` : ```{r, purl=TRUE} x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) -str(x) # notice the new column, with 'date' as the class +str(x) # remarquez la nouvelle colonne, avec 'Date' comme classe ``` -Let's make sure everything worked correctly. One way to inspect the -new column is to use `summary()`: +Assurons-nous que tout a fonctionné correctement. Une façon d'inspecter la nouvelle colonne +est d'utiliser `summary()` : ```{r, purl=TRUE} -summary(x$date) +summary(x$date) ``` -Note that `ymd()` expects to have the year, month and day, in that -order. If you have for instance day, month and year, you would need +Notez que `ymd()` s'attend à avoir l'année, le mois et le jour, dans cet ordre. +Si vous avez par exemple le jour, le mois et l'année, vous aurez besoin de `dmy()`. ```{r, purl=TRUE} -dmy(paste(x$day, x$month, x$year, sep = "-")) +dmy(paste(x$day, x$month, x$year, sep = "-")) ``` -`lubdridate` has many functions to address all date variations. +`lubridate` a de nombreuses fonctions pour gérer toutes les variations de date.
-## Summary of R objects +## Résumé des objets R -So far, we have seen several types of R object varying in the number -of dimensions and whether they could store a single or multiple data -types: +Jusqu'à présent, nous avons vu plusieurs types d'objets R qui diffèrent par le nombre +de dimensions et par le fait de pouvoir stocker un ou plusieurs types de données : + -- **`vector`**: one dimension (they have a length), single type of data. -- **`matrix`**: two dimensions, single type of data. -- **`data.frame`**: two dimensions, one type per column. +- **`vector`** : une dimension (ils ont une longueur), un seul type de données. +- **`matrix`** : deux dimensions, un seul type de données. +- **`data.frame`** : deux dimensions, un type par colonne. -## Lists +## Listes -A data type that we haven't seen yet, but that is useful to know, and -follows from the summary that we have just seen are lists: +Un type de données que nous n'avons pas encore vu, mais qu'il est utile de connaître et qui +découle du résumé que nous venons de voir, est la liste : -- **`list`**: one dimension, every item can be of a different data - type. +- **`list`** : une dimension, chaque élément peut être d'un type de données + différent. -Below, let's create a list containing a vector of numbers, characters, -a matrix, a dataframe and another list: +Ci-dessous, créons une liste contenant un vecteur de nombres, un vecteur de caractères, +une matrice, un data.frame et une autre liste : ```{r list0, purl=TRUE} -l <- list(1:10, ## numeric - letters, ## character - installed.packages(), ## a matrix - cars, ## a data.frame - list(1, 2, 3)) ## a list -length(l) +l <- list(1:10, ## numérique + letters, ## caractère + installed.packages(), ## une matrice + cars, ## un data.frame + list(1, 2, 3)) ## une liste +length(l) str(l) ``` -List subsetting is done using `[]` to subset a new sub-list or `[[]]` -to extract a single element of that list (using indices or names, if -the list is named).
+Le sous-ensemble de liste est effectué en utilisant `[]` pour sous-ensembler une nouvelle sous-liste ou `[[]]` +pour extraire un seul élément de cette liste (en utilisant des indices ou des noms, si +la liste est appelé). ```{r, purl=TRUE} -l[[1]] ## first element -l[1:2] ## a list of length 2 -l[1] ## a list of length 1 +l[[1]] ## premier élément +l[1:2] ## une liste de longueur 2 +l[1] ## une liste de longueur 1 ``` -## Exporting and saving tabular data {#sec:exportandsave} +## Exportation et sauvegarde de données tabulaires {#sec:exportandsave} -We have seen how to read a text-based spreadsheet into R using the -`read.table` family of functions. To export a `data.frame` to a -text-based spreadsheet, we can use the `write.table` set of functions -(`write.csv`, `write.delim`, ...). They all take the variable to be -exported and the file to be exported to. For example, to export the -`rna` data to the `my_rna.csv` file in the `data_output` -directory, we would execute: +Nous avons vu comment lire une feuille de calcul textuelle dans R à l'aide de la famille de fonctions +`read.table`. Pour exporter un `data.frame` vers une feuille de calcul texte +, nous pouvons utiliser l'ensemble de fonctions `write.table` +(`write.csv`, `write.delim`, ...). Ils prennent tous la variable à +exportée et le fichier vers lequel exporter. Par exemple, pour exporter les données +`rna` vers le fichier `my_rna.csv` dans le répertoire `data_output` +, nous exécuterions : ```{r, eval=FALSE, purl=TRUE} write.csv(rna, file = "data_output/my_rna.csv") @@ -777,8 +777,8 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. 
-:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: keypoints -- Tabular data in R +- Données tabulaires dans R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: From ae775e7999ca58e6b4581bebc8c50af9ed8a2a93 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:35 +0900 Subject: [PATCH 200/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 844 +++++++++---------- 1 file changed, 422 insertions(+), 422 deletions(-) diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd index 8506d99ee..3cfcff2c7 100644 --- a/locale/zh/episodes/25-starting-with-data.Rmd +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -1,300 +1,300 @@ --- -source: Rmd -title: Starting with data -teaching: 30 -exercises: 30 +source: Rmd +title: 从数据开始 +teaching: 30 +exercises: 30 --- ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +:::::::::::::::::::::::::::::::::::::::::: objectives -- Describe what a `data.frame` is. -- Load external data from a .csv file into a data frame. -- Summarize the contents of a data frame. -- Describe what a factor is. -- Convert between strings and factors. -- Reorder and rename factors. -- Format dates. -- Export and save data. +- 描述什么是 `data.frame`。 +- 将 .csv 文件中的外部数据加载到数据框中。 +- 总结数据框的内容。 +- 描述什么是因子。 +- 在字符串和因子之间转换。 +- 重新排序并重命名因子。 +- 格式化日期。 +- 导出并保存数据。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- First data analysis in R - -:::::::::::::::::::::::::::::::::::::::::::::::::: - -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson.
- -## Presentation of the gene expression data - -We are going to use part of the data published by Blackmore , _The -effect of upper-respiratory infection on transcriptomic changes in the -CNS_. The goal of the study was to determine the effect of an -upper-respiratory infection on changes in RNA transcription occurring -in the cerebellum and spinal cord post infection. Gender matched eight -week old C57BL/6 mice were inoculated with saline or with Influenza A by -intranasal route and transcriptomic changes in the cerebellum and -spinal cord tissues were evaluated by RNA-seq at days 0 -(non-infected), 4 and 8. - -The dataset is stored as a comma-separated values (CSV) file. Each row -holds information for a single RNA expression measurement, and the first eleven -columns represent: - -| Column | Description | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| gene | The name of the gene that was measured | -| sample | The name of the sample the gene expression was measured in | -| expression | The value of the gene expression | -| organism | The organism/species - here all data stem from mice | -| age | The age of the mouse (all mice were 8 weeks here) | -| sex | The sex of the mouse | -| infection | The infection state of the mouse, i.e. infected with Influenza A or not infected. | -| strain | The Influenza A strain. | -| time | The duration of the infection (in days). | -| tissue | The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord. | -| mouse | The mouse unique identifier. | - -We are going to use the R function `download.file()` to download the -CSV file that contains the gene expression data, and we will use -`read.csv()` to load into memory the content of the CSV file as an -object of class `data.frame`. Inside the `download.file` command, the -first entry is a character string with the source URL. 
This source URL -downloads a CSV file from a GitHub repository. The text after the -comma (`"data/rnaseq.csv"`) is the destination of the file on your -local machine. You'll need to have a folder on your machine called -`"data"` where you'll download the file. So this command downloads the -remote file, names it `"rnaseq.csv"` and adds it to a preexisting -folder named `"data"`. +- 使用 R 进行首次数据分析 + +::::::::::::::::::::::::::::::::::::::::::::::::::::: + +> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 +> 可视化_课程。 + +## 基因表达数据的呈现 + +我们将使用 Blackmore 发布的部分数据,_上呼吸道感染对 +中枢神经系统转录组变化的影响_。 该研究的目的是确定 +上呼吸道感染对感染后小脑和脊髓中发生的 +RNA 转录变化的影响。 性别匹配的八 +周龄 C57BL/6 小鼠通过 +鼻内途径接种盐水或甲型流感病毒,并在第 0 +(未感染)、第 4 和第 8 天通过 RNA-seq 评估小脑和 +脊髓组织中的转录组变化。 + +数据集存储为逗号分隔值(CSV)文件。 每一行 +包含单个 RNA 表达测量的信息,前十一列 +代表: + +| 列 | 描述 | +| ---------- | -------------------- | +| gene | 被测量的基因名称 | +| sample | 测量基因表达的样本名称 | +| expression | 基因表达值 | +| organism | 生物体/物种 - 此处所有数据均来自小鼠 | +| age | 小鼠的年龄(这里所有的小鼠都是 8 周龄) | +| sex | 小鼠的性别 | +| infection | 小鼠的感染状态,即感染甲型流感或未感染。 | +| strain | 甲型流感病毒株。 | +| time | 感染持续时间(以天为单位)。 | +| tissue | 用于基因表达实验的组织,即小脑或脊髓。 | +| mouse | 小鼠的唯一标识符。 | + +我们将使用 R 函数 `download.file()` 下载包含基因表达数据的 +CSV 文件,并使用 +`read.csv()` 将 CSV 文件的内容作为 +类 `data.frame` 的对象加载到内存中。 在 `download.file` 命令中, +第一个条目是带有源 URL 的字符串。 此源 URL +从 GitHub 存储库下载 CSV 文件。 +逗号后面的文本(`"data/rnaseq.csv"`)是该文件在您 +本地机器上的目标位置。 您需要在您的机器上建立一个名为 +`"data"` 的文件夹,您将在该文件夹中下载文件。 因此,此命令下载 +远程文件,将其命名为 `"rnaseq.csv"` 并将其添加到名为 `"data"` 的预先存在的 +文件夹中。 ```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") + destfile = "data/rnaseq.csv") ``` -You are now ready to load the data: +您现在可以加载数据了: ```{r, eval=TRUE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -This statement doesn't produce any output because, as you might -recall, assignments don't display anything.
If we want to check that -our data has been loaded, we can see the contents of the data frame by -typing its name: +此语句不会产生任何输出,因为您可能 +还记得,赋值不会显示任何内容。 如果我们想检查 +我们的数据是否已加载,我们可以通过 +输入其名称来查看数据框的内容: ```{r, eval=FALSE} -rna +rna ``` -Wow... that was a lot of output. At least it means the data loaded -properly. Let's check the top (the first 6 lines) of this data frame -using the function `head()`: +哇... 那是大量的输出。 至少这意味着数据已正确加载。 +让我们使用函数 `head()` 检查这个数据框 +的顶部(前 6 行): ```{r, purl=TRUE} head(rna) -## Try also +## 也尝试 ## View(rna) ``` -**Note** +**注意** -`read.csv()` assumes that fields are delineated by commas, however, in -several countries, the comma is used as a decimal separator and the -semicolon (;) is used as a field delineator. If you want to read in -this type of files in R, you can use the `read.csv2()` function. It -behaves exactly like `read.csv()` but uses different parameters for -the decimal and the field separators. If you are working with another -format, they can be both specified by the user. Check out the help for -`read.csv()` by typing `?read.csv` to learn more. There is also the -`read.delim()` function for reading tab separated data files. It is important to -note that all of these functions are actually wrapper functions for -the main `read.table()` function with different arguments. As such, -the data above could have also been loaded by using `read.table()` -with the separation argument as `,`.
The code is as follows: +`read.csv()` 假定字段由逗号分隔,但是在 +几个国家/地区,逗号用作小数分隔符,而 +分号 (;) 用作字段分隔符。 如果您想在 R 中读取 +这种类型的文件,您可以使用 `read.csv2()` 函数。 它的 +行为与 `read.csv()` 完全相同,但对 +小数点和字段分隔符使用不同的参数。 如果您使用另一种 +格式,则用户可以同时指定它们。 通过输入 `?read.csv` 查看 +`read.csv()` 的帮助以了解更多信息。 还有 +`read.delim()` 函数用于读取制表符分隔的数据文件。 值得注意的是 +所有这些函数实际上都是 +主 `read.table()` 函数的包装函数,具有不同的参数。 因此, +上述数据也可以通过使用 `read.table()` +以分隔参数 `,` 来加载。 代码如下: ```{r, eval=TRUE, purl=TRUE} -rna <- read.table(file = "data/rnaseq.csv", - sep = ",", - header = TRUE) +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) ``` -The header argument has to be set to TRUE to be able to read the -headers as by default `read.table()` has the header argument set to -FALSE. +必须将 header 参数设置为 TRUE 才能读取 +标题,因为默认情况下 `read.table()` 将 header 参数设置为 +FALSE。 -## What are data frames? +## 什么是数据框? -Data frames are the _de facto_ data structure for most tabular data, -and what we use for statistics and plotting. +数据框是大多数表格数据的_事实上的_数据结构, +以及我们用于统计和绘图的数据结构。 A data frame can be created by hand, but most commonly they are generated by the functions `read.csv()` or `read.table()`; in other words, when importing spreadsheets from your hard drive (or the web). -A data frame is the representation of data in the format of a table -where the columns are vectors that all have the same length. Because -columns are vectors, each column must contain a single type of data -(e.g., characters, integers, factors). For example, here is a figure -depicting a data frame comprising a numeric, a character, and a -logical vector.
+数据框是以表格 +格式表示的数据,其中列是所有具有相同长度的向量。 因为 +列是向量,所以每一列必须包含单一类型的数据 +(例如,字符、整数、因子)。 例如,这里有一个图, +描绘了一个包含数字、字符和 +逻辑向量的数据框。 -![](./fig/data-frame.svg) +![](./fig/data-frame.svg) -We can see this when inspecting the <b>str</b>ucture of a data frame -with the function `str()`: +当我们用函数 `str()` 检查数据框 +的结构(<b>str</b>ucture)时,我们可以看到这一点: ```{r} -str(rna) +str(rna) ``` -## Inspecting `data.frame` Objects +## 检查 `data.frame` 对象 -We already saw how the functions `head()` and `str()` can be useful to -check the content and the structure of a data frame. Here is a -non-exhaustive list of functions to get a sense of the -content/structure of the data. Let's try them out! +我们已经看到了函数 `head()` 和 `str()` 如何有助于 +检查数据框的内容和结构。 这里有一个 +非详尽的函数列表,可以帮助您了解 +数据的内容/结构。 让我们尝试一下吧! -**Size**: +**大小**: -- `dim(rna)` - returns a vector with the number of rows as the first - element, and the number of columns as the second element (the - **dim**ensions of the object). -- `nrow(rna)` - returns the number of rows. -- `ncol(rna)` - returns the number of columns. +- `dim(rna)` - 返回一个向量,其行数作为第一个 + 元素,列数作为第二个元素(对象的 + 维度,**dim**ensions)。 +- `nrow(rna)` - 返回行数。 +- `ncol(rna)` - 返回列数。 -**Content**: +**内容**: -- `head(rna)` - shows the first 6 rows. -- `tail(rna)` - shows the last 6 rows. +- `head(rna)` - 显示前 6 行。 +- `tail(rna)` - 显示最后 6 行。 -**Names**: +**名称**: -- `names(rna)` - returns the column names (synonym of `colnames()` for - `data.frame` objects). -- `rownames(rna)` - returns the row names. +- `names(rna)` - 返回列名(对于 + `data.frame` 对象,`colnames()` 的同义词)。 +- `rownames(rna)` - 返回行名称。 -**Summary**: +**摘要**: -- `str(rna)` - structure of the object and information about the - class, length and content of each column. -- `summary(rna)` - summary statistics for each column. +- `str(rna)` - 对象的结构和有关 + 类、长度和每列内容的信息。 +- `summary(rna)` - 每列的汇总统计数据。 -Note: most of these functions are "generic", they can be used on other types of -objects besides `data.frame`.
+注意:这些函数大部分都是“通用的”,除了“data.frame”之外,它们还可以用于其他类型的 +对象。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Based on the output of `str(rna)`, can you answer the following -questions? +根据 `str(rna)` 的输出,你能回答以下 +个问题吗? -- What is the class of the object `rna`? -- How many rows and how many columns are in this object? +- 对象“rna”的类别是什么? +- 这个对象有多少行、多少列? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -- class: data frame -- how many rows: 66465, how many columns: 11 +- 类别:数据框 +- 行数:66465,列数:11 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Indexing and subsetting data frames +## 索引和子集数据框 -Our `rna` data frame has rows and columns (it has 2 dimensions); if we -want to extract some specific data from it, we need to specify the -"coordinates" we want. Row numbers come first, followed by -column numbers. However, note that different ways of specifying these -coordinates lead to results with different classes. +我们的 `rna` 数据框有行和列(它有 2 个维度);如果我们 +想从中提取一些特定数据,我们需要指定我们想要的 +“坐标”。 行号在前,然后是 +列号。 但是,请注意,指定这些 +坐标的不同方式会导致不同类别的结果。 ```{r, eval=FALSE, purl=TRUE} -# first element in the first column of the data frame (as a vector) +# 数据框第一列的第一个元素(作为向量) rna[1, 1] -# first element in the 6th column (as a vector) +# 第六列的第一个元素(作为向量) rna[1, 6] -# first column of the data frame (as a vector) +# 数据框的第一列(作为向量) rna[, 1] -# first column of the data frame (as a data.frame) +# 数据框的第一列(作为数据框) rna[1] -# first three elements in the 7th column (as a vector) +# 第七列的前三个元素(作为向量) rna[1:3, 7] -# the 3rd row of the data frame (as a data.frame) +# 数据框的第 3 行(作为数据框) rna[3, ] -# equivalent to head_rna <- head(rna) +# 等同于 head_rna <- head(rna) head_rna <- rna[1:6, ] head_rna ``` -`:` is a special function that creates numeric vectors of integers in -increasing or decreasing order, test `1:10` and `10:1` for -instance. 
See section @ref(sec:genvec) for details. +`:` 是一个特殊函数,它按 +的升序或降序创建整数数值向量,例如对 +实例测试 `1:10` 和 `10:1`。 有关详细信息,请参阅部分@ref(sec:genvec)。 -You can also exclude certain indices of a data frame using the "`-`" sign: +您还可以使用“-”符号排除数据框的某些索引: ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## The whole data frame, except the first column -rna[-c(7:66465), ] ## Equivalent to head(rna) +rna[, -1] ## 整个数据框,除了第一列 +rna[-c(7:66465), ] ## 等同于 head(rna) ``` -Data frames can be subsetted by calling indices (as shown previously), -but also by calling their column names directly: +数据框可以通过调用索引(如前所示) +进行子集化,也可以通过直接调用其列名进行子集化: ```{r, eval=FALSE, purl=TRUE} -rna["gene"] # Result is a data.frame -rna[, "gene"] # Result is a vector -rna[["gene"]] # Result is a vector -rna$gene # Result is a vector +rna["gene"] # 结果是一个数据框 +rna[, "gene"] # 结果是一个向量 +rna[["gene"]] # 结果是一个向量 +rna$gene # 结果是一个向量 ``` In RStudio, you can use the autocompletion feature to get the full and correct names of the columns. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -1. Create a `data.frame` (`rna_200`) containing only the data in - row 200 of the `rna` dataset. +1. 创建一个 `data.frame` (`rna_200`),其中仅包含 `rna` 数据集中 + 行 200 中的数据。 -2. Notice how `nrow()` gave you the number of rows in a `data.frame`? +2. 注意“nrow()”如何给出“data.frame”中的行数? -- Use that number to pull out just that last row in the initial - `rna` data frame. +- 使用该数字提取初始 + `rna` 数据框中的最后一行。 -- Compare that with what you see as the last row using `tail()` to - make sure it's meeting expectations. +- 将其与您看到的最后一行进行比较,使用`tail()`到 + 确保它符合预期。 -- Pull out that last row using `nrow()` instead of the row number. +- 使用“nrow()”而不是行号来拉出最后一行。 -- Create a new data frame (`rna_last`) from that last row. +- 从最后一行创建一个新的数据框(“rna_last”)。 -3. Use `nrow()` to extract the row that is in the middle of the - `rna` dataframe. Store the content of this row in an object - named `rna_middle`. +3. 
使用 `nrow()` 提取位于 + `rna` 数据框中间的行。 将此行的内容存储在名为 `rna_middle` 的对象 + 中。 -4. Combine `nrow()` with the `-` notation above to reproduce the - behavior of `head(rna)`, keeping just the first through 6th - rows of the rna dataset. +4. 将 `nrow()` 与上面的 `-` 符号结合起来,重现 `head(rna)` 的 + 行为,仅保留 rna 数据集的第一到第六 + 行。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=TRUE} ## 1. rna_200 <- rna[200, ] ## 2. -## Saving `n_rows` to improve readability and reduce duplication +## 保存 `n_rows` 以提高可读性并减少重复 n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. @@ -305,70 +305,70 @@ rna_head <- rna[-(7:n_rows), ] ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Factors +## 因子 -Factors represent **categorical data**. They are stored as integers -associated with labels and they can be ordered or unordered. While +因子代表**分类数据**。 它们存储为与标签相关联的整数, +并且可以有序或无序。 While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings. -Once created, factors can only contain a pre-defined set of values, -known as _levels_. By default, R always sorts levels in alphabetical -order. For instance, if you have a factor with 2 levels: +一旦创建,因子只能包含一组预定义的值, +称为_水平_(levels)。 默认情况下,R 总是按字母顺序 +对水平进行排序。 例如,如果你有一个具有 2 个水平的因子: ```{r, purl=TRUE} -sex <- factor(c("male", "female", "female", "male", "female")) +sex <- factor(c("male", "female", "female", "male", "female")) ``` -R will assign `1` to the level `"female"` and `2` to the level -`"male"` (because `f` comes before `m`, even though the first element -in this vector is `"male"`).
You can see this by using the function -`levels()` and you can find the number of levels using `nlevels()`: +R 将为级别 `"female"` 分配 `1`,为级别 +`"male"` 分配 `2`(因为 `f` 位于 `m` 之前,即使此向量中的第一个元素 +是 `"male"`)。 您可以使用函数 +`levels()` 来看到这一点,并且可以使用 `nlevels()` 来找到级别的数量: ```{r, purl=TRUE} -levels(sex) -nlevels(sex) +levels(sex) +nlevels(sex) ``` -Sometimes, the order of the factors does not matter, other times you -might want to specify the order because it is meaningful (e.g., "low", -"medium", "high"), it improves your visualization, or it is required -by a particular type of analysis. Here, one way to reorder our levels -in the `sex` vector would be: +有时,因子级别的顺序并不重要,有时你 +可能想要指定顺序,因为它很有意义(例如,“低”、 +“中”、“高”),它可以改善你的可视化,或者它是特定类型的分析所必需的 +。 这里,重新排序 `sex` 向量中的级别 +的一种方法是: ```{r, purl=TRUE} -sex ## current order +sex ## 当前顺序 sex <- factor(sex, levels = c("male", "female")) -sex ## after re-ordering +sex ## 重新排序后 ``` -In R's memory, these factors are represented by integers (1, 2, 3), -but are more informative than integers because factors are self -describing: `"female"`, `"male"` is more descriptive than `1`, -`2`. Which one is "male"? You wouldn't be able to tell just from the -integer data. Factors, on the other hand, have this information built-in. -It is particularly helpful when there are many levels (like the -gene biotype in our example dataset). +在 R 的内存中,这些因子由整数 (1, 2, 3) 表示, +但比整数更具信息量,因为因子是自我 +描述的:`"female"`、`"male"` 比 `1`、 +`2` 更具描述性。 哪一个是 "male"? 您无法仅从 +整数数据来判断。 另一方面,因子本身就包含这些信息。 +当存在多个级别时它特别有用(例如我们的示例数据集中的 +基因生物型)。 -When your data is stored as a factor, you can use the `plot()` -function to get a quick glance at the number of observations -represented by each factor level. Let's look at the number of males -and females in our data.
+当您的数据被存储为一个因子时,您可以使用 `plot()` +函数快速浏览每个因子级别所代表的观测值 +的数量。 让我们看看数据中的男性 +和女性的数量。 ```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -plot(sex) +plot(sex) ``` -### Converting to character +### 转换为字符 -If you need to convert a factor to a character vector, you use -`as.character(x)`. +如果您需要将因子转换为字符向量,则可以使用 +`as.character(x)`。 ```{r, purl=TRUE} -as.character(sex) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -409,100 +409,100 @@ as.character(sex) <!-- vector `year_fct` inside the square brackets --> -### Renaming factors +### 重命名因子 -If we want to rename these factor, it is sufficient to change its -levels: +如果我们想重命名这些因子,只需改变其 +级别即可: ```{r, purl=TRUE} -levels(sex) -levels(sex) <- c("M", "F") -sex -plot(sex) +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) ``` -:::::::::::::::::::::::::::::::::::::: challenge +:::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -- Rename "F" and "M" to "Female" and "Male" respectively. +- 将 "F" 和 "M" 分别重命名为 "Female" 和 "Male"。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, eval=TRUE, purl=TRUE} -levels(sex) -levels(sex) <- c("Male", "Female") +levels(sex) +levels(sex) <- c("Male", "Female") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -We have seen how data frames are created when using `read.csv()`, but -they can also be created by hand with the `data.frame()` function. -There are a few mistakes in this hand-crafted `data.frame`. Can you -spot and fix them? Don't hesitate to experiment! +我们已经看到了使用 `read.csv()` 时如何创建数据框,但是 +也可以使用 `data.frame()` 函数手动创建它们。 +这个手工制作的 `data.frame` 中存在一些错误。 你能 +发现并修复它们吗? 不要犹豫去尝试吧!
```{r, eval=FALSE} animal_data <- data.frame( - animal = c(dog, cat, sea cucumber, sea urchin), - feel = c("furry", "squishy", "spiny"), + animal = c(狗, 猫, 海参, 海胆), + feel = c("毛茸茸的", "柔软的", "多刺的"), weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -- missing quotations around the names of the animals -- missing one entry in the "feel" column (probably for one of the furry animals) -- missing one comma in the weight column +- 动物名称周围缺少引号 +- “感觉”栏中缺少一项(可能是针对其中一种毛茸茸的动物) +- 体重栏缺少一个逗号 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Can you predict the class for each of the columns in the following -example? +您能预测以下 +示例中每一列的类别吗? -Check your guesses using `str(country_climate)`: +使用 `str(country_climate)` 检查你的猜测: -- Are they what you expected? Why? Why not? +- 它们符合你的预期吗? 为什么? 为什么不? -- Try again by adding `stringsAsFactors = TRUE` after the last - variable when creating the data frame. What is happening now? - `stringsAsFactors` can also be set when reading text-based - spreadsheets into R using `read.csv()`. +- 在创建数据框时,通过在最后一个 + 变量后添加 `stringsAsFactors = TRUE` 再试一次。 现在发生了什么事? 
+ 当使用“read.csv()”将基于文本的 + 电子表格读入 R 时,也可以设置“stringsAsFactors”。 ```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), + country = c("加拿大", "巴拿马", "南非", "澳大利亚"), + Climate = c("冷", "热", "温和", "热/温和"), + Temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) - ) +) ``` -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( - country = c("Canada", "Panama", "South Africa", "Australia"), - climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10, 30, 18, "15"), + country = c("加拿大", "巴拿马", "南非", "澳大利亚"), + Climate = c("冷", "热", "温和", "热/温和"), + Temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) @@ -511,263 +511,263 @@ str(country_climate) ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: The automatic conversion of data type is sometimes a blessing, sometimes an -annoyance. Be aware that it exists, learn the rules, and double check that data -you import in R are of the correct type within your data frame. If not, use it -to your advantage to detect mistakes that might have been introduced during data -entry (a letter in a column that should only contain numbers for instance). +annoyance. 请注意它的存在,学习规则,并仔细检查您在 R 中导入的数据 +在您的数据框内是否属于正确类型。 如果没有,请利用它 +来检测在数据 +输入期间可能引入的错误(例如,某一列中应该只包含数字的字母)。 -Learn more in this RStudio -tutorial +欲了解更多信息,请参阅 RStudio +教程 -## Matrices +## 矩阵 -Before proceeding, now that we have learnt about data frames, let's -recap package installation and learn about a new data type, namely the -`matrix`. 
Like a `data.frame`, a matrix has two dimensions, rows and -columns. But the major difference is that all cells in a `matrix` must -be of the same type: `numeric`, `character`, `logical`, ... In that -respect, matrices are closer to a `vector` than a `data.frame`. +在继续之前,既然我们已经了解了数据框,让我们 +回顾一下包安装并了解一种新的数据类型,即 +`matrix`(矩阵)。 与 `data.frame` 类似,矩阵有两个维度:行和 +列。 但主要的区别在于 `matrix` 中的所有单元格必须 +属于同一类型:`numeric`、`character`、`logical`…… 从这 +方面来看,矩阵更接近于 `vector` 而不是 `data.frame`。 -The default constructor for a matrix is `matrix`. It takes a vector of -values to populate the matrix and the number of row and/or -columns[^ncol]. The values are sorted along the columns, as illustrated -below. +矩阵的默认构造函数是 `matrix`。 它接受一个用于填充矩阵的 +值向量以及行数和/或 +列数[^ncol]。 这些值按列填充,如下所示 +。 ```{r mat1, purl=TRUE} -m <- matrix(1:9, ncol = 3, nrow = 3) +m <- matrix(1:9, ncol = 3, nrow = 3) m ``` -[^ncol]: Either the number of rows or columns are enough, as the other one can be deduced from the length of the values. Try out what happens if the values and number of rows/columns don't add up. +[^ncol]: 行数或列数只需给出其一,因为另一个可以从值的长度推断出来。 尝试一下如果值的数量与行数/列数不匹配会发生什么。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -Using the function `installed.packages()`, create a `character` matrix -containing the information about all packages currently installed on -your computer. Explore it.
+使用函数 `installed.packages()`,创建一个 `character` 矩阵 +,其中包含有关当前安装在 +计算机上的所有包的信息。 探索它。 -::::::::::::::: solution +::::::::::::::: solution -## Solution: +## 解决方案: ```{r pkg_sln, eval=FALSE, purl=TRUE} -## create the matrix -ip <- installed.packages() +## 创建矩阵 +ip <- installed.packages() head(ip) -## try also View(ip) -## number of package +## 也尝试 View(ip) +## 包的数量 nrow(ip) -## names of all installed packages +## 所有已安装包的名称 rownames(ip) -## type of information we have about each package +## 关于每个包我们拥有的信息类型 colnames(ip) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -It is often useful to create large random data matrices as test -data. The exercise below asks you to create such a matrix with random -data drawn from a normal distribution of mean 0 and standard deviation -1, which can be done with the `rnorm()` function. +创建大型随机数据矩阵作为测试 +数据通常很有用。 下面的练习要求你创建这样一个矩阵,其中包含从均值为 0、标准差为 +1 的正态分布中抽取的随机 +数据,这可以使用 `rnorm()` 函数完成。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## 挑战: -Construct a matrix of dimension 1000 by 3 of normally distributed data -(mean 0, standard deviation 1) +构建一个维度为 1000 × 3 的正态分布数据矩阵 +(平均值为 0,标准差为 1) -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r rnormmat_sln, purl=TRUE} -set.seed(123) -m <- matrix(rnorm(3000), ncol = 3) -dim(m) -head(m) +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Formatting Dates +## 格式化日期 -One of the most common issues that new (and experienced!)
R 用户有 +将日期和时间信息转换为 +适当且可在分析期间使用的变量。 -### Note on dates in spreadsheet programs +### 注意电子表格程序中的日期 -Dates in spreadsheets are generally stored in a single column. While -this seems the most natural way to record dates, it actually is not -best practice. A spreadsheet application will display the dates in a -seemingly correct way (to a human observer) but how it actually -handles and stores the dates may be problematic. It is often much +电子表格中的日期通常存储在单个列中。 虽然 +这似乎是记录日期最自然的方式,但实际上它并不是 +最佳做法。 电子表格应用程序将以 +看似正确的方式(对于人类观察者而言)显示日期,但它实际 +处理和存储日期的方式可能会有问题。 It is often much safer to store dates with YEAR, MONTH and DAY in separate columns or as YEAR and DAY-OF-YEAR in separate columns. -Spreadsheet programs such as LibreOffice, Microsoft Excel, OpenOffice, -Gnumeric, ... have different (and often incompatible) ways of encoding -dates (even for the same program between versions and operating -systems). Additionally, Excel can turn things that aren't dates into -dates -(@Zeeberg:2004), for example names or identifiers like MAR1, DEC1, -OCT4. So if you're avoiding the date format overall, it's easier to -identify these issues. - -The Dates as -data -section of the Data Carpentry lesson provides additional insights -about pitfalls of dates with spreadsheets. - -We are going to use the `ymd()` function from the package -**`lubridate`** (which belongs to the **`tidyverse`**; learn more -[here](https://www.tidyverse.org/)). . **`lubridate`** gets installed -as part of the **`tidyverse`** installation. When you load the -**`tidyverse`** (`library(tidyverse)`), the core packages (the -packages used in most data analyses) get loaded. **`lubridate`** -however does not belong to the core tidyverse, so you have to load it -explicitly with `library(lubridate)`. - -Start by loading the required package: +电子表格程序,例如 LibreOffice、Microsoft Excel、OpenOffice、 +Gnumeric、... 
有不同的(并且通常不兼容的)方式来编码 +日期(即使对于同一程序,不同版本和操作系统也存在不同的 +)。 此外,Excel 可以将非日期的内容转换为 +日期 +(@Zeeberg:2004),例如 MAR1、DEC1、 +OCT4 等名称或标识符。 因此,如果您总体上避免使用日期格式,则 +更容易识别这些问题。 + +Data Carpentry 课程的“日期作为 +数据” +部分提供了有关电子表格中日期陷阱的更多见解 +。 + +我们将使用包 +**`lubridate`** 中的 `ymd()` 函数(属于 **`tidyverse`**;了解更多信息 +[这里](https://www.tidyverse.org/))。 **`lubridate`** 作为 **`tidyverse`** 安装的一部分进行安装 +。 当你加载 +**`tidyverse`** (`library(tidyverse)`) 时,核心包(大多数数据分析中使用的 +包)也会被加载。 然而,**`lubridate`** +不属于核心 tidyverse,因此您必须使用 `library(lubridate)` 明确加载它 +。 + +首先加载所需的包: ```{r loadlibridate, message=FALSE, purl=TRUE} -library("lubridate") +library("lubridate") ``` -`ymd()` takes a vector representing year, month, and day, and converts -it to a `Date` vector. `Date` is a class of data recognized by R as -being a date and can be manipulated as such. The argument that the -function requires is flexible, but, as a best practice, is a character -vector formatted as "YYYY-MM-DD". +`ymd()` 采用代表年、月、日的向量,并将其 +转换为 `Date` 向量。 `Date` 是 R 识别的一类数据, +表示日期,并且可以这样进行操作。 +函数所需的参数很灵活,但最佳实践是使用格式为 +"YYYY-MM-DD" 的字符向量。 -Let's create a date object and inspect the structure: +让我们创建一个日期对象并检查其结构: ```{r, purl=TRUE} my_date <- ymd("2015-01-01") str(my_date) ``` -Now let's paste the year, month, and day separately - we get the same result: +现在让我们分别粘贴年份、月份和日期——我们得到相同的结果: ```{r, purl=TRUE} -# sep indicates the character to use to separate each component +# sep 表示用于分隔每个组件的字符 my_date <- ymd(paste("2015", "1", "1", sep = "-")) str(my_date) ``` -Let's now familiarise ourselves with a typical date manipulation -pipeline. The small data below has stored dates in different `year`, +现在让我们熟悉典型的日期操作 +管道。 The small data below has stored dates in different `year`, `month` and `day` columns.
```{r, purl=TRUE} x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), - day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` -Now we apply this function to the `x` dataset. We first create a -character vector from the `year`, `month`, and `day` columns of `x` -using `paste()`: +现在我们将此函数应用于 `x` 数据集。 我们首先使用 `paste()` 从 `x` +的 `year`、`month` 和 `day` 列创建一个 +字符向量: ```{r, purl=TRUE} -paste(x$year, x$month, x$day, sep = "-") +paste(x$year, x$month, x$day, sep = "-") ``` -This character vector can be used as the argument for `ymd()`: +该字符向量可用作 `ymd()` 的参数: ```{r, purl=TRUE} -ymd(paste(x$year, x$month, x$day, sep = "-")) +ymd(paste(x$year, x$month, x$day, sep = "-")) ``` -The resulting `Date` vector can be added to `x` as a new column called `date`: +生成的 `Date` 向量可以添加到 `x` 作为名为 `date` 的新列: ```{r, purl=TRUE} x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) -str(x) # notice the new column, with 'date' as the class +str(x) # 注意新列,以 'date' 为类 ``` -Let's make sure everything worked correctly. One way to inspect the -new column is to use `summary()`: +让我们确保一切正常。 检查 +新列的一种方法是使用 `summary()`: ```{r, purl=TRUE} -summary(x$date) +summary(x$date) ``` -Note that `ymd()` expects to have the year, month and day, in that -order. If you have for instance day, month and year, you would need -`dmy()`. +请注意,`ymd()` 需要按年、月、日 +的顺序提供日期。 例如,如果您有日、月和年,您将需要 +`dmy()`。 ```{r, purl=TRUE} -dmy(paste(x$day, x$month, x$year, sep = "-")) +dmy(paste(x$day, x$month, x$year, sep = "-")) ``` -`lubdridate` has many functions to address all date variations.
+`lubridate` 有许多函数可以处理各种日期格式。 -## Summary of R objects +## R 对象摘要 -So far, we have seen several types of R object varying in the number -of dimensions and whether they could store a single or multiple data -types: +到目前为止,我们已经看到了几种类型的 R 对象,它们的维数 +不同,并且它们可以存储单个或多个数据 +类型: -- **`vector`**: one dimension (they have a length), single type of data. -- **`matrix`**: two dimensions, single type of data. -- **`data.frame`**: two dimensions, one type per column. +- **`vector`**:一维(有长度),单一类型的数据。 +- **`matrix`**:二维,单一类型的数据。 +- **`data.frame`**:两个维度,每列一种类型。 -## Lists +## 列表 -A data type that we haven't seen yet, but that is useful to know, and -follows from the summary that we have just seen are lists: +我们还没有见过这种数据类型,但了解它很有用,并且根据我们刚刚看到的总结, +是列表: -- **`list`**: one dimension, every item can be of a different data - type. +- **`list`**:一维,每个项目可以是不同的数据 + 类型。 -Below, let's create a list containing a vector of numbers, characters, -a matrix, a dataframe and another list: +下面,让我们创建一个列表,其中包含数字向量、字符向量、 +矩阵、数据框和另一个列表: ```{r list0, purl=TRUE} -l <- list(1:10, ## numeric - letters, ## character - installed.packages(), ## a matrix - cars, ## a data.frame - list(1, 2, 3)) ## a list -length(l) +l <- list(1:10, ## 数字 + letters, ## 字符 + installed.packages(), ## 矩阵 + cars, ## 数据框 + list(1, 2, 3)) ## 列表 +length(l) str(l) ``` -List subsetting is done using `[]` to subset a new sub-list or `[[]]` -to extract a single element of that list (using indices or names, if -the list is named). +列表子集化是使用 `[]` 来子集化新的子列表或使用 `[[]]` +来提取该列表的单个元素(使用索引或名称,如果 +列表已命名)。 ```{r, purl=TRUE} -l[[1]] ## first element -l[1:2] ## a list of length 2 -l[1] ## a list of length 1 +l[[1]] ## 第一个元素 +l[1:2] ## 长度为 2 的列表 +l[1] ## 长度为 1 的列表 ``` -## Exporting and saving tabular data {#sec:exportandsave} +## 导出和保存表格数据 {#sec:exportandsave} -We have seen how to read a text-based spreadsheet into R using the -`read.table` family of functions.
To export a `data.frame` to a -text-based spreadsheet, we can use the `write.table` set of functions -(`write.csv`, `write.delim`, ...). They all take the variable to be -exported and the file to be exported to. For example, to export the -`rna` data to the `my_rna.csv` file in the `data_output` -directory, we would execute: +我们已经了解了如何使用 +`read.table` 系列函数将基于文本的电子表格读入 R。 要将 `data.frame` 导出到 +基于文本的电子表格,我们可以使用 `write.table` 函数集 +(`write.csv`、`write.delim`,...)。 它们都接受要导出的变量 +和要导出到的文件作为参数。 例如,要将 +`rna` 数据导出到 `data_output` +目录中的 `my_rna.csv` 文件,我们可以执行: ```{r, eval=FALSE, purl=TRUE} -write.csv(rna, file = "data_output/my_rna.csv") +write.csv(rna, file = "data_output/my_rna.csv") ``` This new csv file can now be shared with other collaborators who @@ -777,8 +777,8 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: keypoints -- Tabular data in R +- R 中的表格数据 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 5963d8b917727e63d54cf22873130baa3331ee3b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:37 +0900 Subject: [PATCH 201/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 906 ++++++++++++++++---------------- 1 file changed, 453 insertions(+), 453 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index b50395a63..3683d6c13 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Manipulating and analysing data with dplyr +title: Manipulation et analyse de données avec dplyr teaching: 75 exercises: 75 --- @@ -8,85 +8,85 @@ exercises: 75 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +:::::::::::::::::::::::::::::::::::::::
objectives -- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. -- Describe several of their functions that are extremely useful to - manipulate data. -- Describe the concept of a wide and a long table format, and see - how to reshape a data frame from one format to the other one. -- Demonstrate how to join tables. +- Décrivez l'objectif des packages **`dplyr`** et **`tidyr`**. +- Décrivez plusieurs de leurs fonctions extrêmement utiles pour + manipuler des données. +- Décrivez le concept d'un format de tableau large et long, et voyez + comment remodeler un bloc de données d'un format à l'autre. +- Montrez comment joindre des tables. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: questions -- Data analysis in R using the tidyverse meta-package +- Analyse de données dans R à l'aide du méta-paquet Tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Analyse des données et +> Visualisation dans R pour les écologistes_ de Data Carpentries. -## Data manipulation using **`dplyr`** and **`tidyr`** +## Manipulation des données à l'aide de **`dplyr`** et **`tidyr`** -Bracket subsetting is handy, but it can be cumbersome and difficult to -read, especially for complicated operations.
+Le sous-ensemble entre crochets est pratique, mais il peut être fastidieux et difficile à +lire, en particulier pour les opérations compliquées. -Some packages can greatly facilitate our task when we manipulate data. -Packages in R are basically sets of additional functions that let you -do more stuff. The functions we've been using so far, like `str()` or -`data.frame()`, come built into R; Loading packages can give you access to other -specific functions. Before you use a package for the first time you need to install -it on your machine, and then you should import it in every subsequent -R session when you need it. +Certains packages peuvent grandement faciliter notre tâche lorsque nous manipulons des données. +Les packages dans R sont essentiellement des ensembles de fonctions supplémentaires qui vous permettent +de faire plus de choses. Les fonctions que nous avons utilisées jusqu'à présent, comme `str()` ou +`data.frame()`, sont intégrées à R ; Le chargement de packages peut vous donner accès à d'autres +fonctions spécifiques. Avant d'utiliser un package pour la première fois, vous devez l'installer +sur votre machine, puis vous devez l'importer à chaque +session R suivante lorsque vous en avez besoin. -- The package **`dplyr`** provides powerful tools for data manipulation tasks. - It is built to work directly with data frames, with many manipulation tasks - optimised. +- Le package **`dplyr`** fournit des outils puissants pour les tâches de manipulation de données. + Il est conçu pour fonctionner directement avec des trames de données, avec de nombreuses tâches de manipulation + optimisées. -- As we will see latter on, sometimes we want a data frame to be reshaped to be able - to do some specific analyses or for visualisation. The package **`tidyr`** addresses - this common problem of reshaping data and provides tools for manipulating - data in a tidy way. 
+- Comme nous le verrons plus loin, nous souhaitons parfois qu'un bloc de données soit remodelé pour pouvoir + effectuer des analyses spécifiques ou pour la visualisation. Le package **`tidyr`** résout + ce problème courant de remodelage des données et fournit des outils pour manipuler les + données de manière ordonnée. -To learn more about **`dplyr`** and **`tidyr`** after the workshop, -you may want to check out this handy data transformation with - -and this one about +Pour en savoir plus sur **`dplyr`** et **`tidyr`** après l'atelier, +vous voudrez peut-être consulter ceci transformation de données pratique avec +\*\* +et ceci celui sur . -- The **`tidyverse`** package is an "umbrella-package" that installs - several useful packages for data analysis which work well together, - such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. - These packages help us to work and interact with the data. - They allow us to do many things with your data, such as subsetting, transforming, - visualising, etc. +- Le package **`tidyverse`** est un "package parapluie" qui installe + plusieurs packages utiles pour l'analyse des données qui fonctionnent bien ensemble, + tels que **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. + Ces packages nous aident à travailler et à interagir avec les données. + Ils nous permettent de faire beaucoup de choses avec vos données, comme le sous-ensemble, la transformation, la + visualisation, etc. -If you did the set up, you should have already installed the tidyverse package. -Check to see if you have it by trying to load in from the library: +Si vous avez effectué la configuration, vous devriez déjà avoir installé le package Tidyverse. +Vérifiez si vous l'avez en essayant de le charger depuis la bibliothèque : ```{r, message=FALSE, purl=TRUE} -## load the tidyverse packages, incl. dplyr -library("tidyverse") +## chargez les packages Tidyverse, incl.
dplyr +library("tidyverse") ``` -If you got an error message `there is no package called ‘tidyverse’` then you have not -installed the package yet for this version of R. To install the **`tidyverse`** package type: +Si vous recevez un message d'erreur `there is no package called ‘tidyverse’`, alors vous n'avez pas encore +installé le package pour cette version de R. Pour installer le package **`tidyverse`**, tapez : ```{r, eval=FALSE, purl=TRUE} BiocManager::install("tidyverse") ``` -If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! +Si vous avez dû installer le package **`tidyverse`**, n'oubliez pas de le charger dans cette session R en utilisant la commande `library()` ci-dessus ! -## Loading data with tidyverse +## Chargement de données avec Tidyverse Instead of `read.csv()`, we will read in our data using the `read_csv()` function (notice the `_` instead of the `.`), from the tidyverse package @@ -95,320 +95,320 @@ function (notice the `_` instead of the `.`), from the tidyverse package ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## view the data +## afficher les données rna ``` -Notice that the class of the data is now referred to as a "tibble". +Notez que la classe des données est désormais appelée « tibble ». -Tibbles tweak some of the behaviors of the data frame objects we introduced in the -previously. The data structure is very similar to a data frame. For our purposes -the only differences are that: +Les tibbles modifient certains des comportements des objets de trame de données que nous avons introduits +précédemment. La structure des données est très similaire à une trame de données. Pour nos besoins, +les seules différences sont les suivantes : -1.
Il affiche le type de données de chaque colonne sous son nom. + Notez que \<`dbl`\> est un type de données défini pour contenir des valeurs numériques avec + points décimaux. -2. It only prints the first few rows of data and only as many columns as fit on - one screen. +2. Il imprime uniquement les premières lignes de données et seulement autant de colonnes que peuvent contenir + un écran. -We are now going to learn some of the most common **`dplyr`** functions: +Nous allons maintenant apprendre certaines des fonctions **`dplyr`** les plus courantes : -- `select()`: subset columns -- `filter()`: subset rows on conditions -- `mutate()`: create new columns by using information from other columns -- `group_by()` and `summarise()`: create summary statistics on grouped data -- `arrange()`: sort results -- `count()`: count discrete values +- `select()` : sous-ensemble de colonnes +- `filter()` : sous-ensemble de lignes sur conditions +- `mutate()` : crée de nouvelles colonnes en utilisant les informations d'autres colonnes +- `group_by()` et `summarise()` : créent des statistiques récapitulatives sur des données groupées +- `arrange()` : trier les résultats +- `count()` : compte les valeurs discrètes -## Selecting columns and filtering rows +## Sélection de colonnes et filtrage de lignes -To select columns of a data frame, use `select()`. The first argument -to this function is the data frame (`rna`), and the subsequent -arguments are the columns to keep. +Pour sélectionner les colonnes d'un bloc de données, utilisez `select()`. Le premier argument +de cette fonction est la trame de données (`rna`), et les arguments +suivants sont les colonnes à conserver. ```{r, purl=TRUE} -select(rna, gene, sample, tissue, expression) +sélectionner (ARN, gène, échantillon, tissu, expression) ``` -To select all columns _except_ certain ones, put a "-" in front of -the variable to exclude it. 
+Pour sélectionner toutes les colonnes _sauf_ certaines, mettez un "-" devant +la variable pour l'exclure. ```{r, purl=TRUE} -select(rna, -tissue, -organism) +select(rna, -tissue, -organism) ``` -This will select all the variables in `rna` except `tissue` -and `organism`. +Cela sélectionnera toutes les variables de `rna` sauf `tissue` +et `organism`. -To choose rows based on a specific criteria, use `filter()`: +Pour choisir des lignes en fonction d'un critère spécifique, utilisez `filter()` : ```{r, purl=TRUE} -filter(rna, sex == "Male") +filter(rna, sex == "Male") filter(rna, sex == "Male" & infection == "NonInfected") ``` -Now let's imagine we are interested in the human homologs of the mouse -genes analysed in this dataset. This information can be found in the -last column of the `rna` tibble, named -`hsapiens_homolog_associated_gene_name`. To visualise it easily, we -will create a new table containing just the 2 columns `gene` and +Imaginons maintenant que nous nous intéressions aux homologues humains des gènes +de souris analysés dans cet ensemble de données. Ces informations se trouvent dans la +dernière colonne du tibble `rna`, nommée +`hsapiens_homolog_associated_gene_name`. Pour le visualiser facilement, nous +allons créer un nouveau tableau contenant uniquement les 2 colonnes `gene` et `hsapiens_homolog_associated_gene_name`. ```{r} -genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) -genes +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes ``` -Some mouse genes have no human homologs. These can be retrieved using -`filter()` and the `is.na()` function, that determines whether -something is an `NA`. +Certains gènes de souris n'ont pas d'homologues humains. Ceux-ci peuvent être récupérés en utilisant +`filter()` et la fonction `is.na()`, qui détermine si +quelque chose est un `NA`.
```{r, purl=TRUE} -filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` -If we want to keep only mouse genes that have a human homolog, we can -insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an +Si on veut conserver uniquement les gènes de souris qui ont un homologue humain, on peut +insérer un symbole "!" qui inverse le résultat ; nous demandons donc +chaque ligne où hsapiens\_homolog\_associated\_gene\_name _n'est pas_ un `NA`. ```{r, purl=TRUE} -filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` -## Pipes +## Pipes -What if you want to select and filter at the same time? There are three -ways to do this: use intermediate steps, nested functions, or pipes. +Et si vous souhaitez sélectionner et filtrer en même temps ? Il existe trois +façons de procéder : utiliser des étapes intermédiaires, des fonctions imbriquées ou des pipes. -With intermediate steps, you create a temporary data frame and use -that as input to the next function, like this: +Avec des étapes intermédiaires, vous créez un bloc de données temporaire et l'utilisez +comme entrée de la fonction suivante, comme ceci : ```{r, purl=TRUE} rna2 <- filter(rna, sex == "Male") -rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 <- select(rna2, gene, sample, tissue, expression) rna3 ``` -This is readable, but can clutter up your workspace with lots of -intermediate objects that you have to name individually. With multiple -steps, that can be hard to keep track of. +Ceci est lisible, mais peut encombrer votre espace de travail avec de nombreux +objets intermédiaires que vous devez nommer individuellement. Avec plusieurs étapes, +cela peut être difficile à suivre. -You can also nest functions (i.e.
one function inside of another), -like this: +Vous pouvez également imbriquer des fonctions (c'est-à-dire une fonction dans une autre), +comme ceci : ```{r, purl=TRUE} rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) rna3 ``` -This is handy, but can be difficult to read if too many functions are nested, as -R evaluates the expression from the inside out (in this case, filtering, then selecting). +C'est pratique, mais peut être difficile à lire si trop de fonctions sont imbriquées, car +R évalue l'expression de l'intérieur vers l'extérieur (dans ce cas, filtrer, puis sélectionner). -The last option, _pipes_, are a recent addition to R. Pipes let you take -the output of one function and send it directly to the next, which is useful -when you need to do many things to the same dataset. +La dernière option, les _pipes_, est un ajout récent à R. Les pipes vous permettent de prendre +la sortie d'une fonction et de l'envoyer directement à la suivante, ce qui est utile +lorsque vous devez effectuer de nombreuses opérations sur le même jeu de données. -Pipes in R look like `%>%` (made available via the **`magrittr`** -package) or `|>` (through base R). If you use RStudio, you can type +Les pipes dans R s'écrivent `%>%` (fourni par le package **`magrittr`**) +ou `|>` (en R de base). If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you have a Mac. -In the above code, we use the pipe to send the `rna` dataset first -through `filter()` to keep rows where `sex` is Male, then through -`select()` to keep only the `gene`, `sample`, `tissue`, and -`expression`columns.
+Dans le code ci-dessous, nous utilisons le pipe pour envoyer le jeu de données `rna` d'abord +dans `filter()` pour conserver les lignes où `sex` vaut Male, puis dans +`select()` pour conserver uniquement les colonnes `gene`, `sample`, `tissue` et +`expression`. -The pipe `%>%` takes the object on its left and passes it directly as -the first argument to the function on its right, we don't need to -explicitly include the data frame as an argument to the `filter()` and -`select()` functions any more. +Le pipe `%>%` prend l'objet à sa gauche et le passe directement comme +premier argument de la fonction à sa droite : nous n'avons plus besoin +d'inclure explicitement le bloc de données comme argument des fonctions `filter()` et +`select()`. ```{r, purl=TRUE} rna %>% filter(sex == "Male") %>% select(gene, sample, tissue, expression) ``` -Some may find it helpful to read the pipe like the word "then". For instance, -in the above example, we took the data frame `rna`, _then_ we `filter`ed -for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, -`tissue`, and `expression`. +Certains trouveront peut-être utile de lire le pipe comme le mot « puis ». Par exemple, +dans l'exemple ci-dessus, nous avons pris le bloc de données `rna`, _puis_ nous avons filtré (`filter`) +les lignes avec `sex == "Male"`, _puis_ nous avons sélectionné (`select`) les colonnes `gene`, `sample`, +`tissue` et `expression`. -The **`dplyr`** functions by themselves are somewhat simple, but by -combining them into linear workflows with the pipe, we can accomplish -more complex manipulations of data frames. +Les fonctions **`dplyr`** en elles-mêmes sont assez simples, mais en +les combinant dans des flux de travail linéaires avec le pipe, nous pouvons accomplir +des manipulations plus complexes de blocs de données.
-If we want to create a new object with this smaller version of the data, we -can assign it a new name: +Si nous voulons créer un nouvel objet avec cette version plus petite des données, nous pouvons +lui attribuer un nouveau nom : ```{r, purl=TRUE} rna3 <- rna %>% filter(sex == "Male") %>% select(gene, sample, tissue, expression) rna3 ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge: +## Défi : -Using pipes, subset the `rna` data to keep observations in female mice at time 0, -where the gene has an expression higher than 50000, and retain only the columns -`gene`, `sample`, `time`, `expression` and `age`. +À l'aide de pipes, filtrez les données `rna` pour conserver les observations chez les souris femelles au temps 0, +où le gène a une expression supérieure à 50000, et ne conservez que les colonnes +`gene`, `sample`, `time`, `expression` et `age`. -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r} rna %>% filter(expression > 50000, sex == "Female", time == 0 ) %>% select(gene, sample, time, expression, age) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: ## Mutate -Frequently you'll want to create new columns based on the values of existing -columns, for example to do unit conversions, or to find the ratio of values in two -columns. For this we'll use `mutate()`. +Vous souhaiterez fréquemment créer de nouvelles colonnes basées sur les valeurs des colonnes +existantes, par exemple pour effectuer des conversions d'unités ou pour calculer le rapport des valeurs de deux +colonnes. Pour cela, nous utiliserons `mutate()`.
-To create a new column of time in hours: +Pour créer une nouvelle colonne de temps en heures : ```{r, purl=TRUE} rna %>% mutate(time_hours = time * 24) %>% select(time, time_hours) ``` -You can also create a second new column based on the first new column within the same call of `mutate()`: +Vous pouvez également créer une deuxième nouvelle colonne basée sur la première nouvelle colonne dans le même appel de `mutate()` : ```{r, purl=TRUE} rna %>% mutate(time_hours = time * 24, time_mn = time_hours * 60) %>% select(time, time_hours, time_mn) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Create a new data frame from the `rna` data that meets the following -criteria: contains only the `gene`, `chromosome_name`, -`phenotype_description`, `sample`, and `expression` columns. The expression -values should be log-transformed. This data frame must -only contain genes located on sex chromosomes, associated with a -phenotype\_description, and with a log expression higher than 5. +Créez un nouveau bloc de données à partir des données `rna` qui répond aux critères +suivants : il contient uniquement les colonnes `gene`, `chromosome_name`, +`phenotype_description`, `sample` et `expression`. Les valeurs d'expression +doivent être transformées en log. Ce bloc de données doit +contenir uniquement des gènes situés sur les chromosomes sexuels, associés à une +phenotype\_description, et avec une expression en log supérieure à 5. -**Hint**: think about how the commands should be ordered to produce -this data frame! +**Astuce** : réfléchissez à la façon dont les commandes doivent être ordonnées pour produire +ce bloc de données !
-::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, eval=TRUE, purl=TRUE} rna %>% mutate(expression = log(expression)) %>% select(gene, chromosome_name, phenotype_description, sample, expression) %>% filter(chromosome_name == "X" | chromosome_name == "Y") %>% filter(!is.na(phenotype_description)) %>% filter(expression > 5) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Split-apply-combine data analysis +## Analyse de données « split-apply-combine » -Many data analysis tasks can be approached using the -_split-apply-combine_ paradigm: split the data into groups, apply some -analysis to each group, and then combine the results. **`dplyr`** -makes this very easy through the use of the `group_by()` function. +De nombreuses tâches d'analyse de données peuvent être abordées à l'aide du paradigme +_split-apply-combine_ : divisez les données en groupes, appliquez une analyse +à chaque groupe, puis combinez les résultats. **`dplyr`** +rend cela très facile grâce à la fonction `group_by()`. ```{r} rna %>% group_by(gene) ``` -The `group_by()` function doesn't perform any data processing, it -groups the data into subsets: in the example above, our initial -`tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$gene))` groups based on the `gene` variable. +La fonction `group_by()` n'effectue aucun traitement de données, elle +regroupe les données en sous-ensembles : dans l'exemple ci-dessus, notre +`tibble` initial de `r nrow(rna)` observations est divisé en +`r length(unique(rna$gene))` groupes en fonction de la variable `gene`.
-We could similarly decide to group the tibble by the samples: +On pourrait de même décider de regrouper le tibble par échantillons : ```{r} rna %>% group_by(sample) ``` -Here our initial `tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$sample))` groups based on the `sample` variable. +Ici, notre `tibble` initial de `r nrow(rna)` observations est divisé en +`r length(unique(rna$sample))` groupes en fonction de la variable `sample`. -Once the data has been grouped, subsequent operations will be -applied on each group independently. +Une fois les données regroupées, les opérations suivantes seront +appliquées à chaque groupe indépendamment. -### The `summarise()` function +### La fonction `summarise()` -`group_by()` is often used together with `summarise()`, which -collapses each group into a single-row summary of that group. +`group_by()` est souvent utilisé avec `summarise()`, qui +réduit chaque groupe en un résumé sur une seule ligne de ce groupe. -`group_by()` takes as arguments the column names that contain the -**categorical** variables for which you want to calculate the summary -statistics. So to compute the mean `expression` by gene: +`group_by()` prend comme arguments les noms de colonnes qui contiennent les variables +**catégorielles** pour lesquelles vous souhaitez calculer les statistiques récapitulatives +.
Donc, pour calculer l'expression moyenne par gène : ```{r} rna %>% group_by(gene) %>% summarise(mean_expression = mean(expression)) ``` -We could also want to calculate the mean expression levels of all genes in each sample: +Nous pourrions également vouloir calculer les niveaux d’expression moyens de tous les gènes dans chaque échantillon : ```{r} rna %>% group_by(sample) %>% summarise(mean_expression = mean(expression)) ``` -But we can can also group by multiple columns: +Mais on peut aussi regrouper par plusieurs colonnes : ```{r} rna %>% group_by(gene, infection, time) %>% summarise(mean_expression = mean(expression)) ``` -Once the data is grouped, you can also summarise multiple variables at the same -time (and not necessarily on the same variable). For instance, we could add a -column indicating the median `expression` by gene and by condition: +Une fois les données regroupées, vous pouvez également résumer plusieurs variables en même temps +(et pas nécessairement sur la même variable). Par exemple, nous pourrions ajouter une colonne +indiquant l'`expression` médiane par gène et par condition : ```{r, purl=TRUE} rna %>% group_by(gene, infection, time) %>% summarise(mean_expression = mean(expression), median_expression = median(expression)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Calculate the mean expression level of gene "Dok3" by timepoints. +Calculez le niveau d’expression moyen du gène « Dok3 » par point temporel.
-::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -416,87 +416,87 @@ Calculate the mean expression level of gene "Dok3" by timepoints. rna %>% filter(gene == "Dok3") %>% group_by(time) %>% summarise(mean = mean(expression)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -### Counting +### Comptage -When working with data, we often want to know the number of observations found -for each factor or combination of factors. For this task, **`dplyr`** provides -`count()`. For example, if we wanted to count the number of rows of data for -each infected and non-infected samples, we would do: +Lorsque nous travaillons avec des données, nous souhaitons souvent connaître le nombre d'observations trouvées +pour chaque facteur ou combinaison de facteurs. Pour cette tâche, **`dplyr`** fournit +`count()`. Par exemple, si nous voulions compter le nombre de lignes de données pour +les échantillons infectés et non infectés, nous ferions : ```{r, purl=TRUE} rna %>% count(infection) ``` -The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: +La fonction `count()` est un raccourci pour quelque chose que nous avons déjà vu : regrouper selon une variable, puis résumer en comptant le nombre d'observations dans chaque groupe. En d'autres termes, `rna %>% count(infection)` équivaut à : ```{r, purl=TRUE} rna %>% group_by(infection) %>% summarise(n = n()) ``` -The previous example shows the use of `count()` to count the number of rows/observations -for _one_ factor (i.e., `infection`).
-If we wanted to count a _combination of factors_, such as `infection` and `time`, -we would specify the first and the second factor as the arguments of `count()`: +L'exemple précédent montre l'utilisation de `count()` pour compter le nombre de lignes/observations +pour _un_ facteur (c'est-à-dire `infection`). +Si nous voulions compter une _combinaison de facteurs_, telle que `infection` et `time`, +nous spécifierions le premier et le deuxième facteur comme arguments de `count()` : ```{r, purl=TRUE} rna %>% count(infection, time) ``` -which is equivalent to this: +ce qui équivaut à ceci : ```{r, purl=TRUE} rna %>% group_by(infection, time) %>% summarise(n = n()) ``` -It is sometimes useful to sort the result to facilitate the comparisons. -We can use `arrange()` to sort the table. -For instance, we might want to arrange the table above by time: +Il est parfois utile de trier le résultat pour faciliter les comparaisons. +Nous pouvons utiliser `arrange()` pour trier le tableau. +Par exemple, nous pourrions vouloir trier le tableau ci-dessus par temps : ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(time) ``` -or by counts: +ou par comptages : ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(n) ``` -To sort in descending order, we need to add the `desc()` function: +Pour trier par ordre décroissant, nous devons ajouter la fonction `desc()` : ```{r, purl=TRUE} rna %>% count(infection, time) %>% arrange(desc(n)) ``` ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -1. How many genes were analysed in each sample? -2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? -3.
Pick one sample and evaluate the number of genes by biotype. -4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. +1. Combien de gènes ont été analysés dans chaque échantillon ? +2. Utilisez `group_by()` et `summarise()` pour évaluer la profondeur de séquençage (la somme de tous les comptages) dans chaque échantillon. Quel échantillon a la profondeur de séquençage la plus élevée ? +3. Choisissez un échantillon et évaluez le nombre de gènes par biotype. +4. Identifiez les gènes associés à la description de phénotype "abnormal DNA methylation" et calculez leur expression moyenne (en log) au temps 0, au temps 4 et au temps 8. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -524,78 +524,78 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Reshaping data +## Remodeler les données -In the `rna` tibble, the rows contain expression values (the unit) that are -associated with a combination of 2 other variables: `gene` and `sample`. +Dans le tibble `rna`, les lignes contiennent des valeurs d'expression (l'unité) qui sont +associées à une combinaison de 2 autres variables : `gene` et `sample`. -All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). -The variables that don't change with genes or with samples will have the same value in all the rows. +Toutes les autres colonnes correspondent à des variables décrivant soit +l'échantillon (organism, age, sex, ...) soit le gène (gene\_biotype, ENTREZ\_ID, product, ...). +Les variables qui ne changent pas avec les gènes ou avec les échantillons auront la même valeur dans toutes les lignes.
```{r} -rna %>% - arrange(gene) +arn %>% + arranger(gène) ``` -This structure is called a `long-format`, as one column contains all the values, -and other column(s) list(s) the context of the value. +Cette structure est appelée « format long », car une colonne contient toutes les valeurs, +et d'autres colonnes répertorient le contexte de la valeur. -In certain cases, the `long-format` is not really "human-readable", and another format, -a `wide-format` is preferred, as a more compact way of representing the data. +Dans certains cas, le « format long » n'est pas vraiment « lisible par l'homme », et un autre format, +un « format large » est préféré, comme manière plus compacte de représenter les données. This is typically the case with gene expression values that scientists are used to look as matrices, were rows represent genes and columns represent samples. -In this format, it would therefore become straightforward -to explore the relationship between the gene expression levels within, and -between, the samples. +Dans ce format, il deviendrait donc simple +d'explorer la relation entre les niveaux d'expression génique au sein et +entre les échantillons. ```{r, echo=FALSE} rna %>% - select(gene, sample, expression) %>% - pivot_wider(names_from = sample, - values_from = expression) + select(gène, échantillon, expression) %>% + pivot_wider(names_from = échantillon, + valeurs_from = expression) ``` -To convert the gene expression values from `rna` into a wide-format, -we need to create a new table where the values of the `sample` column would -become the names of column variables. +Pour convertir les valeurs d'expression génique de `rna` en un format large, +nous devons créer une nouvelle table où les valeurs de la colonne `sample` deviendraient +les noms des variables de colonne. 
The key point here is that we are still following a tidy data structure, but we have **reshaped** the data according to the observations of interest: expression levels per gene instead of recording them per gene and per sample. -The opposite transformation would be to transform column names into -values of a new variable. +La transformation inverse serait de transformer les noms de colonnes en valeurs +d'une nouvelle variable. -We can do both these of transformations with two `tidyr` functions, -`pivot_longer()` and `pivot_wider()` (see -[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for -details). +Nous pouvons effectuer ces deux transformations avec deux fonctions `tidyr`, +`pivot_longer()` et `pivot_wider()` (voir +[ici](https://tidyr.tidyverse.org/dev/articles/pivot.html) pour plus de +détails). -### Pivoting the data into a wider format +### Pivoter les données dans un format plus large -Let's select the first 3 columns of `rna` and use `pivot_wider()` -to transform the data into a wide-format. +Sélectionnons les 3 premières colonnes de `rna` et utilisons `pivot_wider()` +pour transformer les données en format large. ```{r, purl=TRUE} rna_exp <- rna %>% select(gene, sample, expression) rna_exp ``` -`pivot_wider` takes three main arguments: +`pivot_wider` prend trois arguments principaux : -1. the data to be transformed; -2. the `names_from` : the column whose values will become new column - names; -3. the `values_from`: the column whose values will fill the new - columns. +1. les données à transformer ; +2. `names_from` : la colonne dont les valeurs deviendront les nouveaux + noms de colonnes ; +3. `values_from` : la colonne dont les valeurs rempliront les nouvelles + colonnes.
-\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="Pivot large des données `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") ```` @@ -607,10 +607,10 @@ rna_wide <- rna_exp %>% rna_wide ```` -Note that by default, the `pivot_wider()` function will add `NA` for missing values. +Notez que par défaut, la fonction `pivot_wider()` ajoutera `NA` pour les valeurs manquantes. -Let's imagine that for some reason, we had some missing expression values for some -genes in certain samples. In the following fictive example, the gene Cyp2d22 has only +Imaginons que, pour une raison quelconque, nous ayons des valeurs d'expression manquantes pour certains gènes +dans certains échantillons. In the following fictive example, the gene Cyp2d22 has only one expression value, in GSM2545338 sample. ```{r, purl=TRUE} @@ -623,39 +623,39 @@ rna_with_missing_values <- rna %>% rna_with_missing_values ``` -By default, the `pivot_wider()` function will add `NA` for missing -values. This can be parameterised with the `values_fill` argument of -the `pivot_wider()` function. +Par défaut, la fonction `pivot_wider()` ajoutera `NA` pour les valeurs +manquantes. Ce comportement peut être modifié avec l'argument `values_fill` de +la fonction `pivot_wider()`. ```{r, purl=TRUE} rna_with_missing_values %>% pivot_wider(names_from = sample, values_from = expression) rna_with_missing_values %>% pivot_wider(names_from = sample, values_from = expression, values_fill = 0) ``` -### Pivoting data into a longer format +### Pivoter les données dans un format plus long -In the opposite situation we are using the column names and turning them into -a pair of new variables. One variable represents the column names as -values, and the other variable contains the values previously -associated with the column names.
+Dans la situation inverse, nous utilisons les noms de colonnes et les transformons en +une paire de nouvelles variables. Une variable représente les noms de colonnes sous forme de valeurs, +et l'autre variable contient les valeurs précédemment +associées aux noms de colonnes. -`pivot_longer()` takes four main arguments: +`pivot_longer()` prend quatre arguments principaux : -1. the data to be transformed; -2. the `names_to`: the new column name we wish to create and populate with the - current column names; -3. the `values_to`: the new column name we wish to create and populate with - current values; -4. the names of the columns to be used to populate the `names_to` and - `values_to` variables (or to drop). +1. les données à transformer ; +2. `names_to` : le nom de la nouvelle colonne que nous souhaitons créer et remplir avec les + noms de colonnes actuels ; +3. `values_to` : le nom de la nouvelle colonne que nous souhaitons créer et remplir avec + les valeurs actuelles ; +4. les noms des colonnes à utiliser pour renseigner les variables `names_to` et + `values_to` (ou à exclure). -\`\`\`{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="Pivot long des données `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") ```` @@ -675,28 +675,28 @@ rna_long <- rna_wide %>% rna_long ```` -We could also have used a specification for what columns to -include. This can be useful if you have a large number of identifying -columns, and it's easier to specify what to gather than what to leave -alone. Here the `starts_with()` function can help to retrieve sample -names without having to list them all! -Another possibility would be to use the `:` operator! +Nous aurions également pu utiliser une spécification indiquant les colonnes à +inclure.
Cela peut être utile si vous disposez d'un grand nombre de colonnes d'identification, +et qu'il est plus facile de spécifier ce qu'il faut rassembler que ce qu'il faut laisser +de côté. Ici, la fonction `starts_with()` peut aider à récupérer les noms d'échantillons +sans avoir à tous les lister ! +Une autre possibilité serait d'utiliser l'opérateur `:` ! ```{r} rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", cols = starts_with("GSM")) rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", GSM2545336:GSM2545380) ``` -Note that if we had missing values in the wide-format, the `NA` would be -included in the new long format. +Notez que si nous avions des valeurs manquantes dans le format large, les `NA` seraient +inclus dans le nouveau format long. -Remember our previous fictive tibble containing missing values: +Souvenez-vous de notre précédent tibble fictif contenant des valeurs manquantes : ```{r} rna_with_missing_values @@ -712,113 +712,113 @@ wide_with_NA %>% -gene) ``` -Pivoting to wider and longer formats can be a useful way to balance out a dataset -so every replicate has the same composition. +Passer à des formats plus larges et plus longs peut être un moyen utile d'équilibrer un jeu de données +afin que chaque réplicat ait la même composition. ::::::::::::::::::::::::::::::::::::::: challenge ## Question -Starting from the rna table, use the `pivot_wider()` function to create -a wide-format table giving the gene expression levels in each mouse. -Then use the `pivot_longer()` function to restore a long-format table. +À partir de la table `rna`, utilisez la fonction `pivot_wider()` pour créer +un tableau au format large donnant les niveaux d'expression génique chez chaque souris.
+Utilisez ensuite la fonction `pivot_longer()` pour restaurer un tableau au format long. -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, answer=TRUE, purl=TRUE} rna1 <- rna %>% select(gene, mouse, expression) %>% pivot_wider(names_from = mouse, values_from = expression) rna1 rna1 %>% pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge ## Question -Subset genes located on X and Y chromosomes from the `rna` data frame and -spread the data frame with `sex` as columns, `chromosome_name` as -rows, and the mean expression of genes located in each chromosome as the values, -as in the following tibble: +Extrayez du bloc de données `rna` les gènes situés sur les chromosomes X et Y et +étalez le bloc de données avec `sex` en colonnes, `chromosome_name` en +lignes et l'expression moyenne des gènes situés sur chaque chromosome comme valeurs, +comme dans le tibble suivant : ```{r, echo=FALSE, message=FALSE} knitr::include_graphics("fig/Exercise_pivot_W.png") ``` -You will need to summarise before reshaping! +Il faudra résumer avant de remodeler ! -::::::::::::::: solution +::::::::::::::: solution ## Solution -Let's first calculate the mean expression level of X and Y linked genes from -male and female samples... +Calculons d'abord le niveau d'expression moyen des gènes liés à X et à Y dans les +échantillons mâles et femelles...
```{r} rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% summarise(mean = mean(expression)) ``` -And pivot the table to wide format +Et faites pivoter le tableau au format large ```{r, answer=TRUE, purl=TRUE} rna_1 <- rna %>% filter(chromosome_name == "Y" | chromosome_name == "X") %>% group_by(sex, chromosome_name) %>% summarise(mean = mean(expression)) %>% pivot_wider(names_from = sex, values_from = mean) rna_1 ``` -Now take that data frame and transform it with `pivot_longer()` so -each row is a unique `chromosome_name` by `gender` combination. +Maintenant, prenez ce bloc de données et transformez-le avec `pivot_longer()` afin que +chaque ligne soit une combinaison unique de `chromosome_name` et de `gender`. ```{r, answer=TRUE, purl=TRUE} rna_1 %>% pivot_longer(names_to = "gender", values_to = "mean", -chromosome_name) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge ## Question -Use the `rna` dataset to create an expression matrix where each row -represents the mean expression levels of genes and columns represent -the different timepoints. +Utilisez le jeu de données `rna` pour créer une matrice d'expression où chaque ligne +représente les niveaux d'expression moyens des gènes et les colonnes représentent +les différents points temporels.
-::::::::::::::: solution +::::::::::::::: solution ## Solution -Let's first calculate the mean expression by gene and by time +Calculons d'abord l'expression moyenne par gène et par temps ```{r} rna %>% group_by(gene, time) %>% summarise(mean_exp = mean(expression)) ``` -before using the pivot\_wider() function +avant d'utiliser la fonction pivot\_wider() ```{r} rna_time <- rna %>% @@ -829,9 +829,9 @@ rna_time <- rna %>% rna_time ``` -Notice that this generates a tibble with some column names starting by a number. -If we wanted to select the column corresponding to the timepoints, -we could not use the column names directly... What happens when we select the column 4? +Notez que cela génère un tibble dont certains noms de colonnes commencent par un chiffre. +Si nous voulions sélectionner la colonne correspondant aux points temporels, +nous ne pourrions pas utiliser directement les noms de colonnes... Que se passe-t-il lorsque l'on sélectionne la colonne 4 ?
```{r} rna %>% @@ -842,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\\`" +Pour sélectionner le timepoint 4, il faudrait citer le nom de la colonne, avec des backticks "\\`" ```{r} rna %>% @@ -853,8 +853,8 @@ rna %>% select(gene, `4`) ``` -Another possibility would be to rename the column, -choosing a name that doesn't start by a number : +Une autre possibilité serait de renommer la colonne, +en choisissant un nom qui ne commence pas par un chiffre : ```{r} rna %>% @@ -868,35 +868,35 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi ## Question -Use the previous data frame containing mean expression levels per timepoint and create -a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes -between timepoint 8 and timepoint 4. -Convert this table into a long-format table gathering the fold-changes calculated. +Utilisez la trame de données précédente contenant les niveaux d'expression moyens par point temporel et créez +une nouvelle colonne contenant les changements de pli entre le point temporel 8 et le point temporel 0, et les changements de pli +entre le point temporel 8 et le point temporel 4. +Convertissez ce tableau en un tableau au format long regroupant les changements de pli calculés. 
-::::::::::::::: solution +::::::::::::::: solution ## Solution -Starting from the rna\_time tibble: +À partir du tibble rna\_time : ```{r} -rna_time +arn_time ``` -Calculate fold-changes: +Calculer les changements de plis : ```{r} rna_time %>% - mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) + muter (time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` -And use the pivot\_longer() function: +Et utilisez la fonction pivot\_longer() : ```{r} rna_time %>% @@ -908,43 +908,43 @@ rna_time %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Joining tables +## Joindre des tables -In many real life situations, data are spread across multiple tables. -Usually this occurs because different types of information are -collected from different sources. +Dans de nombreuses situations réelles, les données sont réparties sur plusieurs tables. +Cela se produit généralement parce que différents types d’informations sont +collectés à partir de différentes sources. -It may be desirable for some analyses to combine data from two or more -tables into a single data frame based on a column that would be common -to all the tables. +Il peut être souhaitable que certaines analyses combinent les données de deux ou plusieurs tables +en une seule trame de données basée sur une colonne qui serait commune +à toutes les tables. -The `dplyr` package provides a set of join functions for combining two -data frames based on matches within specified columns. Here, we -provide a short introduction to joins. For further reading, please -refer to the chapter about table -joins. The +Le package `dplyr` fournit un ensemble de fonctions de jointure pour combiner deux trames de données +basées sur des correspondances dans des colonnes spécifiées. Ici, nous +fournissons une brève introduction aux jointures. Pour en savoir plus, veuillez +vous référer au chapitre sur les jointures de table +. 
La Data Transformation Cheat Sheet -also provides a short overview on table joins. +fournit également un bref aperçu sur les jointures de table. -We are going to illustrate join using a small table, `rna_mini` that -we will create by subsetting the original `rna` table, keeping only 3 -columns and 10 lines. +Nous allons illustrer la jointure en utilisant une petite table, `rna_mini` que +nous allons créer en sous-définissant la table `rna` d'origine, en ne gardant que 3 +colonnes et 10 lignes. ```{r} rna_mini <- rna %>% - select(gene, sample, expression) %>% + select(gène, échantillon, expression) %>% head(10) rna_mini ``` -The second table, `annot1`, contains 2 columns, gene and -gene\_description. You can either -[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) -by clicking on the link and then moving it to the `data/` folder, or -you can use the R code below to download it directly to the folder. +Le deuxième tableau, `annot1`, contient 2 colonnes, gene et +gene\_description. Vous pouvez soit +[télécharger annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +en cliquant sur le lien puis en vous déplaçant dans le dossier `data/`, ou +vous pouvez utiliser le code R ci-dessous pour le télécharger directement dans le dossier. ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", @@ -953,23 +953,23 @@ annot1 <- read_csv(file = "data/annot1.csv") annot1 ``` -We now want to join these two tables into a single one containing all -variables using the `full_join()` function from the `dplyr` package. The -function will automatically find the common variable to match columns -from the first and second table. In this case, `gene` is the common -variable. Such variables are called keys. Keys are used to match -observations across different tables. 
+Nous voulons maintenant joindre ces deux tables en une seule contenant toutes les +variables en utilisant la fonction `full_join()` du package `dplyr`. La fonction +trouvera automatiquement la variable commune correspondant aux colonnes +de la première et de la deuxième table. Dans ce cas, « gène » est la variable commune +. Ces variables sont appelées clés. Les clés sont utilisées pour faire correspondre +observations dans différentes tables. ```{r} full_join(rna_mini, annot1) ``` -In real life, gene annotations are sometimes labelled differently. +Dans la vraie vie, les annotations génétiques sont parfois étiquetées différemment. -The `annot2` table is exactly the same than `annot1` except that the -variable containing gene names is labelled differently. Again, either -[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) -yourself and move it to `data/` or use the R code below. +La table `annot2` est exactement la même que `annot1` sauf que la variable +contenant les noms de gènes est étiquetée différemment. Encore une fois, soit +[télécharger annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +vous-même et déplacez-le vers `data/ ` ou utilisez le code R ci-dessous. ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", @@ -978,28 +978,28 @@ annot2 <- read_csv(file = "data/annot2.csv") annot2 ``` -In case none of the variable names match, we can set manually the -variables to use for the matching. These variables can be set using -the `by` argument, as shown below with `rna_mini` and `annot2` tables. +Si aucun des noms de variables ne correspond, nous pouvons définir manuellement les variables +à utiliser pour la correspondance. Ces variables peuvent être définies en utilisant +l'argument `by`, comme indiqué ci-dessous avec les tables `rna_mini` et `annot2`. 
```{r} full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) ``` -As can be seen above, the variable name of the first table is retained -in the joined one. +Comme on peut le voir ci-dessus, le nom de variable de la première table est conservé +dans celle jointe. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge: +## Défi: -Download the `annot3` table by clicking -[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) -and put the table in your data/ repository. Using the `full_join()` -function, join tables `rna_mini` and `annot3`. What has happened for -genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_ ? +Téléchargez la table `annot3` en cliquant sur +[ici](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +et placez la table dans votre Dépôt de données. À l'aide de la fonction `full_join()` +, joignez les tables `rna_mini` et `annot3`. Que s'est-il passé pour les gènes +_Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_ et _mt-Tl1_ ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -1008,40 +1008,40 @@ annot3 <- read_csv("data/annot3.csv") full_join(rna_mini, annot3) ``` -Genes _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, -_mt-Rnr2_, and _mt-Tl1_ are only present in `annot3` table. Their respective values for the -variables of the table have been encoded as missing. +Les gènes _Klk6_ ne sont présents que dans `rna_mini`, tandis que les gènes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, +_mt-Rnr2_ et _mt-Tl1_ sont présent uniquement dans la table `annot3`. Leurs valeurs respectives pour les variables +du tableau ont été codées comme manquantes. 
::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Exporting data +## Exporter des données -Now that you have learned how to use `dplyr` to extract information from -or summarise your raw data, you may want to export these new data sets to share -them with your collaborators or for archival. +Maintenant que vous avez appris à utiliser `dplyr` pour extraire des informations de +ou résumer vos données brutes, vous souhaiterez peut-être exporter ces nouveaux ensembles de données pour les partager +avec vos collaborateurs ou pour les archiver. -Similar to the `read_csv()` function used for reading CSV files into R, there is -a `write_csv()` function that generates CSV files from data frames. +Semblable à la fonction `read_csv()` utilisée pour lire les fichiers CSV dans R, il existe +une fonction `write_csv()` qui génère des fichiers CSV à partir de trames de données. -Before using `write_csv()`, we are going to create a new folder, `data_output`, -in our working directory that will store this generated dataset. We don't want -to write generated datasets in the same directory as our raw data. -It's good practice to keep them separate. The `data` folder should only contain -the raw, unaltered data, and should be left alone to make sure we don't delete -or modify it. In contrast, our script will generate the contents of the `data_output` -directory, so even if the files it contains are deleted, we can always -re-generate them. +Avant d'utiliser `write_csv()`, nous allons créer un nouveau dossier, `data_output`, +dans notre répertoire de travail qui stockera cet ensemble de données généré. Nous ne voulons pas que +écrive les ensembles de données générés dans le même répertoire que nos données brutes. +C'est une bonne pratique de les garder séparés. 
Le dossier `data` ne doit contenir que +les données brutes et non modifiées, et doit être laissé seul pour nous assurer que nous ne supprimons pas +ou ne le modifions pas. En revanche, notre script générera le contenu du répertoire `data_output` +, donc même si les fichiers qu'il contient sont supprimés, nous pouvons toujours +les régénérer. -Let's use `write_csv()` to save the rna\_wide table that we have created previously. +Utilisons `write_csv()` pour sauvegarder la table rna\_wide que nous avons créée précédemment. ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: points clés -- Tabular data in R using the tidyverse meta-package +- Données tabulaires dans R utilisant le méta-paquet Tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: From 50b5990909b6b9c506e645915db02b695df51f6c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:41 +0900 Subject: [PATCH 202/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 1014 +++++++++++++++---------------- 1 file changed, 507 insertions(+), 507 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index b50395a63..6dc64f731 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- -source: Rmd -title: Manipulating and analysing data with dplyr +source: 放射科 +title: 使用 dplyr 处理和分析数据 teaching: 75 exercises: 75 --- @@ -8,497 +8,497 @@ exercises: 75 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectives +:::::::::::::::::::::::::::::::::::::::::: 目标 -- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. -- Describe several of their functions that are extremely useful to - manipulate data. 
-- Describe the concept of a wide and a long table format, and see - how to reshape a data frame from one format to the other one. -- Demonstrate how to join tables. +- 描述\*\*`dplyr`**和**`tidyr`\*\*包的用途。 +- 描述一些对于 + 操作数据极其有用的函数。 +- 描述宽表和长表格式的概念,并了解 + 如何将数据框从一种格式重塑为另一种格式。 +- 演示如何连接表格。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: 问题 -- Data analysis in R using the tidyverse meta-package +- 使用 tidyverse 元包在 R 中进行数据分析 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} -if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") +如果(!file.exists(“data/rnaseq.csv”)) +下载.file(url = “https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv”, + 目标文件 = “data/rnaseq.csv”) ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 +> 可视化_课程。 -## Data manipulation using **`dplyr`** and **`tidyr`** +## 使用 **`dplyr`** 和 **`tidyr`** 进行数据操作 Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. -Some packages can greatly facilitate our task when we manipulate data. -Packages in R are basically sets of additional functions that let you -do more stuff. The functions we've been using so far, like `str()` or -`data.frame()`, come built into R; Loading packages can give you access to other -specific functions. Before you use a package for the first time you need to install -it on your machine, and then you should import it in every subsequent -R session when you need it. 
+当我们处理数据时,一些包可以极大地方便我们的任务。 +R 中的包基本上是一组附加函数,可让您 +做更多的事情。 我们迄今为止使用的函数,如 `str()` 或 +`data.frame()`,都是 R 内置的;加载包可以让您访问其他 +特定函数。 第一次使用包之前,您需要在您的机器上安装 +它,然后您应该在需要它时在每个后续的 +R 会话中导入它。 -- The package **`dplyr`** provides powerful tools for data manipulation tasks. - It is built to work directly with data frames, with many manipulation tasks - optimised. +- 包\*\*`dplyr`\*\*为数据操作任务提供了强大的工具。 + 它被构建为直接与数据框一起工作,并且许多操作任务 + 已经进行了优化。 -- As we will see latter on, sometimes we want a data frame to be reshaped to be able - to do some specific analyses or for visualisation. The package **`tidyr`** addresses - this common problem of reshaping data and provides tools for manipulating - data in a tidy way. +- 正如我们稍后会看到的,有时我们希望重塑数据框以便能够 + 进行一些特定的分析或进行可视化。 包\*\*`tidyr`\*\*解决了 + 这个常见的数据重塑问题,并提供了以整洁的方式操作 + 数据的工具。 -To learn more about **`dplyr`** and **`tidyr`** after the workshop, -you may want to check out this handy data transformation with +如果要在研讨会结束后了解有关 **`dplyr`** 和 **`tidyr`** 的更多信息, +您可能需要查看这个 使用 -and this one about -. +和这个 关于 +。 -- The **`tidyverse`** package is an "umbrella-package" that installs - several useful packages for data analysis which work well together, - such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. - These packages help us to work and interact with the data. - They allow us to do many things with your data, such as subsetting, transforming, - visualising, etc. +- **`tidyverse`** 包是一个“总括包”,它安装了 + 几个用于数据分析的有用包,它们可以很好地协同工作, + 例如 **`tidyr`**、**`dplyr`**、**`ggplot2`**、**`tibble`** 等。 + 这些包帮助我们处理数据并与之交互。 + 它们允许我们对您的数据做很多事情,例如子集化、转换、 + 可视化等。 -If you did the set up, you should have already installed the tidyverse package. -Check to see if you have it by trying to load in from the library: +如果您已完成设置,则应该已经安装了 tidyverse 包。 +尝试从库中加载以检查您是否拥有它: ```{r, message=FALSE, purl=TRUE} -## load the tidyverse packages, incl. 
dplyr +## 加载 tidyverse 包,包括 dplyr library("tidyverse") ``` -If you got an error message `there is no package called ‘tidyverse’` then you have not -installed the package yet for this version of R. To install the **`tidyverse`** package type: +如果您收到错误消息“没有名为‘tidyverse’的包”,那么您尚未 +为此版本的 R 安装该包。要安装\*\*`tidyverse`\*\*包类型: ```{r, eval=FALSE, purl=TRUE} -BiocManager::install("tidyverse") +BiocManager::install(“tidyverse”) ``` -If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! +如果您必须安装\*\*`tidyverse`\*\*包,请不要忘记使用上面的`library()`命令在此 R 会话中加载它! -## Loading data with tidyverse +## 使用 tidyverse 加载数据 -Instead of `read.csv()`, we will read in our data using the `read_csv()` -function (notice the `_` instead of the `.`), from the tidyverse package -**`readr`**. +我们不使用 `read.csv()`,而是使用 `read_csv()` +函数(注意用 `_` 而不是 `.`)读取数据,该函数来自 tidyverse 包 +**`readr`**。 ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## view the data +## 查看数据 rna ``` -Notice that the class of the data is now referred to as a "tibble". +请注意,数据类别现在被称为“tibble”。 -Tibbles tweak some of the behaviors of the data frame objects we introduced in the -previously. The data structure is very similar to a data frame. For our purposes -the only differences are that: +Tibbles 调整了我们之前在 +中介绍的数据框对象的一些行为。 该数据结构与数据框非常相似。 对于我们的目的 +来说,唯一的区别是: -1. It displays the data type of each column under its name. - Note that \<`dbl`\> is a data type defined to hold numeric values with - decimal points. +1. 它在名称下显示每列的数据类型。 + 请注意,\<`dbl`\> 是一种数据类型,定义为保存带有 + 小数点的数值。 -2. It only prints the first few rows of data and only as many columns as fit on - one screen. +2. 
它仅打印前几行数据,并且仅打印适合 + 一个屏幕的列数。 -We are now going to learn some of the most common **`dplyr`** functions: +我们现在要学习一些最常见的 **`dplyr`** 函数: -- `select()`: subset columns -- `filter()`: subset rows on conditions -- `mutate()`: create new columns by using information from other columns -- `group_by()` and `summarise()`: create summary statistics on grouped data -- `arrange()`: sort results -- `count()`: count discrete values +- `select()`:子集列 +- `filter()`:根据条件将行设为子集 +- `mutate()`:使用其他列的信息创建新列 +- `group_by()` 和 `summarise()`:对分组数据创建汇总统计数据 +- `arrange()`:对结果进行排序 +- `count()`:计数离散值 -## Selecting columns and filtering rows +## 选择列和过滤行 -To select columns of a data frame, use `select()`. The first argument -to this function is the data frame (`rna`), and the subsequent -arguments are the columns to keep. +要选择数据框的列,请使用“select()”。 该函数的第一个参数 +是数据框(`rna`),后续的 +参数是需要保留的列。 ```{r, purl=TRUE} -select(rna, gene, sample, tissue, expression) +选择(rna、基因、样本、组织、表达) ``` -To select all columns _except_ certain ones, put a "-" in front of -the variable to exclude it. +要选择除某些列之外的所有列,请在变量 +前面放置“-”以将其排除。 ```{r, purl=TRUE} -select(rna, -tissue, -organism) +选择(rna,-组织,-生物体) ``` -This will select all the variables in `rna` except `tissue` -and `organism`. +这将选择“rna”中除“tissue” +和“organism”之外的所有变量。 -To choose rows based on a specific criteria, use `filter()`: +要根据特定标准选择行,请使用“filter()”: ```{r, purl=TRUE} -filter(rna, sex == "Male") -filter(rna, sex == "Male" & infection == "NonInfected") +过滤器(rna,性别 == “男性”) +过滤器(rna,性别 == “男性” & 感染 == “未感染”) ``` -Now let's imagine we are interested in the human homologs of the mouse -genes analysed in this dataset. This information can be found in the -last column of the `rna` tibble, named -`hsapiens_homolog_associated_gene_name`. To visualise it easily, we -will create a new table containing just the 2 columns `gene` and -`hsapiens_homolog_associated_gene_name`. 
+现在让我们假设我们对该数据集中分析的小鼠 +基因的人类同源物感兴趣。 该信息可以在 `rna` tibble 的 +最后一列中找到,名为 +`hsapiens_homolog_associated_gene_name`。 为了轻松地将其形象化,我们 +将创建一个新表,仅包含 2 列“基因”和 +“hsapiens_homolog_associated_gene_name”。 ```{r} -genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) -genes +基因 <- 选择(rna,基因,hsapiens_homolog_associated_gene_name) +基因 ``` -Some mouse genes have no human homologs. These can be retrieved using -`filter()` and the `is.na()` function, that determines whether -something is an `NA`. +一些小鼠基因没有人类同源物。 可以使用 +`filter()` 和 `is.na()` 函数检索这些,确定 +某物是否为 `NA`。 ```{r, purl=TRUE} -filter(genes, is.na(hsapiens_homolog_associated_gene_name)) +过滤器(基因,is.na(hsapiens_homolog_associated_gene_name)) ``` -If we want to keep only mouse genes that have a human homolog, we can -insert a "!" symbol that negates the result, so we're asking for -every row where hsapiens\_homolog\_associated\_gene\_name _is not_ an -`NA`. +如果我们只想保留具有人类同源物的小鼠基因,我们可以在 +中插入一个“!”符号来否定结果,因此我们要求在 hsapiens\_homolog\_associated\_gene\_name _不是_ +`NA` 的每一行中都为 +。 ```{r, purl=TRUE} -filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) +过滤器(基因,!is.na(hsapiens_homolog_associated_gene_name)) ``` -## Pipes +## 管道 -What if you want to select and filter at the same time? There are three -ways to do this: use intermediate steps, nested functions, or pipes. +如果您想同时选择和过滤怎么办? 有三种 +方法可以做到这一点:使用中间步骤、嵌套函数或管道。 -With intermediate steps, you create a temporary data frame and use -that as input to the next function, like this: +通过中间步骤,您可以创建一个临时数据框并使用 +作为下一个函数的输入,如下所示: ```{r, purl=TRUE} rna2 <- filter(rna, sex == "Male") -rna3 <- select(rna2, gene, sample, tissue, expression) +rna3 <- select(rna2, 基因, 样本, 组织, 表达) rna3 ``` -This is readable, but can clutter up your workspace with lots of -intermediate objects that you have to name individually. With multiple -steps, that can be hard to keep track of. +这是可读的,但会使您的工作区变得混乱,因为有大量的 +中间对象需要您单独命名。 由于有多个 +步骤,因此很难跟踪。 -You can also nest functions (i.e. 
one function inside of another), -like this: +您还可以嵌套函数(即一个函数位于另一个函数内), +如下所示: ```{r, purl=TRUE} -rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) +rna3 <- select(filter(rna, sex == "Male"), 基因, 样本, 组织, 表达) rna3 ``` -This is handy, but can be difficult to read if too many functions are nested, as -R evaluates the expression from the inside out (in this case, filtering, then selecting). +这很方便,但如果嵌套的函数太多,可能会难以阅读,因为 +R 从内到外评估表达式(在本例中,先过滤,然后选择)。 -The last option, _pipes_, are a recent addition to R. Pipes let you take -the output of one function and send it directly to the next, which is useful -when you need to do many things to the same dataset. +最后一个选项 _管道_ 是最近添加到 R 中的。管道让您可以将 +一个函数的输出直接发送到下一个函数,当您需要对同一个数据集执行许多操作时,这很有用 +。 -Pipes in R look like `%>%` (made available via the **`magrittr`** -package) or `|>` (through base R). If you use RStudio, you can type -the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you -have a PC or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> if you -have a Mac. +R 中的管道看起来像 `%>%`(通过\*\*`magrittr`\*\* +包提供)或 `|>`(通过基础 R)。 如果您使用 RStudio,则可以在管道中输入 +和 <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> (如果您 +有 PC)或 <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> (如果您 +有 Mac)。 -In the above code, we use the pipe to send the `rna` dataset first -through `filter()` to keep rows where `sex` is Male, then through -`select()` to keep only the `gene`, `sample`, `tissue`, and -`expression`columns. +在上面的代码中,我们使用管道首先将 `rna` 数据集 +通过 `filter()` 发送以保留 `sex` 为男性的行,然后通过 +`select()` 仅保留 `gene`、`sample`、`tissue` 和 +`expression` 列。 -The pipe `%>%` takes the object on its left and passes it directly as -the first argument to the function on its right, we don't need to -explicitly include the data frame as an argument to the `filter()` and -`select()` functions any more. 
+管道 `%>%` 将其左侧的对象直接作为 +传递给其右侧的函数的第一个参数,我们不再需要 +明确将数据框作为 `filter()` 和 +`select()` 函数的参数。 ```{r, purl=TRUE} rna %>% filter(sex == "Male") %>% - select(gene, sample, tissue, expression) + select(基因,样本,组织,表达) ``` -Some may find it helpful to read the pipe like the word "then". For instance, -in the above example, we took the data frame `rna`, _then_ we `filter`ed -for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, -`tissue`, and `expression`. +有些人可能会发现将管道读成“then”这个词很有帮助。 例如,在上面的例子中, +我们取数据框 `rna`,_然后_我们 `过滤` +中含有 `sex == "Male"` 的行,_然后_我们 `选择` 列 `gene`、`sample`、 +`tissue` 和 `expression`。 -The **`dplyr`** functions by themselves are somewhat simple, but by -combining them into linear workflows with the pipe, we can accomplish -more complex manipulations of data frames. +**`dplyr`** 函数本身有些简单,但通过 +将它们与管道组合成线性工作流,我们可以完成 +更复杂的数据框操作。 -If we want to create a new object with this smaller version of the data, we -can assign it a new name: +如果我们想用这个较小版本的数据创建一个新对象,我们 +可以给它分配一个新名称: ```{r, purl=TRUE} rna3 <- rna %>% filter(sex == "Male") %>% - select(gene, sample, tissue, expression) + select(gene, sample, organization, expression) rna3 ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Using pipes, subset the `rna` data to keep observations in female mice at time 0, -where the gene has an expression higher than 50000, and retain only the columns -`gene`, `sample`, `time`, `expression` and `age`. 
+使用管道,对 `rna` 数据进行子集化,以保留时间 0 时雌性小鼠的观察结果, +其中基因表达高于 50000,并且仅保留列 +`基因`、`样本`、`时间`、`表达` 和 `年龄`。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r} rna %>% - filter(expression > 50000, - sex == "Female", - time == 0 ) %>% - select(gene, sample, time, expression, age) + 过滤器(表达 > 50000、 + 性别 == “女性”、 + 时间 == 0 )%>% + 选择(基因、样本、时间、表达、年龄) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Mutate +## 合变 -Frequently you'll want to create new columns based on the values of existing -columns, for example to do unit conversions, or to find the ratio of values in two -columns. For this we'll use `mutate()`. +您经常需要根据现有 +列的值创建新列,例如进行单位转换,或者查找两个 +列中的值之比。 为此,我们将使用“mutate()”。 -To create a new column of time in hours: +要创建以小时为单位的新时间列: ```{r, purl=TRUE} rna %>% - mutate(time_hours = time * 24) %>% - select(time, time_hours) + 突变(time_hours = time * 24)%>% + 选择(时间,time_hours) ``` -You can also create a second new column based on the first new column within the same call of `mutate()`: +您还可以在同一个 `mutate()` 调用中根据第一个新列创建第二个新列: ```{r, purl=TRUE} rna %>% - mutate(time_hours = time * 24, - time_mn = time_hours * 60) %>% - select(time, time_hours, time_mn) + 突变(time_hours = time * 24, + time_mn = time_hours * 60)%>% + 选择(时间,time_hours,time_mn) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Create a new data frame from the `rna` data that meets the following -criteria: contains only the `gene`, `chromosome_name`, -`phenotype_description`, `sample`, and `expression` columns. The expression -values should be log-transformed. This data frame must -only contain genes located on sex chromosomes, associated with a -phenotype\_description, and with a log expression higher than 5. 
+从满足以下 +条件的`rna`数据中创建一个新的数据框:仅包含`gene`、`chromosome_name`、 +`phenotype_description`、`sample`和`expression`列。 表达式 +值应该进行对数变换。 该数据框必须 +仅包含位于性染色体上、与 +表型描述相关且对数表达高于 5 的基因。 -**Hint**: think about how the commands should be ordered to produce -this data frame! +**提示**:思考一下应该如何排列命令来生成 +这个数据框! -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, eval=TRUE, purl=TRUE} rna %>% - mutate(expression = log(expression)) %>% - select(gene, chromosome_name, phenotype_description, sample, expression) %>% - filter(chromosome_name == "X" | chromosome_name == "Y") %>% - filter(!is.na(phenotype_description)) %>% - filter(expression > 5) + 突变(表达式 = log(表达式))%>% + 选择(基因、染色体名称、表型描述、样本、表达)%>% + 过滤器(染色体名称 == “X” | 染色体名称 == “Y”)%>% + 过滤器(!is.na(表型描述))%>% + 过滤器(表达式 > 5) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Split-apply-combine data analysis +## 拆分-应用-合并数据分析 -Many data analysis tasks can be approached using the -_split-apply-combine_ paradigm: split the data into groups, apply some -analysis to each group, and then combine the results. **`dplyr`** -makes this very easy through the use of the `group_by()` function. +许多数据分析任务可以使用 +_split-apply-combine_ 范式来完成:将数据分成组,对每组应用一些 +分析,然后合并结果。 **`dplyr`** +通过使用 `group_by()` 函数使这变得非常容易。 ```{r} rna %>% - group_by(gene) + group_by(基因) ``` -The `group_by()` function doesn't perform any data processing, it -groups the data into subsets: in the example above, our initial -`tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$gene))` groups based on the `gene` variable. 
+`group_by()` 函数不执行任何数据处理,它 +将数据分组为子集:在上面的例子中,我们最初的 +`tibble` 的 `r nrow(rna)` 观测值根据 `gene` 变量被分成 +`r length(unique(rna$gene))` 组。 -We could similarly decide to group the tibble by the samples: +我们可以类似地决定根据样本对 tibble 进行分组: ```{r} rna %>% - group_by(sample) + group_by(样本) ``` -Here our initial `tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$sample))` groups based on the `sample` variable. +这里我们最初的 `r nrow(rna)` 观测值的 `tibble` 被基于 `sample` 变量分成 +`r length(unique(rna$sample))` 组。 Once the data has been grouped, subsequent operations will be applied on each group independently. -### The `summarise()` function +### `summarise()` 函数 `group_by()` is often used together with `summarise()`, which collapses each group into a single-row summary of that group. -`group_by()` takes as arguments the column names that contain the -**categorical** variables for which you want to calculate the summary -statistics. So to compute the mean `expression` by gene: +`group_by()` 将包含要计算摘要 +**分类** 变量的列名作为参数 +统计数据。 因此要计算基因的平均“表达”: ```{r} rna %>% - group_by(gene) %>% - summarise(mean_expression = mean(expression)) + group_by(基因) %>% + 总结(平均表达 = 平均(表达)) ``` -We could also want to calculate the mean expression levels of all genes in each sample: +我们还可能想计算每个样本中所有基因的平均表达水平: ```{r} rna %>% group_by(sample) %>% - summarise(mean_expression = mean(expression)) + 总结(mean_expression = mean(expression)) ``` -But we can can also group by multiple columns: +但我们也可以按多列分组: ```{r} rna %>% - group_by(gene, infection, time) %>% - summarise(mean_expression = mean(expression)) + group_by(基因、感染、时间) %>% + 总结(平均表达 = 平均(表达)) ``` -Once the data is grouped, you can also summarise multiple variables at the same -time (and not necessarily on the same variable). 
For instance, we could add a -column indicating the median `expression` by gene and by condition: +一旦数据被分组,您还可以在同一 +时间总结多个变量(不一定是同一个变量)。 例如,我们可以添加一个 +列,表示基因和条件的中位“表达”: ```{r, purl=TRUE} rna %>% - group_by(gene, infection, time) %>% - summarise(mean_expression = mean(expression), - median_expression = median(expression)) + group_by(基因, 感染, 时间) %>% + 总结(平均表达 = 平均(表达), + 中位数表达 = 中位数(表达)) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Calculate the mean expression level of gene "Dok3" by timepoints. +按时间点计算基因“Dok3”的平均表达水平。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, purl=TRUE} rna %>% - filter(gene == "Dok3") %>% - group_by(time) %>% - summarise(mean = mean(expression)) + 过滤器(基因 == “Dok3”)%>% + group_by(时间)%>% + 总结(平均值 = 平均值(表达)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -### Counting +### 数数 -When working with data, we often want to know the number of observations found -for each factor or combination of factors. For this task, **`dplyr`** provides -`count()`. For example, if we wanted to count the number of rows of data for -each infected and non-infected samples, we would do: +处理数据时,我们常常想知道每个因素或因素组合的观测值数量 +。 对于这个任务,**`dplyr`** 提供了 +`count()`。 例如,如果我们想要计算每个感染和未感染样本的 +的数据行数,我们会这样做: ```{r, purl=TRUE} rna %>% - count(infection) + 计数(感染) ``` -The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. 
In other words, `rna %>% count(infection)` is equivalent to: +`count()` 函数是我们已经见过的功能的简写:按变量分组,并通过计算该组中的观察次数进行汇总。 换句话说,`rna %>% count(infection)` 等同于: ```{r, purl=TRUE} rna %>% - group_by(infection) %>% - summarise(n = n()) + group_by(infection) %>% + summarise(n = n()) ``` -The previous example shows the use of `count()` to count the number of rows/observations -for _one_ factor (i.e., `infection`). -If we wanted to count a _combination of factors_, such as `infection` and `time`, -we would specify the first and the second factor as the arguments of `count()`: +前面的例子显示了使用 `count()` 来计算_一个_因素(即 `infection`)的行数/观察数 +。 +如果我们想要计算_多种因素的组合_,例如 `infection` 和 `time`, +我们会将第一个和第二个因素指定为 `count()` 的参数: ```{r, purl=TRUE} rna %>% - count(infection, time) + count(infection, time) ``` -which is equivalent to this: +这相当于: ```{r, purl=TRUE} rna %>% - group_by(infection, time) %>% - summarise(n = n()) + group_by(infection, time) %>% + summarise(n = n()) ``` -It is sometimes useful to sort the result to facilitate the comparisons. -We can use `arrange()` to sort the table. -For instance, we might want to arrange the table above by time: +有时对结果进行排序以方便比较是很有用的。 +我们可以使用 `arrange()` 对表格进行排序。 +例如,我们可能想按时间排列上面的表格: ```{r, purl=TRUE} rna %>% - count(infection, time) %>% - arrange(time) + count(infection, time) %>% + arrange(time) ``` -or by counts: +或按计数: ```{r, purl=TRUE} rna %>% - count(infection, time) %>% - arrange(n) + count(infection, time) %>% + arrange(n) ``` -To sort in descending order, we need to add the `desc()` function: +为了按降序排序,我们需要添加 `desc()` 函数: ```{r, purl=TRUE} rna %>% - count(infection, time) %>% - arrange(desc(n)) + count(infection, time) %>% + arrange(desc(n)) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -1. How many genes were analysed in each sample? -2. Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? -3. Pick one sample and evaluate the number of genes by biotype. -4. 
Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. +1. 每个样本分析了多少个基因? +2. 使用 `group_by()` 和 `summarise()` 来评估每个样本中的测序深度(所有计数的总和)。 哪个样本的测序深度最高? +3. 选择一个样本并根据生物型评估基因的数量。 +4. 识别与 "abnormal DNA methylation"(异常 DNA 甲基化)表型描述相关的基因,并计算它们在时间 0、时间 4 和时间 8 的平均表达(以对数表示)。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r} ## 1. @@ -508,94 +508,94 @@ rna %>% rna %>% group_by(sample) %>% summarise(seq_depth = sum(expression)) %>% - arrange(desc(seq_depth)) + arrange(desc(seq_depth)) ## 3. rna %>% filter(sample == "GSM2545336") %>% count(gene_biotype) %>% - arrange(desc(n)) + arrange(desc(n)) ## 4. rna %>% - filter(phenotype_description == "abnormal DNA methylation") %>% + filter(phenotype_description == "abnormal DNA methylation") %>% group_by(gene, time) %>% summarise(mean_expression = mean(log(expression))) %>% - arrange() + arrange() ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Reshaping data +## 重塑数据 -In the `rna` tibble, the rows contain expression values (the unit) that are -associated with a combination of 2 other variables: `gene` and `sample`. +在 `rna` tibble 中,行包含表达值(单位),这些值 +与其他两个变量的组合相关:`gene` 和 `sample`。 All the other columns correspond to variables describing either -the sample (organism, age, sex, ...) or the gene (gene\_biotype, ENTREZ\_ID, product, ...). -The variables that don't change with genes or with samples will have the same value in all the rows. +the sample (organism, age, sex, ...) 或基因(gene\_biotype、ENTREZ\_ID、product、...)。 +不随基因或样本而改变的变量在所有行中具有相同的值。 ```{r} rna %>% - arrange(gene) + arrange(gene) ``` -This structure is called a `long-format`, as one column contains all the values, -and other column(s) list(s) the context of the value. 
+这种结构称为“长格式”,因为一列包含所有值, +,而其他列列出值的上下文。 -In certain cases, the `long-format` is not really "human-readable", and another format, -a `wide-format` is preferred, as a more compact way of representing the data. -This is typically the case with gene expression values that scientists are used to -look as matrices, were rows represent genes and columns represent samples. +在某些情况下,“长格式”并不是真正的“人类可读的”,而另一种格式, +“宽格式”是首选,因为它是一种更紧凑的数据表示方式。 +这通常是基因表达值的情况,科学家习惯将 +视为矩阵,其中行代表基因,列代表样本。 -In this format, it would therefore become straightforward -to explore the relationship between the gene expression levels within, and -between, the samples. +因此,在这种格式下,可以直接通过 +来探索样本内基因表达水平之间的关系,以及通过 +来探索样本间的关系。 ```{r, echo=FALSE} rna %>% - select(gene, sample, expression) %>% - pivot_wider(names_from = sample, - values_from = expression) + 选择(基因,样本,表达)%>% + pivot_wider(names_from = 样本, + values_from = 表达) ``` -To convert the gene expression values from `rna` into a wide-format, -we need to create a new table where the values of the `sample` column would -become the names of column variables. +为了将基因表达值从“rna”转换为宽格式 +,我们需要创建一个新表,其中“sample”列的值将 +成为列变量的名称。 -The key point here is that we are still following -a tidy data structure, but we have **reshaped** the data according to -the observations of interest: expression levels per gene instead -of recording them per gene and per sample. +这里的关键点是,我们仍然遵循 +整洁的数据结构,但是我们已经根据 +感兴趣的观察结果**重塑**了数据:每个基因的表达水平,而不是 +每个基因和每个样本记录它们的表达水平。 -The opposite transformation would be to transform column names into -values of a new variable. +相反的转换是将列名转换为新变量的 +值。 -We can do both these of transformations with two `tidyr` functions, -`pivot_longer()` and `pivot_wider()` (see -[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for -details). 
+我们可以使用两个 `tidyr` 函数完成这两种转换, +`pivot_longer()` 和 `pivot_wider()`(有关 +详细信息,请参阅 +[此处](https://tidyr.tidyverse.org/dev/articles/pivot.html))。 -### Pivoting the data into a wider format +### 将数据转换为更广泛的格式 -Let's select the first 3 columns of `rna` and use `pivot_wider()` -to transform the data into a wide-format. +让我们选择“rna”的前三列并使用“pivot_wider()” +将数据转换为宽格式。 ```{r, purl=TRUE} rna_exp <- rna %>% - select(gene, sample, expression) + 选择(基因,样本,表达) rna_exp ``` -`pivot_wider` takes three main arguments: +`pivot_wider` 有三个主要参数: -1. the data to be transformed; -2. the `names_from` : the column whose values will become new column - names; -3. the `values_from`: the column whose values will fill the new - columns. +1. 要转换的数据; +2. `names_from` :其值将成为新列 + 名称的列; +3. `values_from`:其值将填充新的 + 列的列。 -\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="`rna` 数据的宽枢轴。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") ```` @@ -607,65 +607,65 @@ rna_wide <- rna_exp %>% rna_wide ```` -Note that by default, the `pivot_wider()` function will add `NA` for missing values. +请注意,默认情况下,“pivot_wider()”函数将为缺失值添加“NA”。 -Let's imagine that for some reason, we had some missing expression values for some -genes in certain samples. In the following fictive example, the gene Cyp2d22 has only -one expression value, in GSM2545338 sample. 
+让我们想象一下,由于某种原因,某些样本中的某些 +基因缺少一些表达值。 在以下虚构的例子中,基因 Cyp2d22 在 GSM2545338 样本中只有 +一个表达值。 ```{r, purl=TRUE} rna_with_missing_values <- rna %>% - select(gene, sample, expression) %>% - filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% - filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% - arrange(sample) %>% - filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) + 选择(基因,样本,表达)%>% + 过滤器(基因 %in% c(“Asl”,“Apod”,“Cyp2d22”))%>% + 过滤器(样本 %in% c(“GSM2545336”,“GSM2545337”,“GSM2545338”))%>% + 安排(样本)%>% + 过滤器(!(基因 == “Cyp2d22” & 样本 != “GSM2545338”)) rna_with_missing_values ``` -By default, the `pivot_wider()` function will add `NA` for missing -values. This can be parameterised with the `values_fill` argument of -the `pivot_wider()` function. +默认情况下,`pivot_wider()` 函数将为缺失的 +值添加 `NA`。 这可以通过 +`pivot_wider()` 函数的 `values_fill` 参数进行参数化。 ```{r, purl=TRUE} rna_with_missing_values %>% - pivot_wider(names_from = sample, - values_from = expression) + pivot_wider(names_from = 样本, + values_from = 表达式) rna_with_missing_values %>% - pivot_wider(names_from = sample, - values_from = expression, + pivot_wider(names_from = 样本, + values_from = 表达式, values_fill = 0) ``` -### Pivoting data into a longer format +### 将数据转换为更长的格式 -In the opposite situation we are using the column names and turning them into -a pair of new variables. One variable represents the column names as -values, and the other variable contains the values previously -associated with the column names. +在相反的情况下,我们使用列名并将它们变成 +一对新变量。 一个变量将列名表示为 +值,另一个变量包含先前与列名相关联的 +值。 -`pivot_longer()` takes four main arguments: +`pivot_longer()` 有四个主要参数: -1. the data to be transformed; -2. the `names_to`: the new column name we wish to create and populate with the - current column names; -3. the `values_to`: the new column name we wish to create and populate with - current values; -4. the names of the columns to be used to populate the `names_to` and - `values_to` variables (or to drop). +1. 要转换的数据; +2. 
`names_to`:我们希望创建的新列名,并用 + 当前列名填充; +3. `values_to`:我们希望创建的新列名,并用 + 当前值填充; +4. 用于填充“names_to”和 + “values_to”变量(或删除)的列的名称。 -\`\`\`{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="`rna` 数据的长枢轴。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") ```` -To recreate `rna_long` from `rna_wide` we would create a key -called `sample` and value called `expression` and use all columns -except `gene` for the key variable. Here we drop `gene` column -with a minus sign. +要从 `rna_wide` 重新创建 `rna_long`,我们需要创建一个名为 `sample` 的键 +和一个名为 `expression` 的值,并使用除 `gene` 之外的所有列 +作为键变量。在这里,我们用减号删除 `gene` 列 +。 -Notice how the new variable names are to be quoted here. +请注意这里如何引用新变量名称。 ```{r} rna_long <- rna_wide %>% @@ -675,12 +675,12 @@ rna_long <- rna_wide %>% rna_long ```` -We could also have used a specification for what columns to -include. This can be useful if you have a large number of identifying -columns, and it's easier to specify what to gather than what to leave -alone. Here the `starts_with()` function can help to retrieve sample -names without having to list them all! -Another possibility would be to use the `:` operator! +我们还可以使用规范来指定 +包含哪些列。 如果您有大量可识别的 +列,这将非常有用,而且指定要收集的内容比指定要保留 +的内容更容易。 这里,`starts_with()` 函数可以帮助检索样本 +名称,而无需将它们全部列出! +另一种可能性是使用 `:` 运算符! ```{r} rna_wide %>% @@ -693,10 +693,10 @@ rna_wide %>% GSM2545336:GSM2545380) ``` -Note that if we had missing values in the wide-format, the `NA` would be -included in the new long format. +请注意,如果我们在宽格式中缺少值,则`NA`将为 +,包含在新的长格式中。 -Remember our previous fictive tibble containing missing values: +记住我们之前包含缺失值的虚构 tibble: ```{r} rna_with_missing_values @@ -712,126 +712,126 @@ wide_with_NA %>% -gene) ``` -Pivoting to wider and longer formats can be a useful way to balance out a dataset -so every replicate has the same composition. 
+转向更宽更长的格式可以成为平衡数据集 +的有效方法,这样每个重复都有相同的组成。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Question +## 问题 -Starting from the rna table, use the `pivot_wider()` function to create -a wide-format table giving the gene expression levels in each mouse. -Then use the `pivot_longer()` function to restore a long-format table. +从 rna 表开始,使用 `pivot_wider()` 函数创建 +一个宽格式表,给出每只小鼠的基因表达水平。 +然后使用`pivot_longer()`函数恢复长格式的表。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, answer=TRUE, purl=TRUE} rna1 <- rna %>% -select(gene, mouse, expression) %>% -pivot_wider(names_from = mouse, values_from = expression) +选择(基因,小鼠,表达)%>% +pivot_wider(names_from = 小鼠,values_from = 表达) rna1 rna1 %>% -pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Question +## 问题 -Subset genes located on X and Y chromosomes from the `rna` data frame and -spread the data frame with `sex` as columns, `chromosome_name` as -rows, and the mean expression of genes located in each chromosome as the values, -as in the following tibble: +从 `rna` 数据框中子集位于 X 和 Y 染色体上的基因,并以 +为列、以 `sex` 为列、以 `chromosome_name` 为 +为行、以位于每条染色体上的基因平均表达为值、以 +为值,将数据框扩展到以下表: ```{r, echo=FALSE, message=FALSE} -knitr::include_graphics("fig/Exercise_pivot_W.png") +knitr::include_graphics(“fig/Exercise_pivot_W.png”) ``` -You will need to summarise before reshaping! +重塑之前需要先总结一下! -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Let's first calculate the mean expression level of X and Y linked genes from -male and female samples... 
+让我们首先从 +男性和女性样本中计算 X 和 Y 连锁基因的平均表达水平…… ```{r} rna %>% - filter(chromosome_name == "Y" | chromosome_name == "X") %>% - group_by(sex, chromosome_name) %>% - summarise(mean = mean(expression)) + 过滤器(chromosome_name == “Y” | chromosome_name == “X”)%>% + group_by(性别,chromosome_name)%>% + 总结(平均值 = 平均值(表达式)) ``` -And pivot the table to wide format +并将表格旋转至宽格式 ```{r, answer=TRUE, purl=TRUE} rna_1 <- rna %>% - filter(chromosome_name == "Y" | chromosome_name == "X") %>% - group_by(sex, chromosome_name) %>% - summarise(mean = mean(expression)) %>% - pivot_wider(names_from = sex, - values_from = mean) + 过滤器(chromosome_name == "Y" | chromosome_name == "X") %>% + 分组(性别,chromosome_name) %>% + 总结(平均值 = 平均值(表达式)) %>% + 枢轴_宽(names_from = 性别, + values_from = 平均值) rna_1 ``` -Now take that data frame and transform it with `pivot_longer()` so -each row is a unique `chromosome_name` by `gender` combination. +现在获取该数据框并使用“pivot_longer()”对其进行转换,因此 +每行都是一个按“性别”组合唯一的“chromosome_name”。 ```{r, answer=TRUE, purl=TRUE} rna_1 %>% - pivot_longer(names_to = "gender", - values_to = "mean", + pivot_longer(names_to = "性别", + values_to = "平均值", -chromosome_name) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Question +## 问题 -Use the `rna` dataset to create an expression matrix where each row -represents the mean expression levels of genes and columns represent -the different timepoints. 
+使用 `rna` 数据集创建一个表达矩阵,其中每行 +代表基因的平均表达水平,每列代表 +不同的时间点。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Let's first calculate the mean expression by gene and by time +我们首先计算基因和时间的平均表达 ```{r} rna %>% - group_by(gene, time) %>% - summarise(mean_exp = mean(expression)) + group_by(基因,时间) %>% + 总结(mean_exp = mean(表达式)) ``` -before using the pivot\_wider() function +在使用pivot_wider()函数之前 ```{r} rna_time <- rna %>% - group_by(gene, time) %>% - summarise(mean_exp = mean(expression)) %>% - pivot_wider(names_from = time, - values_from = mean_exp) + group_by(基因,时间) %>% + 总结(平均值表达式 = 平均值(表达式)) %>% + pivot_wider(names_from = 时间, + values_from = 平均值表达式) rna_time ``` -Notice that this generates a tibble with some column names starting by a number. -If we wanted to select the column corresponding to the timepoints, -we could not use the column names directly... What happens when we select the column 4? +请注意,这会生成一个 tibble,其中的一些列名以数字开头。 +如果我们想选择与时间点相对应的列, +我们不能直接使用列名…… 当我们选择第 4 列时会发生什么? ```{r} rna %>% @@ -842,7 +842,7 @@ rna %>% select(gene, 4) ``` -To select the timepoint 4, we would have to quote the column name, with backticks "\\`" +要选择时间点 4,我们必须用反引号“\\`”引用列名称 ```{r} rna %>% @@ -853,8 +853,8 @@ rna %>% select(gene, `4`) ``` -Another possibility would be to rename the column, -choosing a name that doesn't start by a number : +另一种可能性是重命名该列, +选择一个不以数字开头的名称: ```{r} rna %>% @@ -868,180 +868,180 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Question +## 问题 -Use the previous data frame containing mean expression levels per timepoint and create -a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes -between timepoint 8 and timepoint 4. -Convert this table into a long-format table gathering the fold-changes calculated. 
+使用包含每个时间点的平均表达水平的先前数据框并创建一个新列 +,其中包含时间点 8 和时间点 0 之间的倍数变化,以及时间点 8 和时间点 4 之间的倍数变化 +。 +将此表转换为长格式表,收集计算出的倍数变化。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -Starting from the rna\_time tibble: +从 rna\_time tibble 开始: ```{r} -rna_time +RNA时间 ``` -Calculate fold-changes: +计算倍数变化: ```{r} rna_time %>% - mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) + 突变(time_8_vs_0 = `8` / `0`,time_8_vs_4 = `8` / `4`) ``` -And use the pivot\_longer() function: +并使用pivot\_longer()函数: ```{r} rna_time %>% - mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% - pivot_longer(names_to = "comparisons", + 突变(time_8_vs_0 = `8` / `0`,time_8_vs_4 = `8` / `4`)%>% + pivot_longer(names_to = "comparisons", values_to = "Fold_changes", - time_8_vs_0:time_8_vs_4) + time_8_vs_0:time_8_vs_4) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Joining tables +## 连接表 -In many real life situations, data are spread across multiple tables. +在许多现实生活中,数据分布在多个表中。 Usually this occurs because different types of information are collected from different sources. -It may be desirable for some analyses to combine data from two or more -tables into a single data frame based on a column that would be common -to all the tables. +对于某些分析来说,可能需要将两个或多个 +表中的数据基于所有表共有的 +列组合到单个数据框中。 -The `dplyr` package provides a set of join functions for combining two -data frames based on matches within specified columns. Here, we -provide a short introduction to joins. For further reading, please -refer to the chapter about table -joins. The -Data Transformation Cheat -Sheet -also provides a short overview on table joins. +`dplyr` 包提供了一组连接函数,用于根据指定列内的匹配组合两个 +数据框。 在这里,我们 +对连接进行简单介绍。 如需进一步阅读,请 +参阅有关 表 +连接 的章节。 +数据转换秘籍 +表 +还提供了关于表连接的简要概述。 -We are going to illustrate join using a small table, `rna_mini` that -we will create by subsetting the original `rna` table, keeping only 3 -columns and 10 lines. 
+我们将使用一个小表“rna_mini”来说明连接,我们将通过对原始“rna”表进行子集设置来创建它 +,只保留 3 +列和 10 行。 ```{r} rna_mini <- rna %>% - select(gene, sample, expression) %>% - head(10) + 选择(基因,样本,表达) %>% + 头部(10) rna_mini ``` -The second table, `annot1`, contains 2 columns, gene and -gene\_description. You can either -[download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) -by clicking on the link and then moving it to the `data/` folder, or -you can use the R code below to download it directly to the folder. +第二个表“annot1”包含2列,gene 和 +gene\_description。 您可以通过单击链接然后将其移动到“data/”文件夹来 +[下载 annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) +,或者 +您可以使用下面的 R 代码将其直接下载到文件夹。 ```{r, message=FALSE} -download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", - destfile = "data/annot1.csv") -annot1 <- read_csv(file = "data/annot1.csv") +下载文件(url = “https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv”, + destfile = “data/annot1.csv”) +annot1 <- read_csv(file = “data/annot1.csv”) annot1 ``` -We now want to join these two tables into a single one containing all -variables using the `full_join()` function from the `dplyr` package. The -function will automatically find the common variable to match columns -from the first and second table. In this case, `gene` is the common -variable. Such variables are called keys. Keys are used to match -observations across different tables. +我们现在要使用 `dplyr` 包中的 `full_join()` 函数将这两个表合并为一个包含所有 +变量的表。 +函数将自动找到与第一个和第二个表中的 +列匹配的公共变量。 在这种情况下,“基因”是常见的 +变量。 这样的变量被称为键。 键用于匹配不同表之间的 +观察结果。 ```{r} -full_join(rna_mini, annot1) +全连接(rna_mini,annot1) ``` -In real life, gene annotations are sometimes labelled differently. +在现实生活中,基因注释有时会被标记不同。 -The `annot2` table is exactly the same than `annot1` except that the -variable containing gene names is labelled differently. 
Again, either -[download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) -yourself and move it to `data/` or use the R code below. +`annot2` 表与 `annot1` 完全相同,只是包含基因名称的 +变量的标签不同。 再次,要么 +自己 [下载 annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +并将其移动到 `data/`,要么使用下面的 R 代码。 ```{r, message=FALSE} -download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", - destfile = "data/annot2.csv") -annot2 <- read_csv(file = "data/annot2.csv") +下载文件(url = “https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv”, + 目标文件 = “data/annot2.csv”) +annot2 <- read_csv(file = “data/annot2.csv”) annot2 ``` -In case none of the variable names match, we can set manually the -variables to use for the matching. These variables can be set using -the `by` argument, as shown below with `rna_mini` and `annot2` tables. +如果没有匹配的变量名,我们可以手动设置 +变量来用于匹配。 可以使用 +`by` 参数设置这些变量,如下面 `rna_mini` 和 `annot2` 表所示。 ```{r} -full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) +full_join(rna_mini,annot2,by = c(“基因”=“external_gene_name”)) ``` -As can be seen above, the variable name of the first table is retained -in the joined one. +从上可以看出,第一个表的变量名在连接后的表中保留为 +。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge: +## 挑战: -Download the `annot3` table by clicking -[here](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) -and put the table in your data/ repository. Using the `full_join()` -function, join tables `rna_mini` and `annot3`. What has happened for -genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_, and _mt-Tl1_ ? +通过点击 +[此处](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) +下载 `annot3` 表,并将该表放入您的数据/存储库中。 使用 `full_join()` +函数,连接表 `rna_mini` 和 `annot3`。 +基因 _Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_ 和 _mt-Tl1_ 发生了什么? 
-::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, message=FALSE} annot3 <- read_csv("data/annot3.csv") full_join(rna_mini, annot3) ``` -Genes _Klk6_ is only present in `rna_mini`, while genes _mt-Tf_, _mt-Rnr1_, _mt-Tv_, -_mt-Rnr2_, and _mt-Tl1_ are only present in `annot3` table. Their respective values for the -variables of the table have been encoded as missing. +基因 _Klk6_ 仅存在于 `rna_mini` 中,而基因 _mt-Tf_、_mt-Rnr1_、_mt-Tv_、 +_mt-Rnr2_ 和 _mt-Tl1_ 仅存在于 `annot3` 表中。 表中 +变量的各自值已被编码为缺失。 ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Exporting data +## 导出数据 -Now that you have learned how to use `dplyr` to extract information from -or summarise your raw data, you may want to export these new data sets to share -them with your collaborators or for archival. +现在您已经了解了如何使用“dplyr”从 +中提取信息或总结原始数据,您可能希望导出这些新数据集以便与您的合作者共享 +或用于存档。 -Similar to the `read_csv()` function used for reading CSV files into R, there is -a `write_csv()` function that generates CSV files from data frames. +与用于将 CSV 文件读入 R 的 `read_csv()` 函数类似,有一个 +函数 `write_csv()`,可以从数据框生成 CSV 文件。 -Before using `write_csv()`, we are going to create a new folder, `data_output`, -in our working directory that will store this generated dataset. We don't want -to write generated datasets in the same directory as our raw data. -It's good practice to keep them separate. The `data` folder should only contain -the raw, unaltered data, and should be left alone to make sure we don't delete -or modify it. In contrast, our script will generate the contents of the `data_output` -directory, so even if the files it contains are deleted, we can always -re-generate them. 
+在使用 `write_csv()` 之前,我们将在工作目录中创建一个新文件夹 `data_output`, +用于存储生成的数据集。 我们不希望 +将生成的数据集写入与原始数据相同的目录中。 +将它们分开是一种很好的做法。 `data` 文件夹应该只包含 +原始、未改变的数据,并且应该保持不变以确保我们不会删除 +或修改它。 相反,我们的脚本将生成 `data_output` +目录的内容,因此即使它包含的文件被删除,我们也总是可以 +重新生成它们。 -Let's use `write_csv()` to save the rna\_wide table that we have created previously. +让我们使用 `write_csv()` 来保存我们之前创建的 rna\_wide 表。 ```{r, purl=TRUE, eval=FALSE} -write_csv(rna_wide, file = "data_output/rna_wide.csv") +write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: keypoints -- Tabular data in R using the tidyverse meta-package +- 使用 tidyverse 元包在 R 中处理表格数据 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 3b52256bcc834e2c6cf9d3fb607fba1a7785e374 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:43 +0900 Subject: [PATCH 203/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 881 ++++++++++++------------ 1 file changed, 441 insertions(+), 440 deletions(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index 5500e95c3..c7b68831c 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -1,219 +1,219 @@ --- source: Rmd -title: Data visualization +title: Visualisation de données teaching: 60 exercises: 60 --- ```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") ``` ::::::::::::::::::::::::::::::::::::::: objectives -- Produce scatter plots, boxplots, line plots, etc. using ggplot. -- Set universal plot settings. 
-- Describe what faceting is and apply faceting in ggplot. -- Modify the aesthetics of an existing ggplot plot (including axis labels and color). -- Build complex and customized plots from data in a data frame. +- Produisez des nuages de points, des boxplots, des tracés linéaires, etc. en utilisant ggplot. +- Définissez les paramètres de tracé universels. +- Décrivez ce qu'est le facettage et appliquez le facettage dans ggplot. +- Modifiez l'esthétique d'un tracé ggplot existant (y compris les étiquettes des axes et la couleur). +- Créez des tracés complexes et personnalisés à partir de données dans un bloc de données. -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: questions -- Visualization in R +- Visualisation en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: ```{r vis_setup, echo=FALSE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> Cet épisode est basé sur la leçon _Analyse des données et +> Visualisation dans R pour les écologistes_ de Data Carpentries. -## Data Visualization +## Visualisation de données -We start by loading the required packages. **`ggplot2`** is included in -the **`tidyverse`** package. +Nous commençons par charger les packages requis. **`ggplot2`** est inclus dans +le package **`tidyverse`**. ```{r load-package, message=FALSE, purl=TRUE} -library("tidyverse") +library("tidyverse") ``` -If not still in the workspace, load the data we saved in the previous -lesson. +Si elles ne sont plus dans l'espace de travail, chargez les données que nous avons enregistrées dans la leçon +précédente. 
```{r load-data, eval=FALSE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -The Data Visualization Cheat -Sheet -will cover the basics and more advanced features of `ggplot2` and will -help, in addition to serve as a reminder, getting an overview of the -many data representations available in the package. The following video -tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and -[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen -are also very instructive. +La Data Visualization Cheat Sheet -## Plotting with `ggplot2` +couvrira les bases et les fonctionnalités plus avancées de `ggplot2` et +aidera, en plus de servir de rappel, à obtenir un aperçu des +nombreuses représentations de données disponibles dans le package. Les didacticiels vidéo suivants +([partie 1](https://www.youtube.com/watch?v=h29g21z0a68) et +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) de Thomas Lin Pedersen +sont également très instructifs. -`ggplot2` is a plotting package that makes it simple to create complex -plots from data in a data frame. It provides a more programmatic -interface for specifying what variables to plot, how they are displayed, -and general visual properties. The theoretical foundation that supports -the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this -approach, we only need minimal changes if the underlying data change or -if we decide to change from a bar plot to a scatterplot. This helps in -creating publication quality plots with minimal amounts of adjustments -and tweaking. +## Tracer avec `ggplot2` -There is a book about `ggplot2` (@ggplot2book) that provides a good -overview, but it is outdated. The 3rd edition is in preparation and will -be [freely available online](https://ggplot2-book.org/). The `ggplot2` -webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. 
+`ggplot2` est un package de traçage qui simplifie la création de tracés +complexes à partir de données dans un bloc de données. Il fournit une interface +plus programmatique pour spécifier les variables à tracer, comment elles sont affichées, +et les propriétés visuelles générales. La fondation théorique qui prend en charge +le `ggplot2` est la _Grammar of Graphics_ (@Wilkinson :2005). En utilisant cette approche +, nous n'avons besoin que de changements minimes si les données sous-jacentes changent ou +si nous décidons de passer d'un diagramme à barres à un nuage de points. Cela aide à +créer des tracés de qualité de publication avec un minimum d'ajustements +et de peaufinages. -`ggplot2` functions like data in the 'long' format, i.e., a column for -every dimension, and a row for every observation. Well-structured data -will save you lots of time when making figures with `ggplot2`. +Il existe un livre sur `ggplot2` (@ggplot2book) qui fournit un bon aperçu de +, mais il est obsolète. La 3ème édition est en préparation et sera +[disponible gratuitement en ligne](https://ggplot2-book.org/). La page Web `ggplot2` +([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) fournit une documentation abondante. -ggplot graphics are built step by step by adding new elements. Adding -layers in this fashion allows for extensive flexibility and -customization of plots. +`ggplot2` fonctionne comme des données au format « long », c'est-à-dire une colonne pour +chaque dimension et une ligne pour chaque observation. Des données bien structurées +vous feront gagner beaucoup de temps lors de la création de chiffres avec `ggplot2`. -> The idea behind the Grammar of Graphics it is that you can build every -> graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] +Les graphiques ggplot sont construits étape par étape en ajoutant de nouveaux éléments. 
L'ajout de +couches de cette manière permet une grande flexibilité et une +personnalisation des tracés. -[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). +> L'idée derrière la grammaire des graphiques est que vous pouvez construire chaque graphique +> à partir des 3 mêmes composants : (1) un ensemble de données, (2) un système de coordonnées, +> et (3) des géoms, c'est-à-dire des marques visuelles qui représentent des points de données \[^three\\_comp\\_ggplot2] -To build a ggplot, we will use the following basic template that can be -used for different types of plots: +[^three_comp_ggplot2]: Source : [Aide-mémoire pour la visualisation des données](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). + +Pour construire un ggplot, nous utiliserons le modèle de base suivant qui peut être +utilisé pour différents types de tracés : ``` ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() ``` -- use the `ggplot()` function and bind the plot to a specific **data - frame** using the `data` argument +- utilisez la fonction `ggplot()` et liez le tracé à un **bloc de + données** spécifique en utilisant l'argument `data` ```{r, eval=FALSE} -ggplot(data = rna) +ggplot(data = rna) ``` -- define a **mapping** (using the aesthetic (`aes`) function), by - selecting the variables to be plotted and specifying how to present - them in the graph, e.g. as x/y positions or characteristics such as - size, shape, color, etc. +- définissez un **mapping** (en utilisant la fonction esthétique (`aes`)), en + sélectionnant les variables à tracer et en spécifiant comment les présenter + dans le graphique, par exemple sous la forme de positions x/y ou de caractéristiques telles que + la taille, la forme, la couleur, etc. 
```{r, eval=FALSE} ggplot(data = rna, mapping = aes(x = expression)) ``` -- add '**geoms**' - geometries, or graphical representations of the - data in the plot (points, lines, bars). `ggplot2` offers many - different geoms; we will use some common ones today, including: +- ajoutez '**geoms**' - géométries ou représentations graphiques des données + dans le tracé (points, lignes, barres). `ggplot2` propose de nombreuses + géométries différentes ; nous en utiliserons quelques-uns courants aujourd’hui, notamment : ``` - * `geom_point()` for scatter plots, dot plots, etc. - * `geom_histogram()` for histograms - * `geom_boxplot()` for, well, boxplots! - * `geom_line()` for trend lines, time series, etc. + * `geom_point()` pour les nuages de points, les diagrammes de points, etc. + * `geom_histogram()` pour les histogrammes + * `geom_boxplot()` pour, eh bien, les boxplots ! + * `geom_line()` pour les lignes de tendance, les séries chronologiques, etc. ``` -To add a geom(etry) to the plot use the `+` operator. Let's use -`geom_histogram()` first: +Pour ajouter une géométrie au tracé, utilisez l'opérateur `+`. Utilisons d'abord +`geom_histogram()` : ```{r first-ggplot, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(x = expression)) + geom_histogram() ``` -The `+` in the `ggplot2` package is particularly useful because it -allows you to modify existing `ggplot` objects. This means you can -easily set up plot templates and conveniently explore different types of -plots, so the above plot can also be generated with code like this: +Le `+` dans le package `ggplot2` est particulièrement utile car il +vous permet de modifier les objets `ggplot` existants. 
Cela signifie que vous pouvez +facilement configurer des modèles de tracé et explorer aisément différents types de +tracés, de sorte que le tracé ci-dessus peut également être généré avec un code comme celui-ci : ```{r, eval=FALSE, purl=TRUE} -# Assign plot to a variable +# Attribuer un tracé à une variable rna_plot <- ggplot(data = rna, mapping = aes(x = expression)) -# Draw the plot -rna_plot + geom_histogram() +# Dessiner le tracé +rna_plot + geom_histogram() ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -You have probably noticed an automatic message that appears when -drawing the histogram: +Vous avez probablement remarqué un message automatique qui apparaît lors du +tracé de l'histogramme : ```{r, echo=FALSE, fig.show="hide"} ggplot(rna, aes(x = expression)) + geom_histogram() ``` -Change the arguments `bins` or `binwidth` of `geom_histogram()` to -change the number or width of the bins. +Modifiez les arguments `bins` ou `binwidth` de `geom_histogram()` pour +changer le nombre ou la largeur des classes. ::::::::::::::: solution ## Solution ```{r, purl=TRUE} -# change bins +# changer le nombre de classes ggplot(rna, aes(x = expression)) + geom_histogram(bins = 15) -# change binwidth -ggplot(rna, aes(x = expression)) + +# changer la largeur des classes +ggplot(rna, aes(x = expression)) + geom_histogram(binwidth = 2000) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -We can observe here that the data are skewed to the right. We can apply -log2 transformation to have a more symmetric distribution. Note that we -add here a small constant value (`+1`) to avoid having `-Inf` values -returned for expression values equal to 0. +Nous pouvons observer ici que les données présentent une asymétrie à droite. Nous pouvons appliquer la transformation +log2 pour avoir une distribution plus symétrique.
Notez que nous +ajoutons ici une petite valeur constante (`+1`) pour éviter que des valeurs `-Inf` +soient renvoyées pour les valeurs d'expression égales à 0. ```{r log-transfo, cache=FALSE, purl=TRUE} -rna <- rna %>% - mutate(expression_log = log2(expression + 1)) +rna <- rna %>% + mutate(expression_log = log2(expression + 1)) ``` -If we now draw the histogram of the log2-transformed expressions, the -distribution is indeed closer to a normal distribution. +Si l'on dessine maintenant l'histogramme des expressions transformées en log2, la distribution +est en effet plus proche d'une distribution normale. ```{r second-ggplot, cache=FALSE, purl=TRUE} ggplot(rna, aes(x = expression_log)) + geom_histogram() ``` -From now on we will work on the log-transformed expression values. +À partir de maintenant, nous travaillerons sur les valeurs d’expression transformées en log. -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Another way to visualize this transformation is to consider the scale -of the observations. For example, it may be worth changing the scale -of the axis to better distribute the observations in the space of the -plot. Changing the scale of the axes is done similarly to -adding/modifying other components (i.e., by incrementally adding -commands). Try making this modification: +Une autre façon de visualiser cette transformation est de considérer l’échelle +des observations. Par exemple, il peut être intéressant de changer l'échelle +de l'axe pour mieux répartir les observations dans l'espace du +tracé. Changer l'échelle des axes se fait de la même manière que pour +ajouter/modifier d'autres composants (c'est-à-dire en ajoutant progressivement des +commandes). Essayez de faire cette modification : -- Represent the un-transformed expression on the log10 scale; see - `scale_x_log10()`. Compare it with the previous graph. Why do you - now have warning messages appearing?
+- Représentez l'expression non transformée sur l'échelle log10 ; voir + `scale_x_log10()`. Comparez-la avec le graphique précédent. Pourquoi + des messages d'avertissement apparaissent-ils maintenant ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -225,17 +225,17 @@ ggplot(data = rna,mapping = aes(x = expression))+ ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -**Notes** +**Remarques** -- Anything you put in the `ggplot()` function can be seen by any geom - layers that you add (i.e., these are global plot settings). This - includes the x- and y-axis mapping you set up in `aes()`. -- You can also specify mappings for a given geom independently of the - mappings defined globally in the `ggplot()` function. -- The `+` sign used to add new layers must be placed at the end of the - line containing the _previous_ layer. If, instead, the `+` sign is +- Tout ce que vous mettez dans la fonction `ggplot()` peut être vu par n'importe quelle couche geom + que vous ajoutez (c'est-à-dire qu'il s'agit de paramètres de tracé globaux). Cela + inclut le mappage des axes x et y que vous avez configuré dans `aes()`. +- Vous pouvez également spécifier des mappages pour un geom donné indépendamment des mappages + définis globalement dans la fonction `ggplot()`. +- Le signe `+` utilisé pour ajouter de nouvelles couches doit être placé à la fin de la ligne + contenant la couche _précédente_. If, instead, the `+` sign is added at the beginning of the line containing the new layer, `ggplot2` will not add the new layer and will return an error message. @@ -250,58 +250,58 @@ rna_plot + geom_histogram() ``` -## Building your plots iteratively +## Construire vos graphiques de manière itérative -We will now draw a scatter plot with two continuous variables and the -`geom_point()` function.
This graph will represent the log2 fold changes -of expression comparing time 8 versus time 0, and time 4 versus time 0. -To this end, we first need to compute the means of the log-transformed -expression values by gene and time, then the log fold changes by -subtracting the mean log expressions between time 8 and time 0 and -between time 4 and time 0. Note that we also include here the gene -biotype that we will use later on to represent the genes. We will save -the fold changes in a new data frame called `rna_fc.` +Nous allons maintenant dessiner un nuage de points avec deux variables continues et la fonction +`geom_point()`. Ce graphique représentera les log2 fold changes +de l'expression comparant le temps 8 au temps 0 et le temps 4 au temps 0. +À cette fin, nous devons d'abord calculer les moyennes des valeurs d'expression +transformées en log par gène et par temps, puis les log fold changes +en soustrayant les expressions moyennes en log entre le temps 8 et le temps 0 et +entre le temps 4 et le temps 0. Notez que nous incluons également ici le biotype du gène +que nous utiliserons plus tard pour représenter les gènes. Nous enregistrerons +les fold changes dans un nouveau data frame appelé `rna_fc`. ```{r rna_fc, cache=FALSE, purl=TRUE} rna_fc <- rna %>% select(gene, time, gene_biotype, expression_log) %>% group_by(gene, time, gene_biotype) %>% - summarize(mean_exp = mean(expression_log)) %>% - pivot_wider(names_from = time, - values_from = mean_exp) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) ``` -We can then build a ggplot with the newly created dataset `rna_fc`. -Building plots with `ggplot2` is typically an iterative process. We -start by defining the dataset we'll use, lay out the axes, and choose a -geom: +Nous pouvons ensuite construire un ggplot avec l'ensemble de données nouvellement créé `rna_fc`.
+Construire des graphiques avec `ggplot2` est généralement un processus itératif. Nous +commençons par définir l'ensemble de données que nous allons utiliser, tracer les axes et choisir un +geom : ```{r create-ggplot-object, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point() ``` -Then, we start modifying this plot to extract more information from it. -For instance, we can add transparency (`alpha`) to avoid overplotting: +Ensuite, nous commençons à modifier ce tracé pour en extraire plus d’informations. +Par exemple, nous pouvons ajouter de la transparence (`alpha`) pour éviter le surtraçage : ```{r adding-transparency, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3) ``` -We can also add colors for all the points: +On peut également ajouter des couleurs pour tous les points : ```{r adding-colors, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3, color = "blue") ``` -Or to color each gene in the plot differently, you could use a vector as -an input to the argument **color**. `ggplot2` will provide a different -color corresponding to different values in the vector. Here is an -example where we color with `gene_biotype`: +Ou pour colorer différemment chaque gène du tracé, vous pouvez utiliser un vecteur comme +entrée dans l'argument **color**. `ggplot2` fournira une couleur +différente correspondant à différentes valeurs dans le vecteur. Voici un +exemple où nous colorons avec `gene_biotype` : ```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + @@ -309,9 +309,9 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ``` -We can also specify the colors directly inside the mapping provided in -the `ggplot()` function.
This will be seen by any geom layers and the -mapping will be determined by the x- and y-axis set up in `aes()`. +Nous pouvons également spécifier les couleurs directement à l'intérieur du mappage fourni dans +la fonction `ggplot()`. Cela sera visible par toutes les couches geom et le mappage +sera déterminé par les axes x et y définis dans `aes()`. ```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, @@ -319,8 +319,8 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, geom_point(alpha = 0.3) ``` -Finally, we could also add a diagonal line with the `geom_abline()` -function: +Enfin, nous pourrions également ajouter une ligne diagonale avec la fonction +`geom_abline()` : ```{r adding-diag, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, @@ -329,8 +329,8 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, geom_abline(intercept = 0) ``` -Notice that we can change the geom layer from `geom_point` to -`geom_jitter` and colors will still be determined by `gene_biotype`. +Notez que nous pouvons remplacer la couche geom `geom_point` par +`geom_jitter` et les couleurs seront toujours déterminées par `gene_biotype`. ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, @@ -340,31 +340,31 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, ``` ```{r, echo=FALSE, message=FALSE} -library("hexbin") +library("hexbin") ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Scatter plots can be useful exploratory tools for small datasets.
-One strategy for handling such settings is to use hexagonal binning of -observations. The plot space is tessellated into hexagons. Each -hexagon is assigned a color based on the number of observations that -fall within its boundaries. +Les nuages de points peuvent être des outils d’exploration utiles pour de petits ensembles de données. Pour les ensembles de données +avec un grand nombre d'observations, tels que l'ensemble de données `rna_fc` +, le surtraçage des points peut constituer une limitation des nuages de points. +Une stratégie pour gérer de tels paramètres consiste à utiliser le regroupement hexagonal d'observations +. L’espace de l’intrigue est divisé en hexagones. Chaque +hexagone se voit attribuer une couleur en fonction du nombre d'observations qui +tombent dans ses limites. -- To use hexagonal binning in `ggplot2`, first install the R package - `hexbin` from CRAN and load it. +- Pour utiliser le regroupement hexagonal dans `ggplot2`, installez d'abord le package R + `hexbin` depuis CRAN et chargez-le. -- Then use the `geom_hex()` function to produce the hexbin figure. +- Utilisez ensuite la fonction `geom_hex()` pour produire la figure hexbin. -- What are the relative strengths and weaknesses of a hexagonal bin - plot compared to a scatter plot? Examine the above scatter plot - and compare it with the hexagonal bin plot that you created. +- Quelles sont les forces et les faiblesses relatives d'un diagramme hexagonal + par rapport à un nuage de points ? Examinez le nuage de points ci-dessus + et comparez-le avec le diagramme hexagonal que vous avez créé. 
-::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -377,23 +377,23 @@ library("hexbin") ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_hex() + - geom_abline(intercept = 0) + geom_abline(intercept = 0) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Use what you just learned to create a scatter plot of `expression_log` -over `sample` from the `rna` dataset with the time showing in -different colors. Is this a good way to show this type of data? +Utilisez ce que vous venez d'apprendre pour créer un nuage de points de `expression_log` +sur `sample` à partir de l'ensemble de données `rna` avec le temps représenté par +différentes couleurs. Est-ce une bonne façon d’afficher ce type de données ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -404,12 +404,12 @@ ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Boxplot +## Boîte à moustaches -We can use boxplots to visualize the distribution of gene expressions -within each sample: +Nous pouvons utiliser des boxplots pour visualiser la distribution des expressions géniques +au sein de chaque échantillon : ```{r boxplot, cache=FALSE, purl=TRUE} ggplot(data = rna, @@ -417,66 +417,66 @@ ggplot(data = rna, geom_boxplot() ``` -By adding points to boxplot, we can have a better idea of the number of -measurements and of their distribution: +En ajoutant des points au boxplot, on peut avoir une meilleure idée du nombre de +mesures et de leur répartition : ```{r boxplot-with-points, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") +
- geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Note how the boxplot layer is in front of the jitter layer? What do -you need to change in the code to put the boxplot below the points? +Avez-vous remarqué que la couche de boîte à moustaches se trouve devant la couche jitter ? Que devez-vous +modifier dans le code pour placer le boxplot sous les points ? -::::::::::::::: solution +::::::::::::::: solution ## Solution -We should switch the order of these two geoms: +Nous devrions inverser l'ordre de ces deux geoms : ```{r boxplot-with-points2, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_boxplot(alpha = 0) + - geom_jitter(alpha = 0.2, color = "tomato") + geom_jitter(alpha = 0.2, color = "tomato") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -You may notice that the values on the x-axis are still not properly -readable. Let's change the orientation of the labels and adjust them -vertically and horizontally so they don't overlap. You can use a -90-degree angle, or experiment to find the appropriate angle for -diagonally oriented labels: +Vous remarquerez peut-être que les valeurs sur l'axe des x ne sont toujours pas correctement +lisibles. Modifions l'orientation des étiquettes et ajustons-les +verticalement et horizontalement afin qu'elles ne se chevauchent pas.
Vous pouvez utiliser un angle de +90 degrés ou expérimenter pour trouver l'angle approprié pour +les étiquettes orientées en diagonale : ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") + - geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Add color to the data points on your boxplot according to the duration -of the infection (`time`). +Ajoutez de la couleur aux points de données sur votre boxplot en fonction de la durée +de l'infection (`time`). -_Hint:_ Check the class for `time`. Consider changing the class of -`time` from integer to factor directly in the ggplot mapping. Why does -this change how R makes the graph? +_Indice :_ Vérifiez la classe de `time`. Envisagez de changer la classe de +`time` d'entier (integer) en facteur (factor) directement dans le mappage ggplot. Pourquoi +cela change-t-il la façon dont R crée le graphique ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -500,21 +500,21 @@ ggplot(data = rna, ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Boxplots are useful summaries, but hide the _shape_ of the -distribution. For example, if the distribution is bimodal, we would -not see it in a boxplot. An alternative to the boxplot is the violin -plot, where the shape (of the density of points) is drawn. +Les boxplots sont des résumés utiles, mais cachent la _forme_ de la +distribution.
Par exemple, si la distribution est bimodale, nous ne la verrions +pas dans un boxplot. Une alternative au boxplot est le tracé en +violon, où la forme (de la densité de points) est dessinée. -- Replace the box plot with a violin plot; see `geom_violin()`. Fill - in the violins according to the time with the argument `fill`. +- Remplacez la boîte à moustaches par un tracé en violon ; voir `geom_violin()`. Remplissez + les violons en fonction du temps avec l'argument `fill`. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -522,20 +522,20 @@ plot, where the shape (of the density of points) is drawn. ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_violin(aes(fill = as.factor(time))) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -- Modify the violin plot to fill in the violins by `sex`. +- Modifiez le tracé en violon pour remplir les violons selon `sex`. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -543,20 +543,20 @@ ggplot(data = rna, ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_violin(aes(fill = sex)) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Line plots +## Tracés linéaires -Let's calculate the mean expression per duration of the infection for -the 10 genes having the highest log fold changes comparing time 8 versus -time 0.
First, we need to select the genes and create a subset of `rna` -called `sub_rna` containing the 10 selected genes, then we need to group -the data and calculate the mean gene expression within each group: +Calculons l'expression moyenne par durée de l'infection pour +les 10 gènes ayant les log fold changes les plus élevés en comparant le temps 8 au +temps 0. Tout d'abord, nous devons sélectionner les gènes et créer un sous-ensemble de `rna` +appelé `sub_rna` contenant les 10 gènes sélectionnés, puis nous devons regrouper +les données et calculer l'expression moyenne des gènes dans chaque groupe : ```{r, purl=TRUE} rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) @@ -573,205 +573,206 @@ mean_exp_by_time <- sub_rna %>% mean_exp_by_time ``` -We can build the line plot with duration of the infection on the x-axis -and the mean expression on the y-axis: +Nous pouvons construire le tracé linéaire avec la durée de l'infection sur l'axe des x +et l'expression moyenne sur l'axe des y : ```{r first-time-series, purl=TRUE} -ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + geom_line() ``` -Unfortunately, this does not work because we plotted data for all the -genes together. We need to tell ggplot to draw a line for each gene by -modifying the aesthetic function to include `group = gene`: +Malheureusement, cela ne fonctionne pas car nous avons tracé ensemble les données de tous les +gènes.
Nous devons dire à ggplot de tracer une ligne pour chaque gène en +modifiant la fonction esthétique pour inclure `group = gene` : ```{r time-series-by-gene, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, group = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + geom_line() ``` -We will be able to distinguish genes in the plot if we add colors (using -`color` also automatically groups the data): +Nous pourrons distinguer les gènes dans le tracé si nous ajoutons des couleurs (l'utilisation de +`color` regroupe également automatiquement les données) : ```{r time-series-with-colors, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, color = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() ``` -## Faceting +## Facettage -`ggplot2` has a special technique called _faceting_ that allows the user -to split one plot into multiple (sub) plots based on a factor included -in the dataset. These different subplots inherit the same properties -(axes limits, ticks, ...) to facilitate their direct comparison. We will -use it to make a line plot across time for each gene: +`ggplot2` a une technique spéciale appelée _faceting_ qui permet à l'utilisateur +de diviser un tracé en plusieurs (sous-)tracés en fonction d'un facteur inclus +dans l'ensemble de données. Ces différents sous-tracés héritent des mêmes propriétés +(limites des axes, ticks, ...) pour faciliter leur comparaison directe.
Nous allons +l'utiliser pour créer un tracé linéaire dans le temps pour chaque gène : ```{r first-facet, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~ gene) +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) ``` -Here both x- and y-axis have the same scale for all the subplots. You -can change this default behavior by modifying `scales` in order to allow -a free scale for the y-axis: +Ici, les axes x et y ont la même échelle pour tous les sous-tracés. Vous +pouvez changer ce comportement par défaut en modifiant `scales` afin d'autoriser +une échelle libre pour l'axe y : ```{r first-facet-scales, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + facet_wrap(~ gene, scales = "free_y") ``` -Now we would like to split the line in each plot by the sex of the mice. -To do that we need to calculate the mean expression in the data frame -grouped by `gene`, `time`, and `sex`: +Nous aimerions maintenant diviser la ligne dans chaque tracé selon le sexe des souris.
+Pour ce faire, nous devons calculer l'expression moyenne dans le data frame +regroupé par `gene`, `time` et `sex` : ```{r data-facet-by-gene-and-sex, purl=TRUE} -mean_exp_by_time_sex <- sub_rna %>% +mean_exp_by_time_sex <- sub_rna %>% group_by(gene, time, sex) %>% - summarize(mean_exp = mean(expression_log)) + summarize(mean_exp = mean(expression_log)) -mean_exp_by_time_sex +mean_exp_by_time_sex ``` -We can now make the faceted plot by splitting further by sex using -`color` (within a single plot): +Nous pouvons maintenant créer le tracé à facettes en le divisant davantage par sexe en utilisant +`color` (au sein d'un seul tracé) : ```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + facet_wrap(~ gene, scales = "free_y") ``` -Usually plots with white background look more readable when printed. We -can set the background to white using the function `theme_bw()`. -Additionally, we can remove the grid: +Généralement, les tracés sur fond blanc semblent plus lisibles une fois imprimés. Nous +pouvons définir l'arrière-plan en blanc en utilisant la fonction `theme_bw()`.
+De plus, nous pouvons supprimer la grille : ```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Use what you just learned to create a plot that depicts how the -average expression of each chromosome changes through the duration of -infection. +Utilisez ce que vous venez d'apprendre pour créer un graphique illustrant comment l'expression moyenne +de chaque chromosome change au cours de la durée de +l'infection. -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r mean-exp-chromosome-time-series, purl=TRUE} -mean_exp_by_chromosome <- rna %>% +mean_exp_by_chromosome <- rna %>% group_by(chromosome_name, time) %>% - summarize(mean_exp = mean(expression_log)) + summarize(mean_exp = mean(expression_log)) -ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, - y = mean_exp)) + +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, + y = mean_exp)) + geom_line() + facet_wrap(~ chromosome_name, scales = "free_y") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -The `facet_wrap` geometry extracts plots into an arbitrary number of -dimensions to allow them to cleanly fit on one page. On the other hand, +La géométrie `facet_wrap` extrait les tracés dans un nombre arbitraire de +dimensions pour leur permettre de s'adapter proprement à une seule page.
On the other hand, the `facet_grid` geometry allows you to explicitly specify how you want your plots to be arranged via formula notation (`rows ~ columns`; a `.` can be used as a placeholder that indicates only one row or column). -Let's modify the previous plot to compare how the mean gene expression -of males and females has changed through time: +Modifions le graphique précédent pour comparer l'évolution de l'expression génique moyenne +des mâles et des femelles au fil du temps : ```{r mean-exp-time-facet-sex-rows, purl=TRUE} -# One column, facet by rows -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + +# Une colonne, facette par lignes +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + - facet_grid(sex ~ .) + facet_grid(sex ~ .) ``` ```{r mean-exp-time-facet-sex-columns, purl=TRUE} -# One row, facet by column -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + +# Une ligne, facette par colonne +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + - facet_grid(. ~ sex) + facet_grid(. ~ sex) ``` -## `ggplot2` themes +## Thèmes `ggplot2` -In addition to `theme_bw()`, which changes the plot background to white, -`ggplot2` comes with several other themes which can be useful to quickly -change the look of your visualization. The complete list of themes is -available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). -`theme_minimal()` and `theme_light()` are popular, and `theme_void()` -can be useful as a starting point to create a new hand-crafted theme. +En plus de `theme_bw()`, qui change l'arrière-plan du tracé en blanc, +`ggplot2` est livré avec plusieurs autres thèmes qui peuvent être utiles pour +changer rapidement l'apparence de votre visualisation.
La liste complète des thèmes est +disponible sur [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). +`theme_minimal()` et `theme_light()` sont populaires, et `theme_void()` +peut être utile comme point de départ pour créer un nouveau thème fait main. -The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) -package provides a wide variety of options (including an Excel 2003 -theme). The ggplot2 provides a list of +Le package [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +fournit une grande variété d'options (y compris un thème Excel 2003). The ggplot2 provides a list of packages that extend the capabilities of `ggplot2`, including additional themes. -## Customisation +## Personnalisation -Let's come back to the faceted plot of mean expression by time and gene, -colored by sex. +Revenons au graphique à facettes de l'expression moyenne par temps et par gène, +coloré par sexe. -Take a look at the ggplot2, -and think of ways you could improve the plot. +Jetez un œil à l'aide-mémoire ggplot2, +et réfléchissez aux moyens d'améliorer le graphique.
-Now, we can change names of axes to something more informative than -'time' and 'mean\_exp', and add a title to the figure: +Maintenant, nous pouvons donner aux axes des noms plus informatifs que +'time' et 'mean\_exp', et ajouter un titre à la figure : ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + labs(title = "Expression moyenne des gènes selon la durée de l'infection", + x = "Durée de l'infection (en jours)", + y = "Expression moyenne des gènes") ``` -The axes have more informative names, but their readability can be -improved by increasing the font size: +Les axes ont des noms plus informatifs, mais leur lisibilité peut être +améliorée en augmentant la taille de la police : ```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + + labs(title = "Expression moyenne des gènes selon la durée de l'infection", + x = "Durée de l'infection (en jours)", + y = "Expression moyenne des gènes") + theme(text = element_text(size = 16)) ``` -Note that it is
also possible to change the fonts of your plots. If you -are on Windows, you may have to install the . +Notez qu'il est également possible de changer les polices de vos graphiques. Si vous +êtes sous Windows, vous devrez peut-être installer le +[package **`extrafont`**](https://cran.r-project.org/web/packages/extrafont/index.html). -We can further customize the color of x- and y-axis text, the color of -the grid, etc. We can also for example move the legend to the top by -setting `legend.position` to `"top"`. +Nous pouvons personnaliser davantage la couleur du texte des axes x et y, la couleur de +la grille, etc. Nous pouvons aussi par exemple déplacer la légende vers le haut en +définissant `legend.position` sur `"top"`. ```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, @@ -790,10 +791,10 @@ ggplot(data = mean_exp_by_time_sex, legend.position = "top") ``` -If you like the changes you created better than the default theme, you -can save them as an object to be able to easily apply them to other -plots you may create. Here is an example with the histogram we have -previously created. +Si vous préférez vos modifications au thème par défaut, vous pouvez +les enregistrer en tant qu'objet pour pouvoir les appliquer facilement à d'autres +graphiques que vous pourriez créer. Voici un exemple avec l'histogramme que nous avons +créé précédemment. ```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", @@ -808,39 +809,39 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -With all of this information in hand, please take another five minutes -to either improve one of the plots generated in this exercise or -create a beautiful graph of your own. Use the RStudio ggplot2 -for inspiration.
Here are some ideas: +Avec toutes ces informations en main, veuillez prendre encore cinq minutes +pour soit améliorer l'un des tracés générés dans cet exercice, soit +créer votre propre graphique. Utilisez la feuille de triche RStudio ggplot2 +pour vous inspirer. Voici quelques idées : -- See if you can change the thickness of the lines. -- Can you find a way to change the name of the legend? What about - its labels? (hint: look for a ggplot function starting with +- Voyez si vous pouvez modifier l’épaisseur des lignes. +- Pouvez-vous trouver un moyen de changer le nom de la légende ? Qu'en est-il de + ses étiquettes ? (indice : recherchez une fonction ggplot commençant par `scale_`) -- Try using a different color palette or manually specifying the - colors for the lines (see +- Essayez d'utiliser une palette de couleurs différente ou de spécifier manuellement les couleurs + pour les lignes (voir [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). 
-::::::::::::::: solution +::::::::::::::: solution ## Solution -For example, based on this plot: +Par exemple, à partir de ce graphique : ```{r, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) ``` -We can customize it the following ways: +Nous pouvons le personnaliser des manières suivantes : ```{r, purl=TRUE} # change the thickness of the lines @@ -883,55 +884,55 @@ ggplot(data = mean_exp_by_time_sex, ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Composing plots +## Composer des graphiques Faceting is a great tool for splitting one plot into multiple subplots, but sometimes you may want to produce a single figure that contains multiple independent plots, i.e. plots that are based on different variables or even different data frames. -Let's start by creating the two plots that we want to arrange next to -each other: +Commençons par créer les deux graphiques que nous souhaitons disposer l'un à côté de l'autre : -The first graph counts the number of unique genes per chromosome. We -first need to reorder the levels of `chromosome_name` and filter the -unique genes per chromosome. We also change the scale of the y-axis to a -log10 scale for better readability. +Le premier graphique compte le nombre de gènes uniques par chromosome. Nous +devons d'abord réordonner les niveaux de `chromosome_name` et filtrer les +gènes uniques par chromosome. Nous passons également l'axe y en échelle +log10 pour une meilleure lisibilité.
```{r sub1, purl=TRUE} -rna$chromosome_name <- factor(rna$chromosome_name, - levels = c(1:19,"X","Y")) +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) -count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% distinct() %>% ggplot() + geom_bar(aes(x = chromosome_name), fill = "seagreen", - position = "dodge", stat = "count") + - labs(y = "log10(n genes)", x = "chromosome") + + position = "dodge", stat = "count") + + labs(y = "log10(n gènes)", x = "chromosome") + scale_y_log10() count_gene_chromosome ``` -Below, we also remove the legend altogether by setting the -`legend.position` to `"none"`. +Ci-dessous, nous supprimons également complètement la légende en définissant +`legend.position` sur `"none"`. ```{r sub2, purl=TRUE} exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), color=sex)) + geom_boxplot(alpha = 0) + - labs(y = "Mean gene exp", + labs(y = "Exp moyenne du gène", x = "time") + theme(legend.position = "none") exp_boxplot_sex ``` -The [**patchwork**](https://github.com/thomasp85/patchwork) package -provides an elegant approach to combining figures using the `+` to -arrange figures (typically side by side). More specifically the `|` -explicitly arranges them side by side and `/` stacks them on top of each -other. +Le package [**patchwork**](https://github.com/thomasp85/patchwork) +fournit une approche élégante pour combiner des figures en utilisant `+` pour +disposer les figures (généralement côte à côte). Plus précisément, `|` +les dispose explicitement côte à côte et `/` les empile les unes sur les autres.
```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} install.packages("patchwork") @@ -940,15 +941,15 @@ install.packages("patchwork") ```{r patchworkplot1, purl=TRUE} library("patchwork") count_gene_chromosome + exp_boxplot_sex -## or count_gene_chromosome | exp_boxplot_sex +## ou count_gene_chromosome | exp_boxplot_sex ``` ```{r patchwork2, purl=TRUE} count_gene_chromosome / exp_boxplot_sex ``` -We can combine further control the layout of the final composition with -`plot_layout` to create more complex layouts: +Nous pouvons contrôler davantage la mise en page de la composition finale avec +`plot_layout` pour créer des mises en page plus complexes : ```{r patchwork3, purl=TRUE} count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) @@ -961,7 +962,7 @@ count_gene_chromosome + plot_layout(ncol = 1) ``` -The last plot can also be created using the `|` and `/` composers: +Le dernier graphique peut également être créé à l'aide des opérateurs de composition `|` et `/` : ```{r patchwork5, purl=TRUE} count_gene_chromosome / @@ -969,12 +970,12 @@ count_gene_chromosome / exp_boxplot_sex ``` -Learn more about `patchwork` on its -[webpage](https://patchwork.data-imaginist.com/) or in this -[video](https://www.youtube.com/watch?v=0m4yywqNPVY). +Apprenez-en plus sur `patchwork` sur sa +[page Web](https://patchwork.data-imaginist.com/) ou dans cette +[vidéo](https://www.youtube.com/watch?v=0m4yywqNPVY).
-Another option is the **`gridExtra`** package that allows to combine -separate ggplots into a single figure using `grid.arrange()`: +Une autre option est le package **`gridExtra`** qui permet de combiner +des ggplots séparés en une seule figure en utilisant `grid.arrange()` : ```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} install.packages("gridExtra") @@ -985,22 +986,22 @@ library("gridExtra") grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) ``` -In addition to the `ncol` and `nrow` arguments, used to make simple -arrangements, there are tools for constructing more complex -layouts. +En plus des arguments `ncol` et `nrow`, utilisés pour créer des arrangements +simples, il existe des outils pour [construire des dispositions +plus complexes](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html). -## Exporting plots +## Exporter des graphiques -After creating your plot, you can save it to a file in your favorite -format. The Export tab in the **Plot** pane in RStudio will save your -plots at low resolution, which will not be accepted by many journals and -will not scale well for posters. +Après avoir créé votre graphique, vous pouvez l'enregistrer dans un fichier au format +de votre choix. L'onglet Export du volet **Plot** de RStudio enregistre vos graphiques +à basse résolution, ce qui ne sera pas accepté par de nombreuses revues et +ne s'adaptera pas bien aux affiches. -Instead, use the `ggsave()` function, which allows you easily change the -dimension and resolution of your plot by adjusting the appropriate -arguments (`width`, `height` and `dpi`). +Utilisez plutôt la fonction `ggsave()`, qui vous permet de modifier facilement les dimensions +et la résolution de votre graphique en ajustant les arguments +appropriés (`width`, `height` et `dpi`). -Make sure you have the `fig_output/` folder in your working directory. +Assurez-vous d'avoir le dossier `fig_output/` dans votre répertoire de travail.
```{r ggsave-example, eval=FALSE, purl=TRUE} my_plot <- ggplot(data = mean_exp_by_time_sex, @@ -1027,80 +1028,80 @@ ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, width = 10, dpi = 300) ``` -Note: The parameters `width` and `height` also determine the font size -in the saved plot. +Remarque : Les paramètres `width` et `height` déterminent également la taille de la police +dans le graphique enregistré. ```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} -### Final plotting challenge: -## With all of this information in hand, please take another five -## minutes to either improve one of the plots generated in this -## exercise or create a beautiful graph of your own. Use the RStudio -## ggplot2 cheat sheet for inspiration: -## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +### Défi final de tracé : +## Avec toutes ces informations en main, veuillez prendre encore cinq +## minutes pour améliorer l'un des graphiques générés dans cet +## exercice ou créer votre propre graphique. Utilisez l'aide-mémoire RStudio +## ggplot2 pour vous inspirer : +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf ``` -## Other packages for visualisation +## Autres packages pour la visualisation -`ggplot2` is a very powerful package that fits very nicely in our _tidy -data_ and _tidy tools_ pipeline. There are other visualization packages -in R that shouldn't be ignored. +`ggplot2` est un package très puissant qui s'intègre très bien dans notre pipeline _tidy +data_ et _tidy tools_. Il existe d'autres packages de visualisation +dans R qui ne doivent pas être ignorés. -### Base graphics +### Graphiques de base -The default graphics system that comes with R, often called _base R -graphics_ is simple and fast. It is based on the _painter's or canvas -model_, where different output are directly overlaid on top of each -other (see figure @ref(fig:paintermodel)).
This is a fundamental +Le système graphique par défaut fourni avec R, souvent appelé _base R +graphics_, est simple et rapide. Il est basé sur le _modèle du peintre ou de la +toile_, où les différentes sorties sont directement superposées les unes sur les +autres (voir figure @ref(fig:paintermodel)). This is a fundamental difference with `ggplot2` (and with `lattice`, described below), that returns dedicated objects, that are rendered on screen or in a file, and that can even be updated. ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "First layer, produced with plot(1:20)") +plot(1:20, main = "Première couche, produite avec plot(1:20)") -plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +plot(1:20, main = "Une ligne rouge horizontale, ajoutée avec abline(h = 10)") abline(h = 10, col = "red") -plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +plot(1:20, main = "Un rectangle, ajouté avec rect(5, 5, 15, 15)") abline(h = 10, col = "red") -rect(5, 5, 15, 15, lwd = 3) +rect(5, 5, 15, 15, lwd = 3) ``` -Another main difference is that base graphics' plotting function try to -do _the right_ thing based on their input type, i.e. they will adapt -their behaviour based on the class of their input. This is again very -different from what we have in `ggplot2`, that only accepts dataframes -as input, and that requires plots to be constructed bit by bit. +Une autre différence majeure est que les fonctions de traçage des graphiques de base essaient de +faire _ce qu'il faut_ en fonction du type de leur entrée, c'est-à-dire qu'elles adaptent +leur comportement à la classe de leur entrée. C'est encore une fois très +différent de ce que nous avons dans `ggplot2`, qui n'accepte que des data frames +en entrée, et qui nécessite que les graphiques soient construits petit à petit.
```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right)."} par(mfrow = c(2, 2)) boxplot(rnorm(100), - main = "Boxplot of rnorm(100)") -boxplot(matrix(rnorm(100), ncol = 10), - main = "Boxplot of matrix(rnorm(100), ncol = 10)") + main = "Boxplot de rnorm(100)") +boxplot(matrix(rnorm(100), ncol = 10), + main = "Boxplot de matrix(rnorm(100), ncol = 10)") hist(rnorm(100)) -hist(matrix(rnorm(100), ncol = 10)) +hist(matrix(rnorm(100), ncol = 10)) ``` -The out-of-the-box approach in base graphics can be very efficient for -simple, standard figures, that can be produced very quickly with a -single line of code and a single function such as `plot`, or `hist`, or -`boxplot`, ... The defaults are however not always the most appealing -and tuning of figures, especially when they become more complex (for -example to produce facets), can become lengthy and cumbersome. +L'approche prête à l'emploi des graphiques de base peut être très efficace pour +des figures simples et standards, qui peuvent être produites très rapidement avec +une seule ligne de code et une seule fonction telle que `plot`, ou `hist`, ou +`boxplot`, ... Les valeurs par défaut ne sont cependant pas toujours les plus attrayantes +et le réglage des figures, surtout lorsqu'elles deviennent plus complexes (par exemple +pour produire des facettes), peut devenir long et fastidieux. -### The lattice package +### Le package lattice -The **`lattice`** package is similar to `ggplot2` in that is uses -dataframes as input, returns graphical objects and supports faceting. -`lattice` however isn't based on the grammar of graphics and has a more -convoluted interface. +Le package **`lattice`** est similaire à `ggplot2` dans le sens où il utilise +des data frames en entrée, renvoie des objets graphiques et prend en charge le facettage.
+`treillis` cependant n'est pas basé sur la grammaire des graphiques et a une interface plus +alambiquée. -A good reference for the `lattice` package is @latticebook. +Une bonne référence pour le package `lattice` est @latticebook. -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: points clés -- Visualization in R +- Visualisation en R -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: From 50ee4cee0b0ce0b903fd622d79de88626937b2d5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:47 +0900 Subject: [PATCH 204/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 1087 +++++++++++------------ 1 file changed, 543 insertions(+), 544 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index 5500e95c3..aa94dd913 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -1,799 +1,798 @@ --- -source: Rmd -title: Data visualization +source: 放射科 +title: 数据可视化 teaching: 60 exercises: 60 --- ```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} -if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv") +如果(!file.exists(“data/rnaseq.csv”)) +下载.file(url = “https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv”, + 目标文件 = “data/rnaseq.csv”) ``` ::::::::::::::::::::::::::::::::::::::: objectives -- Produce scatter plots, boxplots, line plots, etc. using ggplot. -- Set universal plot settings. -- Describe what faceting is and apply faceting in ggplot. -- Modify the aesthetics of an existing ggplot plot (including axis labels and color). -- Build complex and customized plots from data in a data frame. 
+- 使用 ggplot 生成散点图、箱线图、线图等。 +- 设置通用绘图设置。 +- 描述什么是分面并在 ggplot 中应用分面。 +- 修改现有 ggplot 图的图形属性(包括轴标签和颜色)。 +- 根据数据框中的数据构建复杂且定制的图表。 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: questions +:::::::::::::::::::::::::::::::::::::::: questions -- Visualization in R +- R 中的可视化 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r vis_setup, echo=FALSE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -> This episode is based on the Data Carpentries's _Data Analysis and -> Visualisation in R for Ecologists_ lesson. +> 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 +> 可视化_课程。 -## Data Visualization +## 数据可视化 -We start by loading the required packages. **`ggplot2`** is included in -the **`tidyverse`** package. +我们首先加载所需的包。 **`ggplot2`** 包含在 +**`tidyverse`** 包中。 ```{r load-package, message=FALSE, purl=TRUE} -library("tidyverse") +library("tidyverse") ``` -If not still in the workspace, load the data we saved in the previous -lesson. +如果数据已不在工作区中,请加载我们在上一节 +课中保存的数据。 ```{r load-data, eval=FALSE, purl=TRUE} -rna <- read.csv("data/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` -The Data Visualization Cheat -Sheet -will cover the basics and more advanced features of `ggplot2` and will -help, in addition to serve as a reminder, getting an overview of the -many data representations available in the package. The following video -tutorials ([part 1](https://www.youtube.com/watch?v=h29g21z0a68) and -[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) by Thomas Lin Pedersen -are also very instructive.
+数据可视化秘籍 +表 +将涵盖 `ggplot2` 的基础知识和更高级的功能,并且除了作为提醒之外,还将 +帮助您概览 +包中可用的许多数据表示。 以下由 Thomas Lin Pedersen +制作的视频 +教程 ([第 1 部分](https://www.youtube.com/watch?v=h29g21z0a68) 和 +[2](https://www.youtube.com/watch?v=0m4yywqNPVY)) 也非常具有启发性。 -## Plotting with `ggplot2` +## 使用 `ggplot2` 绘图 -`ggplot2` is a plotting package that makes it simple to create complex -plots from data in a data frame. It provides a more programmatic -interface for specifying what variables to plot, how they are displayed, -and general visual properties. The theoretical foundation that supports -the `ggplot2` is the _Grammar of Graphics_ (@Wilkinson:2005). Using this -approach, we only need minimal changes if the underlying data change or -if we decide to change from a bar plot to a scatterplot. This helps in -creating publication quality plots with minimal amounts of adjustments -and tweaking. +`ggplot2` 是一个绘图包,可以很容易地从数据框中的数据创建复杂的 +图。 它提供了一个更具程序性的 +界面,用于指定要绘制的变量、如何显示它们、 +以及一般的视觉属性。 支持 +`ggplot2`的理论基础是_图形语法_(@Wilkinson:2005)。 使用这种 +方法,如果基础数据发生变化,我们只需要进行最少的更改;如果我们决定从条形图更改为散点图,则只需要进行 +更改。 这有助于 +以最少的调整 +和微调创建出版质量的图表。 -There is a book about `ggplot2` (@ggplot2book) that provides a good -overview, but it is outdated. The 3rd edition is in preparation and will -be [freely available online](https://ggplot2-book.org/). The `ggplot2` -webpage ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) provides ample documentation. +有一本关于“ggplot2”(@ggplot2book)的书提供了很好的 +概述,但它已经过时了。 第三版正在准备中, +将[免费在线提供](https://ggplot2-book.org/)。 `ggplot2` +网页 ([https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)) 提供了充足的文档。 -`ggplot2` functions like data in the 'long' format, i.e., a column for -every dimension, and a row for every observation. Well-structured data -will save you lots of time when making figures with `ggplot2`. +`ggplot2` 的功能类似于“长”格式的数据,即每个维度为 +一列,每个观察值为一行。 结构良好的数据 +将在使用“ggplot2”制作图形时为您节省大量时间。 -ggplot graphics are built step by step by adding new elements. 
Adding -layers in this fashion allows for extensive flexibility and -customization of plots. +ggplot 图形是通过添加新元素一步步构建的。 以这种方式添加 +层可以实现广泛的灵活性和 +图的定制。 -> The idea behind the Grammar of Graphics it is that you can build every -> graph from the same 3 components: (1) a data set, (2) a coordinate system, -> and (3) geoms — i.e. visual marks that represent data points \[^three\\_comp\\_ggplot2] +> 图形语法背后的想法是,你可以从相同的 3 个组件构建每个 +> 图形:(1) 数据集,(2) 坐标系, +> 和 (3) 几何对象 - 即表示数据点的视觉标记 \[^three\\_comp\\_ggplot2] -[^three_comp_ggplot2]: Source: [Data Visualization Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf). +[^three_comp_ggplot2]: 来源:[数据可视化备忘单](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf)。 -To build a ggplot, we will use the following basic template that can be -used for different types of plots: +为了构建 ggplot,我们将使用以下基本模板,该模板可以 +用于不同类型的绘图: ``` -ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +ggplot(数据 = <DATA>, 映射 = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() ``` -- use the `ggplot()` function and bind the plot to a specific **data - frame** using the `data` argument +- 使用 `ggplot()` 函数并使用 `data` 参数将图绑定到特定的 **data + 框架** ```{r, eval=FALSE} -ggplot(data = rna) +ggplot(数据 = rna) ``` -- define a **mapping** (using the aesthetic (`aes`) function), by - selecting the variables to be plotted and specifying how to present - them in the graph, e.g. as x/y positions or characteristics such as - size, shape, color, etc. +- 定义一个**映射**(使用美学(`aes`)函数),通过 + 选择要绘制的变量并指定如何在图形中呈现 + 它们,例如作为 x/y 位置或特征,如 + 大小、形状、颜色等。 ```{r, eval=FALSE} -ggplot(data = rna, mapping = aes(x = expression)) +ggplot(数据 = rna,映射 = aes(x = 表达式)) ``` -- add '**geoms**' - geometries, or graphical representations of the - data in the plot (points, lines, bars). 
`ggplot2` offers many - different geoms; we will use some common ones today, including: +- 添加'**geoms**' - 几何图形,或图中 + 数据的图形表示(点、线、条)。 `ggplot2` 提供许多 + 不同的几何对象;今天我们将使用一些常见的,包括: ``` - * `geom_point()` for scatter plots, dot plots, etc. - * `geom_histogram()` for histograms - * `geom_boxplot()` for, well, boxplots! - * `geom_line()` for trend lines, time series, etc. + * `geom_point()` 用于散点图、点图等。 + * `geom_histogram()` 用于直方图! + * `geom_boxplot()` 用于箱线图! + * `geom_line()` 用于趋势线、时间序列等。 ``` -To add a geom(etry) to the plot use the `+` operator. Let's use -`geom_histogram()` first: +要向图中添加几何图形,请使用“+”运算符。 让我们首先使用 +`geom_histogram()`: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, mapping = aes(x = expression)) + +ggplot(数据 = rna,映射 = aes(x = 表达式)) + geom_histogram() ``` -The `+` in the `ggplot2` package is particularly useful because it -allows you to modify existing `ggplot` objects. This means you can -easily set up plot templates and conveniently explore different types of -plots, so the above plot can also be generated with code like this: +`ggplot2` 包中的 `+` 特别有用,因为它 +允许您修改现有的 `ggplot` 对象。 这意味着您可以 +轻松设置绘图模板并方便地探索不同类型的 +绘图,因此上述绘图也可以使用如下代码生成: ```{r, eval=FALSE, purl=TRUE} -# Assign plot to a variable +# 将图分配给变量 rna_plot <- ggplot(data = rna, - mapping = aes(x = expression)) + map = aes(x = expression)) -# Draw the plot +# 绘制图 rna_plot + geom_histogram() ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -You have probably noticed an automatic message that appears when -drawing the histogram: +您可能已经注意到,在 +绘制直方图时会出现一条自动消息: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna, aes(x = expression)) + +ggplot(rna,aes(x = 表达式)) + geom_histogram() ``` -Change the arguments `bins` or `binwidth` of `geom_histogram()` to -change the number or width of the bins. 
+将 `geom_histogram()` 的参数 `bins` 或 `binwidth` 更改为 +以更改箱的数量或宽度。 ::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=TRUE} -# change bins +# 更改箱体 ggplot(rna, aes(x = expression)) + geom_histogram(bins = 15) -# change binwidth +# 更改箱宽 ggplot(rna, aes(x = expression)) + geom_histogram(binwidth = 2000) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -We can observe here that the data are skewed to the right. We can apply -log2 transformation to have a more symmetric distribution. Note that we -add here a small constant value (`+1`) to avoid having `-Inf` values -returned for expression values equal to 0. +我们可以在这里观察到数据向右倾斜。 我们可以应用 +log2 变换来获得更加对称的分布。 请注意,我们 +在这里添加一个小的常数值(`+1`)以避免当表达式值等于 0 时返回 `-Inf` 值 +。 ```{r log-transfo, cache=FALSE, purl=TRUE} rna <- rna %>% - mutate(expression_log = log2(expression + 1)) + 突变(expression_log = log2(expression + 1)) ``` -If we now draw the histogram of the log2-transformed expressions, the -distribution is indeed closer to a normal distribution. +如果我们现在绘制 log2 变换表达式的直方图, +分布确实更接近正态分布。 ```{r second-ggplot, cache=FALSE, purl=TRUE} -ggplot(rna, aes(x = expression_log)) + geom_histogram() +ggplot(rna,aes(x = expression_log)) + geom_histogram() ``` -From now on we will work on the log-transformed expression values. +从现在开始我们将研究对数转换的表达值。 -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Another way to visualize this transformation is to consider the scale -of the observations. For example, it may be worth changing the scale -of the axis to better distribute the observations in the space of the -plot. Changing the scale of the axes is done similarly to -adding/modifying other components (i.e., by incrementally adding -commands). 
Try making this modification: +可视化这种转变的另一种方法是考虑观测的尺度 +。 例如,可能值得改变轴的比例 +以便更好地在 +图的空间中分布观测值。 改变轴的比例与 +添加/修改其他组件类似(即通过逐步添加 +命令)。 尝试做这样的修改: -- Represent the un-transformed expression on the log10 scale; see - `scale_x_log10()`. Compare it with the previous graph. Why do you - now have warning messages appearing? +- 表示 log10 尺度上未转换的表达式;参见 + `scale_x_log10()`。 将其与之前的图表进行比较。 为什么 + 现在出现警告信息? -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, eval=TRUE, purl=TRUE, echo=TRUE} -ggplot(data = rna,mapping = aes(x = expression))+ +ggplot(数据 = rna,映射 = aes(x = 表达式))+ geom_histogram() + scale_x_log10() ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -**Notes** +**笔记** -- Anything you put in the `ggplot()` function can be seen by any geom - layers that you add (i.e., these are global plot settings). This - includes the x- and y-axis mapping you set up in `aes()`. -- You can also specify mappings for a given geom independently of the - mappings defined globally in the `ggplot()` function. -- The `+` sign used to add new layers must be placed at the end of the - line containing the _previous_ layer. If, instead, the `+` sign is - added at the beginning of the line containing the new layer, - `ggplot2` will not add the new layer and will return an error - message. +- 你在 `ggplot()` 函数中输入的任何内容都可以被你添加的任何 geom + 层看到(即,这些是全局绘图设置)。 这个 + 包括你在 `aes()` 中设置的 x 轴和 y 轴映射。 +- 您还可以独立于 `ggplot()` 函数中全局定义的 + 映射为给定的 geom 指定映射。 +- 用于添加新层的 `+` 符号必须放在包含_上一个_层的 + 行末尾。 相反,如果在包含新层的行首添加 `+` 符号 + , + `ggplot2` 将不会添加新层,并将返回错误 + 消息。 ```{r, eval=FALSE} -# This is the correct syntax for adding layers +# 这是添加层的正确语法 rna_plot + geom_histogram() -# This will not add the new layer and will return an error message +# 这不会添加新层并将返回错误消息 rna_plot + geom_histogram() ``` -## Building your plots iteratively +## 迭代构建你的图 -We will now draw a scatter plot with two continuous variables and the -`geom_point()` function. 
This graph will represent the log2 fold changes -of expression comparing time 8 versus time 0, and time 4 versus time 0. -To this end, we first need to compute the means of the log-transformed -expression values by gene and time, then the log fold changes by -subtracting the mean log expressions between time 8 and time 0 and -between time 4 and time 0. Note that we also include here the gene -biotype that we will use later on to represent the genes. We will save -the fold changes in a new data frame called `rna_fc.` +我们现在将绘制一个包含两个连续变量和 +`geom_point()` 函数的散点图。 该图将表示时间 8 与时间 0 以及时间 4 与时间 0 相比的表达的 log2 倍数变化 +。 +为此,我们首先需要计算基因和时间对数转换的 +表达值的平均值,然后通过 +减去时间 8 和时间 0 之间的平均对数表达值和 +减去时间 4 和时间 0 之间的平均对数表达值来计算对数倍数变化。 请注意,我们还在这里包括了基因 +生物型,我们稍后会用它来表示基因。 我们将把 +倍数变化保存在名为“rna_fc”的新数据框中。 ```{r rna_fc, cache=FALSE, purl=TRUE} -rna_fc <- rna %>% select(gene, time, - gene_biotype, expression_log) %>% - group_by(gene, time, gene_biotype) %>% - summarize(mean_exp = mean(expression_log)) %>% - pivot_wider(names_from = time, - values_from = mean_exp) %>% - mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) +rna_fc <- rna %>% 选择(基因,时间, + 基因生物型,表达日志)%>% + group_by(基因,时间,基因生物型)%>% + 总结(平均值表达式 = 平均值(表达日志))%>% + pivot_wider(names_from = 时间, + values_from = 平均值表达式)%>% + 突变(time_8_vs_0 = `8` - `0`,time_4_vs_0 = `4` - `0`) ``` -We can then build a ggplot with the newly created dataset `rna_fc`. -Building plots with `ggplot2` is typically an iterative process. We -start by defining the dataset we'll use, lay out the axes, and choose a -geom: +然后我们可以使用新创建的数据集“rna_fc”构建一个 ggplot。 +使用“ggplot2”构建图表通常是一个迭代过程。 我们 +首先定义我们将使用的数据集、布置轴,然后选择一个 +几何对象: ```{r create-ggplot-object, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + geom_point() ``` -Then, we start modifying this plot to extract more information from it. 
-For instance, we can add transparency (`alpha`) to avoid overplotting: +然后,我们开始修改这个图以从中提取更多信息。 +例如,我们可以添加透明度(“alpha”)以避免过度绘图: ```{r adding-transparency, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + geom_point(alpha = 0.3) ``` -We can also add colors for all the points: +我们还可以为所有点添加颜色: ```{r adding-colors, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + - geom_point(alpha = 0.3, color = "blue") +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + + geom_point(alpha = 0.3,颜色 = “蓝色”) ``` -Or to color each gene in the plot differently, you could use a vector as -an input to the argument **color**. `ggplot2` will provide a different -color corresponding to different values in the vector. Here is an -example where we color with `gene_biotype`: +或者为了给图中每个基因赋予不同的颜色,你可以使用一个向量作为 +参数 **color** 的输入。 `ggplot2` 将提供与向量中的不同值相对应的不同的 +颜色。 这是一个 +示例,我们用 `gene_biotype` 进行着色: ```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + - geom_point(alpha = 0.3, aes(color = gene_biotype)) +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + + geom_point(alpha = 0.3,aes(颜色 = gene_biotype)) ``` -We can also specify the colors directly inside the mapping provided in -the `ggplot()` function. This will be seen by any geom layers and the -mapping will be determined by the x- and y-axis set up in `aes()`. 
+我们还可以在 +`ggplot()`函数提供的映射中直接指定颜色。 任何 geom 层都可以看到这一点,并且 +映射将由 `aes()` 中设置的 x 轴和 y 轴决定。 ```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, color = gene_biotype)) + geom_point(alpha = 0.3) ``` -Finally, we could also add a diagonal line with the `geom_abline()` -function: +最后,我们还可以使用 `geom_abline()` +函数添加对角线: ```{r adding-diag, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype)) + +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, + 颜色 = gene_biotype)) + geom_point(alpha = 0.3) + - geom_abline(intercept = 0) + geom_abline(截距 = 0) ``` -Notice that we can change the geom layer from `geom_point` to -`geom_jitter` and colors will still be determined by `gene_biotype`. +请注意,我们可以将 geom 层从 `geom_point` 更改为 +`geom_jitter`,颜色仍由 `gene_biotype` 决定。 ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype)) + +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, + 颜色 = gene_biotype)) + geom_jitter(alpha = 0.3) + - geom_abline(intercept = 0) + geom_abline(截距 = 0) ``` ```{r, echo=FALSE, message=FALSE} -library("hexbin") +库(“hexbin”) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Scatter plots can be useful exploratory tools for small datasets. For -data sets with large numbers of observations, such as the `rna_fc` -data set, overplotting of points can be a limitation of scatter plots. -One strategy for handling such settings is to use hexagonal binning of -observations. The plot space is tessellated into hexagons. Each -hexagon is assigned a color based on the number of observations that -fall within its boundaries. 
+散点图可以成为小数据集的有用的探索工具。 对于具有大量观测值的 +数据集,例如 `rna_fc` +数据集,点的过度绘制可能是散点图的限制。 +处理此类设置的一种策略是使用 +观测值的六边形分箱。 地块空间被镶嵌成六边形。 每个 +六边形根据其边界内的 +观测值的数量被分配一种颜色。 -- To use hexagonal binning in `ggplot2`, first install the R package - `hexbin` from CRAN and load it. +- 要在 `ggplot2` 中使用六边形分箱,首先从 CRAN 安装 R 包 + `hexbin` 并加载它。 -- Then use the `geom_hex()` function to produce the hexbin figure. +- 然后使用“geom_hex()”函数生成六边形图。 -- What are the relative strengths and weaknesses of a hexagonal bin - plot compared to a scatter plot? Examine the above scatter plot - and compare it with the hexagonal bin plot that you created. +- 与散点图相比,六边形箱 + 图的相对优势和劣势是什么? 检查上述散点图 + 并将其与您创建的六边形箱图进行比较。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r, eval=FALSE, purl=TRUE} -install.packages("hexbin") +安装.包(“hexbin”) ``` ```{r, purl=TRUE} -library("hexbin") +库(“hexbin”) -ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + +ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + geom_hex() + - geom_abline(intercept = 0) + geom_abline(截距 = 0) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Use what you just learned to create a scatter plot of `expression_log` -over `sample` from the `rna` dataset with the time showing in -different colors. Is this a good way to show this type of data? +使用你刚刚学到的知识从“rna”数据集中创建一个“expression_log” +在“sample”上的散点图,其中时间以 +不同的颜色显示。 这是显示此类数据的好方法吗? 
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, eval=TRUE, purl=TRUE} -ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_point(aes(color = time)) +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Boxplot +## 箱线图 -We can use boxplots to visualize the distribution of gene expressions -within each sample: +我们可以使用箱线图来可视化每个样本内 +基因表达的分布: ```{r boxplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + geom_boxplot() ``` -By adding points to boxplot, we can have a better idea of the number of -measurements and of their distribution: +通过向箱线图添加点,我们可以更好地了解 +测量的数量及其分布: ```{r boxplot-with-points, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -Note how the boxplot layer is in front of the jitter layer? What do +请注意箱线图层是如何位于抖动图层前面的? What do you need to change in the code to put the boxplot below the points?
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 -We should switch the order of these two geoms: +我们应该交换这两个几何对象的顺序: ```{r boxplot-with-points2, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + geom_boxplot(alpha = 0) + - geom_jitter(alpha = 0.2, color = "tomato") + geom_jitter(alpha = 0.2, color = "tomato") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -You may notice that the values on the x-axis are still not properly -readable. Let's change the orientation of the labels and adjust them -vertically and horizontally so they don't overlap. You can use a -90-degree angle, or experiment to find the appropriate angle for -diagonally oriented labels: +您可能会注意到 x 轴上的值仍然难以 +阅读。 让我们改变标签的方向并垂直和水平调整它们 +以使它们不重叠。 您可以使用 +90 度角,或者通过试验找到适合 +斜向标签的角度: ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomato") + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -Add color to the data points on your boxplot according to the duration -of the infection (`time`). +根据感染的持续时间 +(`time`),为箱线图上的数据点添加颜色。 -_Hint:_ Check the class for `time`. Consider changing the class of -`time` from integer to factor directly in the ggplot mapping. Why does -this change how R makes the graph? +_提示:_ 检查 `time` 的类。 考虑直接在 ggplot 映射中将 +`time` 的类从整数更改为因子。 为什么 +这会改变 R 绘制图形的方式?
-::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r boxplot-color-time, cache=FALSE, purl=TRUE} -# time as integer +# 时间作为整数 ggplot(data = rna, - mapping = aes(y = expression_log, + mapping = aes(y = expression_log, x = sample)) + geom_jitter(alpha = 0.2, aes(color = time)) + geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) -# time as factor +# 时间作为因子 ggplot(data = rna, - mapping = aes(y = expression_log, + mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -Boxplots are useful summaries, but hide the _shape_ of the -distribution. For example, if the distribution is bimodal, we would -not see it in a boxplot. An alternative to the boxplot is the violin -plot, where the shape (of the density of points) is drawn. +箱线图是有用的摘要,但隐藏了 +分布的_形状_。 例如,如果分布是双峰的,我们 +在箱线图中就看不出来。 箱线图的替代方法是小提琴 +图,其中绘制了(点密度的)形状。 -- Replace the box plot with a violin plot; see `geom_violin()`. Fill - in the violins according to the time with the argument `fill`.
+- 用小提琴图代替箱线图;参见 `geom_violin()`。 使用参数 `fill` + 根据时间填充小提琴。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_violin(aes(fill = as.factor(time))) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -- Modify the violin plot to fill in the violins by `sex`. +- 修改小提琴图以按 `sex` 填充小提琴。 -::::::::::::::: solution +::::::::::::::: solution -## Solution +## 解决方案 ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(data = rna, - mapping = aes(y = expression_log, x = sample)) + - geom_violin(aes(fill = sex)) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Line plots +## 线图 -Let's calculate the mean expression per duration of the infection for -the 10 genes having the highest log fold changes comparing time 8 versus -time 0.
First, we need to select the genes and create a subset of `rna` -called `sub_rna` containing the 10 selected genes, then we need to group -the data and calculate the mean gene expression within each group: +让我们计算一下 +感染持续时间内的平均表达量,其中 10 个基因的对数倍数变化最高,比较时间 8 与 +时间 0。 首先,我们需要选择基因并创建一个 `rna` +的子集,称为 `sub_rna`,包含 10 个选定的基因,然后我们需要对数据进行分组 +并计算每个组内的平均基因表达: ```{r, purl=TRUE} -rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) +rna_fc <- rna_fc %>% 排列(desc(time_8_vs_0)) genes_selected <- rna_fc$gene[1:10] sub_rna <- rna %>% - filter(gene %in% genes_selected) + 过滤(基因 %in% genes_selected) mean_exp_by_time <- sub_rna %>% - group_by(gene,time) %>% - summarize(mean_exp = mean(expression_log)) + group_by(基因,时间) %>% + 总结(mean_exp = mean(expression_log)) mean_exp_by_time ``` -We can build the line plot with duration of the infection on the x-axis -and the mean expression on the y-axis: +我们可以绘制线图,x 轴为感染持续时间 +,y 轴为平均表达量: ```{r first-time-series, purl=TRUE} -ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + +ggplot(数据 = mean_exp_by_time,映射 = aes(x = 时间,y = mean_exp)) + geom_line() ``` -Unfortunately, this does not work because we plotted data for all the -genes together. 
We need to tell ggplot to draw a line for each gene by -modifying the aesthetic function to include `group = gene`: +不幸的是,这不起作用,因为我们将所有 +基因的数据绘制在一起。 我们需要告诉 ggplot 为每个基因画一条线,通过 +修改美学函数以包含 `group = gene`: ```{r time-series-by-gene, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, group = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + geom_line() ``` -We will be able to distinguish genes in the plot if we add colors (using -`color` also automatically groups the data): +如果我们添加颜色,我们将能够区分图中的基因(使用 +`color` 也会自动对数据进行分组): ```{r time-series-with-colors, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp, color = gene)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() ``` -## Faceting +## 分面 -`ggplot2` has a special technique called _faceting_ that allows the user -to split one plot into multiple (sub) plots based on a factor included -in the dataset. These different subplots inherit the same properties -(axes limits, ticks, ...) to facilitate their direct comparison. We will -use it to make a line plot across time for each gene: +`ggplot2` 有一种称为分面(_faceting_)的特殊技术,它允许用户 +根据数据集中包含的因子 +将一个图分成多个(子)图。 这些不同的子图继承了相同的属性 +(轴范围、刻度……),以便于直接比较。 我们将 +使用它为每个基因绘制一条跨时间的线图: ```{r first-facet, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~ gene) +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) ``` -Here both x- and y-axis have the same scale for all the subplots.
You -can change this default behavior by modifying `scales` in order to allow -a free scale for the y-axis: +这里,所有子图的 x 轴和 y 轴具有相同的比例。 您 +可以通过修改 `scales` 来更改此默认行为,以允许 +y 轴自由缩放: ```{r first-facet-scales, purl=TRUE} -ggplot(data = mean_exp_by_time, - mapping = aes(x = time, y = mean_exp)) + +ggplot(数据 = mean_exp_by_time, + 映射 = aes(x = 时间,y = mean_exp)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + facet_wrap(~基因,scales = "free_y") ``` -Now we would like to split the line in each plot by the sex of the mice. -To do that we need to calculate the mean expression in the data frame -grouped by `gene`, `time`, and `sex`: +现在我们想根据小鼠的性别来分割每个图中的线。 +为此,我们需要计算数据框 +中按“基因”、“时间”和“性别”分组的平均表达: ```{r data-facet-by-gene-and-sex, purl=TRUE} mean_exp_by_time_sex <- sub_rna %>% group_by(gene, time, sex) %>% - summarize(mean_exp = mean(expression_log)) + 总结(mean_exp = mean(expression_log)) mean_exp_by_time_sex ``` -We can now make the faceted plot by splitting further by sex using -`color` (within a single plot): +我们现在可以使用 +`color`(在单个图内)按性别进一步划分来制作分面图: ```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(数据 = mean_exp_by_time_sex, + 映射 = aes(x = 时间, y = mean_exp, 颜色 = 性别)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + facet_wrap(~ 基因, scales = "free_y") ``` -Usually plots with white background look more readable when printed. We -can set the background to white using the function `theme_bw()`. 
-Additionally, we can remove the grid: +通常,带有白色背景的图表在打印时看起来更易读。 我们 +可以使用函数“theme_bw()”将背景设置为白色。 +此外,我们可以删除网格: ```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(数据 = mean_exp_by_time_sex, + 映射 = aes(x = 时间,y = mean_exp,颜色 = 性别)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ 基因,scales = "free_y") + theme_bw() + - theme(panel.grid = element_blank()) + 主题(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Use what you just learned to create a plot that depicts how the -average expression of each chromosome changes through the duration of -infection. +使用你刚刚学到的知识创建一个图表,描绘在 +感染持续期间,每个染色体的 +平均表达如何变化。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 ```{r mean-exp-chromosome-time-series, purl=TRUE} mean_exp_by_chromosome <- rna %>% - group_by(chromosome_name, time) %>% - summarize(mean_exp = mean(expression_log)) + group_by(chromosome_name, time) %>% + 总结(mean_exp = mean(expression_log)) -ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, +ggplot(数据 = mean_exp_by_chromosome, 映射 = aes(x = 时间, y = mean_exp)) + geom_line() + - facet_wrap(~ chromosome_name, scales = "free_y") + facet_wrap(~ chromosome_name, scales = "free_y") ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -The `facet_wrap` geometry extracts plots into an arbitrary number of -dimensions to allow them to cleanly fit on one page. On the other hand, -the `facet_grid` geometry allows you to explicitly specify how you want -your plots to be arranged via formula notation (`rows ~ columns`; a `.` -can be used as a placeholder that indicates only one row or column). 
+`facet_wrap` 几何将图提取到任意数量的 +维度中,以使它们能够整齐地放在一页上。 另一方面, +`facet_grid` 几何允许您通过公式符号明确指定如何排列 +您的图表(`rows ~ columns`;`.` +可用作仅表示一行或一列的占位符)。 -Let's modify the previous plot to compare how the mean gene expression -of males and females has changed through time: +让我们修改之前的图来比较男性和女性的平均基因表达 +随时间的变化: ```{r mean-exp-time-facet-sex-rows, purl=TRUE} -# One column, facet by rows +# 一列,按行细分 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + + map = aes(x = time, y = mean_exp, color = gene)) + geom_line() + facet_grid(sex ~ .) ``` ```{r mean-exp-time-facet-sex-columns, purl=TRUE} -# One row, facet by column +# 一行,逐列 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = gene)) + + map = aes(x = time, y = mean_exp, color = gene)) + geom_line() + facet_grid(. ~ sex) ``` -## `ggplot2` themes +## `ggplot2` 主题 -In addition to `theme_bw()`, which changes the plot background to white, -`ggplot2` comes with several other themes which can be useful to quickly -change the look of your visualization. The complete list of themes is -available at [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html). -`theme_minimal()` and `theme_light()` are popular, and `theme_void()` -can be useful as a starting point to create a new hand-crafted theme. +除了将绘图背景更改为白色的 `theme_bw()` 之外, +`ggplot2` 还附带其他几个主题,可用于快速 +更改可视化的外观。 完整的主题列表在 +处可用,网址为 [https://ggplot2.tidyverse.org/reference/ggtheme.html](https://ggplot2.tidyverse.org/reference/ggtheme.html)。 +`theme_minimal()` 和 `theme_light()` 很流行,而 `theme_void()` +可以作为创建新手工主题的起点。 -The [ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) -package provides a wide variety of options (including an Excel 2003 -theme). The ggplot2 provides a list of -packages that extend the capabilities of `ggplot2`, including additional -themes. 
+[ggthemes](https://jrnold.github.io/ggthemes/reference/index.html) +包提供了各种各样的选项(包括 Excel 2003 +主题)。 ggplot2 提供了扩展 `ggplot2` 功能的 +软件包列表,包括额外的 +主题。 -## Customisation +## 定制 -Let's come back to the faceted plot of mean expression by time and gene, -colored by sex. +让我们回到按时间和基因划分的平均表达的多面图, +按性别着色。 -Take a look at the ggplot2, -and think of ways you could improve the plot. +查看 ggplot2, +并思考如何改进图表。 -Now, we can change names of axes to something more informative than -'time' and 'mean\_exp', and add a title to the figure: +现在,我们可以将轴的名称更改为比 +'time' 和 'mean\_exp' 更具信息量的名称,并为图形添加标题: ```{r mean_exp-time-with-right-labels, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + labs(title = "按感染持续时间划分的平均基因表达", + x = "感染持续时间(天)", + y = "平均基因表达") ``` -The axes have more informative names, but their readability can be -improved by increasing the font size: +轴具有更多信息名称,但可以通过增加字体大小来提高其可读性: ```{r mean_exp-time-with-right-labels-xfont-size, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + + labs(title = "按感染持续时间划分的平均基因表达", + x = "感染持续时间(天)", + y = "平均基因表达") + theme(text = element_text(size = 16)) ``` -Note that it is also possible to change the fonts of your plots. If you -are on Windows, you may have to install the . 
+请注意,您也可以更改图表的字体。 如果你 +使用的是 Windows,则可能必须安装 。 -We can further customize the color of x- and y-axis text, the color of -the grid, etc. We can also for example move the legend to the top by -setting `legend.position` to `"top"`. +我们可以进一步自定义 x 轴和 y 轴文本的颜色、 +网格的颜色等。 例如,我们还可以通过 +将“legend.position”设置为“top”将图例移动到顶部。 ```{r mean_exp-time-with-theme, cache=FALSE, purl=TRUE} ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + + labs(title = "按感染持续时间划分的平均基因表达", + x = "感染持续时间(天)", + y = "平均基因表达") + theme(text = element_text(size = 16), - axis.text.x = element_text(colour = "royalblue4", size = 12), - axis.text.y = element_text(colour = "royalblue4", size = 12), + axis.text.x = element_text(colour = “royalblue4”, size = 12), + axis.text.y = element_text(colour = “royalblue4”, size = 12), panel.grid = element_line(colour="lightsteelblue1"), - legend.position = "top") + legend.position = “top”) ``` -If you like the changes you created better than the default theme, you -can save them as an object to be able to easily apply them to other -plots you may create. Here is an example with the histogram we have -previously created. 
+如果您比默认主题更喜欢您所做的更改,您可以 +将它们保存为对象,以便能够轻松地将它们应用到您可能创建的其他 +图中。 下面是我们之前创建的 +直方图的一个例子。 ```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", @@ -808,150 +807,150 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -With all of this information in hand, please take another five minutes -to either improve one of the plots generated in this exercise or -create a beautiful graph of your own. Use the RStudio ggplot2 -for inspiration. Here are some ideas: +掌握了所有这些信息后,请再花五分钟 +来改进本练习中生成的其中一个图表或 +创建您自己的精美图表。 使用 RStudio ggplot2 +获取灵感。 以下是一些想法: -- See if you can change the thickness of the lines. -- Can you find a way to change the name of the legend? What about - its labels? (hint: look for a ggplot function starting with - `scale_`) -- Try using a different color palette or manually specifying the - colors for the lines (see - [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/)). +- 看看是否可以改变线条的粗细。 +- 你能找到办法改变传奇的名字吗? 那么 + 它的标签怎么样? 
(提示:寻找以 + `scale_` 开头的 ggplot 函数) +- 尝试使用不同的调色板或手动指定线条的 + 颜色(参见 + [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/))。 -::::::::::::::: solution +::::::::::::::: 解决方案 -## Solution +## 解决方案 -For example, based on this plot: +例如,基于此图: ```{r, purl=TRUE} -ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + +ggplot(数据 = mean_exp_by_time_sex, + 映射 = aes(x = 时间,y = mean_exp,颜色 = 性别)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + + facet_wrap(~ 基因,scales = "free_y") + theme_bw() + - theme(panel.grid = element_blank()) + 主题(panel.grid = element_blank()) ``` -We can customize it the following ways: +我们可以通过以下方式定制它: ```{r, purl=TRUE} -# change the thickness of the lines +# 更改线条的粗细 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + map = aes(x = time, y = mean_exp, color = sex)) + geom_line(size=1.5) + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) -# change the name of the legend and the labels +# 更改图例和标签的名称 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + map = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + scale_color_discrete(name = "Gender", labels = c("F", "M")) -# using a different color palette +# 使用不同的调色板 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + map = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") -# manually specifying the colors +# 手动指定颜色 ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)) + + map = aes(x = time, y = mean_exp, color = sex)) + geom_line() + 
facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_manual(name = "Gender", labels = c("F", "M"), - values = c("royalblue", "deeppink")) + scale_color_manual(name = "性别", 标签 = c("F", "M"), + 值 = c("royalblue", "deeppink")) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Composing plots +## 创作情节 -Faceting is a great tool for splitting one plot into multiple subplots, -but sometimes you may want to produce a single figure that contains -multiple independent plots, i.e. plots that are based on different -variables or even different data frames. +分面是一种很好的工具,可以将一个图分割成多个子图, +但有时您可能想要生成一个包含 +多个独立图的单个图形,即基于不同 +变量甚至不同数据框的图。 -Let's start by creating the two plots that we want to arrange next to -each other: +让我们首先创建两个想要排列在 +旁边的图: -The first graph counts the number of unique genes per chromosome. We -first need to reorder the levels of `chromosome_name` and filter the -unique genes per chromosome. We also change the scale of the y-axis to a -log10 scale for better readability. +第一个图表计算了每个染色体上独特基因的数量。 我们 +首先需要重新排序 `chromosome_name` 的级别,并过滤每个染色体上的 +个独特基因。 我们还将 y 轴的比例更改为 +log10 比例,以提高可读性。 ```{r sub1, purl=TRUE} -rna$chromosome_name <- factor(rna$chromosome_name, - levels = c(1:19,"X","Y")) +rna$chromosome_name <- 因子 (rna$chromosome_name, + 水平 = c (1:19,"X","Y")) -count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% +count_gene_chromosome <- rna %>% 选择 (chromosome_name, 基因) %>% distinct() %>% ggplot() + - geom_bar(aes(x = chromosome_name), fill = "seagreen", - position = "dodge", stat = "count") + - labs(y = "log10(n genes)", x = "chromosome") + + geom_bar(aes(x = chromosome_name), 填充 = "seagreen", + 位置 = "dodge", stat = "count") + + 实验室 (y = "log10(n 基因)", x = "染色体") + scale_y_log10() count_gene_chromosome ``` -Below, we also remove the legend altogether by setting the -`legend.position` to `"none"`. 
+下面,我们还通过将 +`legend.position` 设置为 `"none"` 来完全删除图例。 ```{r sub2, purl=TRUE} exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), color=sex)) + geom_boxplot(alpha = 0) + - labs(y = "Mean gene exp", - x = "time") + theme(legend.position = "none") + labs(y = "平均基因表达", + x = "时间") + theme(legend.position = "none") exp_boxplot_sex ``` -The [**patchwork**](https://github.com/thomasp85/patchwork) package -provides an elegant approach to combining figures using the `+` to -arrange figures (typically side by side). More specifically the `|` -explicitly arranges them side by side and `/` stacks them on top of each -other. +[**patchwork**](https://github.com/thomasp85/patchwork) 包 +提供了一种优雅的方法来组合图形,使用 `+` 来 +排列图形(通常是并排)。 更具体地说,`|` +明确地将它们并排排列,而 `/` 将它们上下 +堆叠。 ```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} -install.packages("patchwork") +install.packages("patchwork") ``` ```{r patchworkplot1, purl=TRUE} -library("patchwork") +library("patchwork") count_gene_chromosome + exp_boxplot_sex -## or count_gene_chromosome | exp_boxplot_sex +## 或 count_gene_chromosome | exp_boxplot_sex ``` ```{r patchwork2, purl=TRUE} -count_gene_chromosome / exp_boxplot_sex +count_gene_chromosome / exp_boxplot_sex ``` -We can combine further control the layout of the final composition with -`plot_layout` to create more complex layouts: +我们可以结合 +`plot_layout` 进一步控制最终构图的布局,以创建更复杂的布局: ```{r patchwork3, purl=TRUE} -count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) +count_gene_chromosome + exp_boxplot_sex + plot_layout(ncol = 1) ``` ```{r patchwork4, purl=TRUE} @@ -961,7 +960,7 @@ count_gene_chromosome + plot_layout(ncol = 1) ``` -The last plot can also be created using the `|` and `/` composers: +最后一个图也可以使用 `|` 和 `/` 组合器来创建: ```{r patchwork5, purl=TRUE} count_gene_chromosome / @@ -969,50 +968,50 @@ count_gene_chromosome / exp_boxplot_sex ``` -Learn more about `patchwork` on its -[webpage](https://patchwork.data-imaginist.com/) or in this -[video](https://www.youtube.com/watch?v=0m4yywqNPVY).
+了解有关 `patchwork` 的更多信息,请访问其 +[网页](https://patchwork.data-imaginist.com/) 或此 +[视频](https://www.youtube.com/watch?v=0m4yywqNPVY)。 -Another option is the **`gridExtra`** package that allows to combine -separate ggplots into a single figure using `grid.arrange()`: +另一个选项是 **`gridExtra`** 包,它允许使用 `grid.arrange()` 将 +个单独的 ggplots 组合成一个图形: ```{r install-gridextra, message=FALSE, eval=FALSE, purl=TRUE} -install.packages("gridExtra") +安装.包(“gridExtra”) ``` ```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} -library("gridExtra") -grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) +库(“gridExtra”) +grid.arrange(count_gene_chromosome,exp_boxplot_sex,ncol = 2) ``` -In addition to the `ncol` and `nrow` arguments, used to make simple -arrangements, there are tools for constructing more complex -layouts. +除了用于进行简单 +排列的 `ncol` 和 `nrow` 参数之外,还有用于 构建更复杂的 +布局 的工具。 -## Exporting plots +## 导出地块 -After creating your plot, you can save it to a file in your favorite -format. The Export tab in the **Plot** pane in RStudio will save your -plots at low resolution, which will not be accepted by many journals and -will not scale well for posters. +创建图表后,您可以将其保存为您喜欢的 +格式的文件。 RStudio 中 **Plot** 窗格中的“导出”选项卡将以低分辨率保存您的 +图,这种图不会被许多期刊接受,并且 +不适合作为海报缩放。 -Instead, use the `ggsave()` function, which allows you easily change the -dimension and resolution of your plot by adjusting the appropriate -arguments (`width`, `height` and `dpi`). +相反,使用 `ggsave()` 函数,它允许您通过调整适当的 +参数(`width`、`height` 和 `dpi`)轻松更改图的 +维度和分辨率。 -Make sure you have the `fig_output/` folder in your working directory. 
+确保你的工作目录中有“fig_output/”文件夹。 ```{r ggsave-example, eval=FALSE, purl=TRUE} my_plot <- ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ gene, scales = "free_y") + - labs(title = "Mean gene expression by duration of the infection", - x = "Duration of the infection (in days)", - y = "Mean gene expression") + - guides(color=guide_legend(title="Gender")) + + facet_wrap(~gene, scales = "free_y") + + labs(title = "按感染持续时间划分的平均基因表达", + x = "感染持续时间(天)", + y = "平均基因表达") + + guides(color=guide_legend(title="性别")) + theme_bw() + - theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + theme(axis.text.x = element_text(colour = "royalblue4",size = 12), axis.text.y = element_text(colour = "royalblue4", size = 12), text = element_text(size = 16), panel.grid = element_line(colour="lightsteelblue1"), @@ -1020,87 +1019,87 @@ my_plot <- ggplot(data = mean_exp_by_time_sex, ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, height = 10) -# This also works for grid.arrange() plots +# 这也适用于 grid.arrange() 图 combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, - ncol = 2, widths = c(4, 6)) -ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, - width = 10, dpi = 300) + ncol = 2, widths = c(4, 6)) +ggsave(“fig_output/combo_plot_chromosome_sex.png”,combo_plot, + width = 10,dpi = 300) ``` -Note: The parameters `width` and `height` also determine the font size -in the saved plot. +注意:参数“width”和“height”也决定了保存的图中的字体大小 +。 ```{r final-challenge, eval=FALSE, purl=TRUE, echo=FALSE} -### Final plotting challenge: -## With all of this information in hand, please take another five -## minutes to either improve one of the plots generated in this -## exercise or create a beautiful graph of your own. 
Use the RStudio -## ggplot2 cheat sheet for inspiration: -## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf +### 最终绘图挑战: +## 掌握所有这些信息后,请再花五 +## 分钟来改进此 +## 练习中生成的图表之一,或创建您自己的精美图表。使用 RStudio +## ggplot2 备忘单获取灵感: +## https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf ``` -## Other packages for visualisation +## 其他可视化包 -`ggplot2` is a very powerful package that fits very nicely in our _tidy -data_ and _tidy tools_ pipeline. There are other visualization packages -in R that shouldn't be ignored. +`ggplot2` 是一个非常强大的包,非常适合我们的 _tidy +数据_ 和 _tidy 工具_ 管道。 R 中还有其他可视化包 +不容忽视。 -### Base graphics +### 基础图形 -The default graphics system that comes with R, often called _base R -graphics_ is simple and fast. It is based on the _painter's or canvas -model_, where different output are directly overlaid on top of each -other (see figure @ref(fig:paintermodel)). This is a fundamental -difference with `ggplot2` (and with `lattice`, described below), that -returns dedicated objects, that are rendered on screen or in a file, and -that can even be updated. +R 自带的默认图形系统通常称为 _base R +graphics_ ,简单而快速。 它基于_画家或画布 +模型_,其中不同的输出直接叠加在每个 +其他输出之上(参见图@ref(fig:paintermodel))。 这是与 `ggplot2`(以及下面描述的 `lattice`)的一个根本的 +区别,即 +返回在屏幕或文件中呈现的专用对象,以及 +甚至可以更新。 ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "First layer, produced with plot(1:20)") +plot(1:20, main = "第一层,用 plot(1:20) 制作") -plot(1:20, main = "A horizontal red line, added with abline(h = 10)") +plot(1:20, main = "一条水平红线,用 abline(h = 10) 添加") abline(h = 10, col = "red") -plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +plot(1:20, main = "一个矩形,用 rect(5, 5, 15, 15) 添加") abline(h = 10, col = "red") rect(5, 5, 15, 15, lwd = 3) ``` -Another main difference is that base graphics' plotting function try to -do _the right_ thing based on their input type, i.e. 
they will adapt -their behaviour based on the class of their input. This is again very -different from what we have in `ggplot2`, that only accepts dataframes -as input, and that requires plots to be constructed bit by bit. +另一个主要区别是,基本图形的绘图功能会尝试根据其输入类型 +做_正确_的事情,即,它们将根据其输入的类别调整 +其行为。 这与我们在 `ggplot2` 中所用的又非常 +不同,它仅接受数据框 +作为输入,并且需要一点一点地构建图表。 ```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right)."} par(mfrow = c(2, 2)) boxplot(rnorm(100), - main = "Boxplot of rnorm(100)") + main = "rnorm(100) 的箱线图") boxplot(matrix(rnorm(100), ncol = 10), - main = "Boxplot of matrix(rnorm(100), ncol = 10)") + main = "matrix(rnorm(100), ncol = 10) 的箱线图") hist(rnorm(100)) hist(matrix(rnorm(100), ncol = 10)) ``` -The out-of-the-box approach in base graphics can be very efficient for -simple, standard figures, that can be produced very quickly with a -single line of code and a single function such as `plot`, or `hist`, or -`boxplot`, ... The defaults are however not always the most appealing -and tuning of figures, especially when they become more complex (for -example to produce facets), can become lengthy and cumbersome. +基础图形中的开箱即用方法对于 +简单标准图形非常有效,可以使用 +单行代码和一个函数(例如`plot`,`hist`,或 +`boxplot`)非常快速地生成... 然而,默认值并不总是最吸引人的 +,并且图形的调整,尤其是当它们变得更加复杂时(例如 +产生方面),可能会变得冗长而繁琐。 -### The lattice package +### lattice 包 -The **`lattice`** package is similar to `ggplot2` in that is uses -dataframes as input, returns graphical objects and supports faceting. +**`lattice`** 包与 `ggplot2` 类似,它使用 +数据框作为输入,返回图形对象并支持分面。 `lattice` however isn't based on the grammar of graphics and has a more convoluted interface. -A good reference for the `lattice` package is @latticebook. 
+`lattice` 包的一个很好的参考是@latticebook。 -:::::::::::::::::::::::::::::::::::::::: keypoints +:::::::::::::::::::::::::::::::::::::::: 关键点 -- Visualization in R +- R 中的可视化 -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: From 752bce97af5fc45494af563d1170ebddbdcfed9a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:49 +0900 Subject: [PATCH 205/334] New translations 60-next-steps.md (French) --- locale/fr/episodes/60-next-steps.Rmd | 340 +++++++++++++-------------- 1 file changed, 169 insertions(+), 171 deletions(-) diff --git a/locale/fr/episodes/60-next-steps.Rmd b/locale/fr/episodes/60-next-steps.Rmd index 89511b1ab..99250f094 100644 --- a/locale/fr/episodes/60-next-steps.Rmd +++ b/locale/fr/episodes/60-next-steps.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Next steps +title: Prochaines étapes teaching: 45 exercises: 45 --- @@ -10,86 +10,86 @@ exercises: 45 ::::::::::::::::::::::::::::::::::::::: objectives -- Introduce the Bioconductor project. -- Introduce the notion of data containers. -- Give an overview of the `SummarizedExperiment`, extensively used in - omics analyses. +- Présentez le projet Bioconducteur. +- Introduire la notion de conteneurs de données. +- Donnez un aperçu du `SummarizedExperiment`, largement utilisé dans les analyses + omiques. :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a `SummarizedExperiment`? -- What is Bioconductor? +- Qu'est-ce qu'une « expérience résumée » ? +- Qu’est-ce qu’un bioconducteur ? -:::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::: -## Next steps +## Prochaines étapes ```{r, echo=FALSE, message=FALSE} -library("tidyverse") +bibliothèque("tidyverse") ``` -Data in bioinformatics is often complex. 
To deal with this, -developers define specialised data containers (termed classes) that -match the properties of the data they need to handle. +Les données en bioinformatique sont souvent complexes. Pour résoudre ce problème, les développeurs +définissent des conteneurs de données spécialisés (appelés classes) qui +correspondent aux propriétés des données qu'ils doivent gérer. -This aspect is central to the **Bioconductor**[^Bioconductor] project -which uses the same **core data infrastructure** across packages. This -certainly contributed to Bioconductor's success. Bioconductor package -developers are advised to make use of existing infrastructure to -provide coherence, interoperability, and stability to the project as a -whole. +Cet aspect est au cœur du projet **Bioconductor**[^Bioconductor] +qui utilise la même **infrastructure de données de base** dans tous les packages. Ce +a certainement contribué au succès de Bioconductor. Il est conseillé aux développeurs du package Bioconductor +d'utiliser l'infrastructure existante pour +assurer la cohérence, l'interopérabilité et la stabilité du projet dans son ensemble +. -[^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was - initiated by Robert Gentleman, one of the two creators of the R - language. Bioconductor provides tools dedicated to omics data - analysis. Bioconductor uses the R statistical programming language - and is open source and open development. +[^Bioconductor]: Le [Bioconductor](https://www.bioconductor.org) a été + initié par Robert Gentleman, l'un des deux créateurs du langage R + . Bioconductor fournit des outils dédiés à l'analyse des données omiques + . Bioconductor utilise le langage de programmation statistique R + et est open source et développement ouvert. -To illustrate such an omics data container, we'll present the -`SummarizedExperiment` class. +Pour illustrer un tel conteneur de données omiques, nous présenterons la classe +`SummarizedExperiment`. 
-## SummarizedExperiment +## Expérience résumée -The figure below represents the anatomy of the SummarizedExperiment class. +La figure ci-dessous représente l’anatomie de la classe SummarizedExperiment. ```{r SE, echo=FALSE, out.width="80%"} knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") ``` -Objects of the class SummarizedExperiment contain : +Les objets de la classe SummarizedExperiment contiennent : -- **One (or more) assay(s)** containing the quantitative omics data - (expression data), stored as a matrix-like object. Features (genes, - transcripts, proteins, ...) are defined along the rows, and samples - along the columns. +- **Un (ou plusieurs) test(s)** contenant les données omiques quantitatives + (données d'expression), stockées sous forme d'objet de type matriciel. Caractéristiques (gènes, + transcrits, protéines, ...) sont définis le long des lignes, et les échantillons + le long des colonnes. -- A **sample metadata** slot containing sample co-variates, stored as a - data frame. Rows from this table represent samples (rows match exactly the - columns of the expression data). +- Un emplacement **exemple de métadonnées** contenant des exemples de covariables, stocké sous forme de trame de données + . Les lignes de ce tableau représentent des échantillons (les lignes correspondent exactement aux colonnes + des données d'expression). -- A **feature metadata** slot containing feature co-variates, stored as - a data frame. The rows of this data frame match exactly the rows of the - expression data. +- Un emplacement de **métadonnées de fonctionnalité** contenant des covariables de fonctionnalité, stockées sous forme de + une trame de données. Les lignes de ce bloc de données correspondent exactement aux lignes des données d'expression + . 
-The coordinated nature of the `SummarizedExperiment` guarantees that -during data manipulation, the dimensions of the different slots will -always match (i.e the columns in the expression data and then rows in -the sample metadata, as well as the rows in the expression data and -feature metadata) during data manipulation. For example, if we had to -exclude one sample from the assay, it would be automatically removed -from the sample metadata in the same operation. +La nature coordonnée du `SummarizedExperiment` garantit que +lors de la manipulation des données, les dimensions des différents emplacements seront toujours +(c'est-à-dire les colonnes des données d'expression puis les lignes de +les exemples de métadonnées, ainsi que les lignes des données d'expression et les +métadonnées des fonctionnalités) lors de la manipulation des données. Par exemple, si nous devions +exclure un échantillon du test, il serait automatiquement supprimé +des métadonnées de l’échantillon au cours de la même opération. -The metadata slots can grow additional co-variates -(columns) without affecting the other structures. +Les emplacements de métadonnées peuvent développer des co-variables supplémentaires +(colonnes) sans affecter les autres structures. -### Creating a SummarizedExperiment +### Création d'une expérience résumée -In order to create a `SummarizedExperiment`, we will create the -individual components, i.e the count matrix, the sample and gene -metadata from csv files. These are typically how RNA-Seq data are -provided (after raw data have been processed). +Afin de créer un `SummarizedExperiment`, nous allons créer les +composants individuels, c'est-à-dire la matrice de comptage, l'échantillon et le gène +métadonnées à partir de fichiers csv. C'est généralement ainsi que les données RNA-Seq sont +fournies (après le traitement des données brutes). 
```{r, echo=FALSE, message=FALSE} rna <- read_csv("data/rnaseq.csv") @@ -126,10 +126,10 @@ write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) ``` -- **An expression matrix**: we load the count matrix, specifying that - the first columns contains row/gene names, and convert the - `data.frame` to a `matrix`. You can download it - [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). +- **Une matrice d'expression** : nous chargeons la matrice de comptage, en spécifiant que + les premières colonnes contiennent des noms de lignes/gènes, et convertissons le + `data.frame` en une `matrice`. Vous pouvez le télécharger + [ici](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). ```{r} count_matrix <- read.csv("data/count_matrix.csv", @@ -140,8 +140,8 @@ count_matrix[1:5, ] dim(count_matrix) ``` -- **A table describing the samples**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). +- **Un tableau décrivant les échantillons**, disponible + [ici](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). ```{r} sample_metadata <- read.csv("data/sample_metadata.csv") @@ -149,8 +149,8 @@ sample_metadata dim(sample_metadata) ``` -- **A table describing the genes**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). +- **Un tableau décrivant les gènes**, disponible + [ici](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). 
```{r} gene_metadata <- read.csv("data/gene_metadata.csv") @@ -158,27 +158,27 @@ gene_metadata[1:10, 1:4] dim(gene_metadata) ``` -We will create a `SummarizedExperiment` from these tables: +Nous allons créer un `SummarizedExperiment` à partir de ces tables : -- The count matrix that will be used as the **`assay`** +- La matrice de comptage qui sera utilisée comme **`essai`** -- The table describing the samples will be used as the **sample - metadata** slot +- Le tableau décrivant les échantillons sera utilisé comme emplacement de métadonnées \*\*sample + \*\* -- The table describing the genes will be used as the **features - metadata** slot +- Le tableau décrivant les gènes sera utilisé comme emplacement de métadonnées \*\*features + \*\* -To do this we can put the different parts together using the -`SummarizedExperiment` constructor: +Pour ce faire, nous pouvons assembler les différentes parties à l'aide du constructeur +`SummarizedExperiment` : ```{r, message=FALSE, warning=FALSE} ## BiocManager::install("SummarizedExperiment") -library("SummarizedExperiment") +bibliothèque("SummarizedExperiment") ``` -First, we make sure that the samples are in the same order in the -count matrix and the sample annotation, and the same for the genes in -the count matrix and the gene annotation. +Tout d’abord, nous nous assurons que les échantillons sont dans le même ordre dans la matrice de comptage +et l’annotation d’échantillon, et il en va de même pour les gènes dans +la matrice de comptage et l’annotation des gènes. ```{r} stopifnot(rownames(count_matrix) == gene_metadata$gene) @@ -192,20 +192,20 @@ se <- SummarizedExperiment(assays = list(counts = count_matrix), se ``` -### Saving data +### La sauvegarde des données -Exporting data to a spreadsheet, as we did in a previous episode, has -several limitations, such as those described in the first chapter -(possible inconsistencies with `,` and `.` for decimal separators and -lack of variable type definitions). 
Furthermore, exporting data to a -spreadsheet is only relevant for rectangular data such as dataframes -and matrices. +L'export de données vers un tableur, comme nous l'avons fait dans un épisode précédent, présente +plusieurs limitations, comme celles décrites dans le premier chapitre +(éventuelles incohérences avec `,` et `.` pour les séparateurs décimaux et +manque de définitions de types de variables). De plus, l'exportation de données vers une feuille de calcul +n'est pertinente que pour les données rectangulaires telles que les dataframes +et les matrices. -A more general way to save data, that is specific to R and is -guaranteed to work on any operating system, is to use the `saveRDS` -function. Saving objects like this will generate a binary -representation on disk (using the `rds` file extension here), which -can be loaded back into R using the `readRDS` function. +Une manière plus générale de sauvegarder des données, spécifique à R et dont le fonctionnement est +garanti sur n'importe quel système d'exploitation, consiste à utiliser la fonction `saveRDS` +. L'enregistrement d'objets comme celui-ci générera une représentation binaire +sur le disque (en utilisant l'extension de fichier `rds` ici), qui +peut être rechargée dans R à l'aide de la fonction `readRDS`. ```{r, eval=FALSE} saveRDS(se, file = "data_output/se.rds") @@ -214,41 +214,41 @@ se <- readRDS("data_output/se.rds") head(se) ``` -To conclude, when it comes to saving data from R that will be loaded -again in R, saving and loading with `saveRDS` and `readRDS` is the -preferred approach. If tabular data need to be shared with somebody -that is not using R, then exporting to a text-based spreadsheet is a -good alternative. +Pour conclure, lorsqu'il s'agit de sauvegarder des données de R qui seront chargées +à nouveau dans R, la sauvegarde et le chargement avec `saveRDS` et `readRDS` sont l'approche +préférée. 
Si les données tabulaires doivent être partagées avec quelqu'un +qui n'utilise pas R, alors l'exportation vers une feuille de calcul textuelle est une +bonne alternative. -Using this data structure, we can access the expression matrix with -the `assay` function: +En utilisant cette structure de données, nous pouvons accéder à la matrice d'expression avec +la fonction `assay` : ```{r} -head(assay(se)) -dim(assay(se)) +head(essai(se)) +dim(essai(se)) ``` -We can access the sample metadata using the `colData` function: +Nous pouvons accéder aux exemples de métadonnées à l'aide de la fonction `colData` : ```{r} colData(se) dim(colData(se)) ``` -We can also access the feature metadata using the `rowData` function: +Nous pouvons également accéder aux métadonnées des fonctionnalités à l'aide de la fonction `rowData` : ```{r} head(rowData(se)) dim(rowData(se)) ``` -### Subsetting a SummarizedExperiment +### Sous-ensemble d'une expérience résumée -SummarizedExperiment can be subset just like with data frames, with -numerics or with characters of logicals. +SummarizedExperiment peut être un sous-ensemble comme avec des trames de données, avec des chiffres +ou avec des caractères logiques. -Below, we create a new instance of class SummarizedExperiment that -contains only the 5 first features for the 3 first samples. +Ci-dessous, nous créons une nouvelle instance de la classe SummarizedExperiment qui +contient uniquement les 5 premières fonctionnalités pour les 3 premiers échantillons. ```{r} se1 <- se[1:5, 1:3] @@ -260,10 +260,10 @@ colData(se1) rowData(se1) ``` -We can also use the `colData()` function to subset on something from -the sample metadata or the `rowData()` to subset on something from the -feature metadata. 
For example, here we keep only miRNAs and the non -infected samples: +Nous pouvons également utiliser la fonction `colData()` pour créer un sous-ensemble sur quelque chose de +les exemples de métadonnées ou la fonction `rowData()` pour créer un sous-ensemble sur quelque chose à partir des métadonnées de fonctionnalité +. Par exemple, nous ne conservons ici que les miARN et les échantillons non +infectés : ```{r} se1 <- se[rowData(se)$gene_biotype == "miRNA", @@ -288,12 +288,12 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: défi -## Challenge +## Défi -Extract the gene expression levels of the 3 first genes in samples -at time 0 and at time 8. +Extraire les niveaux d'expression génique des 3 premiers gènes dans les échantillons +au temps 0 et au temps 8. ::::::::::::::: solution @@ -312,58 +312,58 @@ assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## Défi -Verify that you get the same values using the long `rna` table. +Vérifiez que vous obtenez les mêmes valeurs en utilisant la longue table `rna`. ::::::::::::::: solution ## Solution ```{r, purl=FALSE} -rna |> - filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> - filter(time != 4) |> select(expression) +arn |> + filtre(gène %in% c("Asl", "Apod", "Cyd2d22")) |> + filtre(temps != 4) |> select(expression ) ``` ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -The long table and the `SummarizedExperiment` contain the same -information, but are simply structured differently. Each approach has its -own advantages: the former is a good fit for the `tidyverse` packages, -while the latter is the preferred structure for many bioinformatics and -statistical processing steps. For example, a typical RNA-Seq analyses using -the `DESeq2` package. 
+Le long tableau et le `SummarizedExperiment` contiennent les mêmes informations +, mais sont simplement structurés différemment. Chaque approche a ses +propres avantages : la première convient bien aux packages `tidyverse`, +tandis que la seconde est la structure préférée pour de nombreuses étapes de bioinformatique et +de traitement statistique. Par exemple, une analyse typique d'ARN-Seq utilisant +le package `DESeq2`. -#### Adding variables to metadata +#### Ajouter des variables aux métadonnées -We can also add information to the metadata. -Suppose that you want to add the center where the samples were collected... +Nous pouvons également ajouter des informations aux métadonnées. +Supposons que vous souhaitiez ajouter le centre où les échantillons ont été collectés... ```{r} -colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se)$center <- rep("Université de l'Illinois", nrow(colData(se))) colData(se) ``` -This illustrates that the metadata slots can grow indefinitely without -affecting the other structures! +Cela illustre que les emplacements de métadonnées peuvent croître indéfiniment sans +affecter les autres structures ! -### tidySummarizedExperiment +### TidyRésuméExpérience -You may be wondering, can we use tidyverse commands to interact with -`SummarizedExperiment` objects? The answer is yes, we can with the -`tidySummarizedExperiment` package. +Vous vous demandez peut-être si pouvons-nous utiliser les commandes Tidyverse pour interagir avec les objets +`SummarizedExperiment` ? La réponse est oui, nous pouvons le faire avec le package +`tidySummarizedExperiment`. -Remember what our SummarizedExperiment object looks like: +Rappelez-vous à quoi ressemble notre objet SummarizedExperiment : ```{r, message=FALSE} se ``` -Load `tidySummarizedExperiment` and then take a look at the se object -again. +Chargez `tidySummarizedExperiment` puis jetez à nouveau un œil à l'objet se +. 
```{r, message=FALSE} #BiocManager::install("tidySummarizedExperiment") @@ -372,52 +372,52 @@ library("tidySummarizedExperiment") se ``` -It's still a `SummarizedExperiment` object, so maintains the efficient -structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`\-`tibble` -abstraction. We can also see in the second line of the output the -number of transcripts and samples. +Il s'agit toujours d'un objet `SummarizedExperiment`, il conserve donc la structure efficace +, mais nous pouvons maintenant le voir comme un tibble. Notez la première ligne de +la sortie dit ceci, c'est une abstraction `SummarizedExperiment`\-`tibble` +. Nous pouvons également voir dans la deuxième ligne de la sortie le nombre +de transcriptions et d'échantillons. -If we want to revert to the standard `SummarizedExperiment` view, we -can do that. +Si nous voulons revenir à la vue standard `SummarizedExperiment`, nous +pouvons le faire. ```{r} options("restore_SummarizedExperiment_show" = TRUE) se ``` -But here we use the tibble view. +Mais ici, nous utilisons la vue tibble. ```{r} options("restore_SummarizedExperiment_show" = FALSE) se ``` -We can now use tidyverse commands to interact with the -`SummarizedExperiment` object. +Nous pouvons maintenant utiliser les commandes Tidyverse pour interagir avec l'objet +`SummarizedExperiment`. -We can use `filter` to filter for rows using a condition e.g. to view -all rows for one sample. +Nous pouvons utiliser `filter` pour filtrer les lignes en utilisant une condition, par exemple pour afficher +toutes les lignes pour un échantillon. ```{r} -se %>% filter(.sample == "GSM2545336") +se %>% filtre(.sample == "GSM2545336") ``` -We can use `select` to specify columns we want to view. +Nous pouvons utiliser « select » pour spécifier les colonnes que nous voulons afficher. ```{r} -se %>% select(.sample) +se %>% sélectionner (.sample) ``` -We can use `mutate` to add metadata info. 
+Nous pouvons utiliser `mutate` pour ajouter des informations sur les métadonnées. ```{r} -se %>% mutate(center = "Heidelberg University") +se %>% muter(center = "Université de Heidelberg") ``` -We can also combine commands with the tidyverse pipe `%>%`. For -example, we could combine `group_by` and `summarise` to get the total -counts for each sample. +Nous pouvons également combiner des commandes avec le tube Tidyverse `%>%`. Pour l'exemple de +, nous pourrions combiner `group_by` et `summarise` pour obtenir le nombre total de +pour chaque échantillon. ```{r} se %>% @@ -425,10 +425,10 @@ se %>% summarise(total_counts=sum(counts)) ``` -We can treat the tidy SummarizedExperiment object as a normal tibble -for plotting. +Nous pouvons traiter l'objet SummarizedExperiment bien rangé comme un tibble normal +pour le traçage. -Here we plot the distribution of counts per sample. +Ici, nous traçons la distribution des comptes par échantillon. ```{r tidySE-plot} se %>% @@ -438,27 +438,25 @@ se %>% theme_bw() ``` -For more information on tidySummarizedExperiment, see the package -website -[here](https://stemangiola.github.io/tidySummarizedExperiment/). +Pour plus d'informations sur TidySummarizedExperiment, consultez le site Web du package[ici](https://stemangiola.github.io/tidySummarizedExperiment/). -**Take-home message** +**Message à retenir** -- `SummarizedExperiment` represents an efficient way to store and - handle omics data. +- `SummarizedExperiment` représente un moyen efficace de stocker et + de gérer les données omiques. -- They are used in many Bioconductor packages. +- Ils sont utilisés dans de nombreux packages Bioconductor. -If you follow the next training focused on RNA sequencing analysis, -you will learn to use the Bioconductor `DESeq2` package to do some -differential expression analyses. The whole analysis of the `DESeq2` -package is handled in a `SummarizedExperiment`. 
+Si vous suivez la prochaine formation axée sur l'analyse de séquençage d'ARN, +vous apprendrez à utiliser le package Bioconductor `DESeq2` pour faire des +analyses d'expression différentielle. L'ensemble de l'analyse du package `DESeq2` +est géré dans un `SummarizedExperiment`. :::::::::::::::::::::::::::::::::::::::: keypoints -- Bioconductor is a project provide support and packages for the - comprehension of high high-throughput biology data. -- A `SummarizedExperiment` is a type of object useful to store and - manage high-throughput omics data. +- Bioconductor est un projet fournissant un support et des packages pour la + compréhension de données biologiques à haut débit. +- Un `SummarizedExperiment` est un type d'objet utile pour stocker et + gérer des données omiques à haut débit. :::::::::::::::::::::::::::::::::::::::::::::::::: From 50beff9c958c87ad6007cc7152e33ce6108c4d8f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 1 Aug 2024 00:07:52 +0900 Subject: [PATCH 206/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 372 +++++++++++++-------------- 1 file changed, 186 insertions(+), 186 deletions(-) diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd index 89511b1ab..33035a520 100644 --- a/locale/zh/episodes/60-next-steps.Rmd +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -1,8 +1,8 @@ --- -source: Rmd -title: Next steps -teaching: 45 -exercises: 45 +source: 放射科 +title: 下一步 +teaching: 四十五 +exercises: 四十五 --- ```{r, include=FALSE} @@ -10,126 +10,126 @@ exercises: 45 ::::::::::::::::::::::::::::::::::::::: objectives -- Introduce the Bioconductor project. -- Introduce the notion of data containers. -- Give an overview of the `SummarizedExperiment`, extensively used in - omics analyses. 
+- 介绍Bioconductor项目。 +- 引入数据容器的概念。 +- 概述在 + 组学分析中广泛使用的`SummarizedExperiment`。 :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a `SummarizedExperiment`? -- What is Bioconductor? +- 什么是“SummarizedExperiment”? +- 什么是 Bioconductor? -:::::::::::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::::::::::::::: -## Next steps +## 下一步 ```{r, echo=FALSE, message=FALSE} -library("tidyverse") +图书馆(“tidyverse”) ``` -Data in bioinformatics is often complex. To deal with this, -developers define specialised data containers (termed classes) that -match the properties of the data they need to handle. +生物信息学中的数据通常很复杂。 为了解决这个问题, +开发人员定义了专门的数据容器(称为类), +与他们需要处理的数据的属性相匹配。 -This aspect is central to the **Bioconductor**[^Bioconductor] project -which uses the same **core data infrastructure** across packages. This -certainly contributed to Bioconductor's success. Bioconductor package +这一方面是 **Bioconductor**[^Bioconductor] 项目 +的核心,它在各个包中使用相同的 **核心数据基础设施**。 这 +无疑为 Bioconductor 的成功做出了贡献。 Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability, and stability to the project as a whole. [^Bioconductor]: The [Bioconductor](https://www.bioconductor.org) was initiated by Robert Gentleman, one of the two creators of the R - language. Bioconductor provides tools dedicated to omics data - analysis. Bioconductor uses the R statistical programming language - and is open source and open development. + language. Bioconductor 提供专用于组学数据 + 分析的工具。 Bioconductor 使用 R 统计编程语言 + ,并且是开源和开放开发的。 -To illustrate such an omics data container, we'll present the -`SummarizedExperiment` class. +为了说明这样的组学数据容器,我们将介绍 +`SummarizedExperiment`类。 -## SummarizedExperiment +## 总结实验 -The figure below represents the anatomy of the SummarizedExperiment class. 
+下图显示了 SummarizedExperiment 类的结构。 ```{r SE, echo=FALSE, out.width="80%"} knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") ``` -Objects of the class SummarizedExperiment contain : +SummarizedExperiment 类的对象包含: -- **One (or more) assay(s)** containing the quantitative omics data - (expression data), stored as a matrix-like object. Features (genes, - transcripts, proteins, ...) are defined along the rows, and samples - along the columns. +- **一个(或多个)分析**包含定量组学数据 + (表达数据),存储为类似矩阵的对象。 特征(基因、 + 转录本、蛋白质……) 沿着行定义,并沿着列采样 + 。 -- A **sample metadata** slot containing sample co-variates, stored as a - data frame. Rows from this table represent samples (rows match exactly the - columns of the expression data). +- 包含样本协变量的**样本元数据**槽,存储为 + 数据框。 该表中的行代表样本(行与表达数据的 + 列完全匹配)。 -- A **feature metadata** slot containing feature co-variates, stored as - a data frame. The rows of this data frame match exactly the rows of the - expression data. +- 包含特征协变量的**特征元数据**槽,存储为 + 数据框。 该数据框的行与 + 表达数据的行完全匹配。 -The coordinated nature of the `SummarizedExperiment` guarantees that -during data manipulation, the dimensions of the different slots will -always match (i.e the columns in the expression data and then rows in -the sample metadata, as well as the rows in the expression data and -feature metadata) during data manipulation. For example, if we had to -exclude one sample from the assay, it would be automatically removed -from the sample metadata in the same operation. +`SummarizedExperiment` 的协调特性保证了在数据操作过程中 +不同插槽的维度将 +始终匹配(即表达数据中的列,然后是 +样本元数据中的行,以及表达数据和 +特征元数据中的行)。 例如,如果我们必须 +从检测中排除一个样本,那么它将在同一操作中自动从样本元数据中删除 +。 -The metadata slots can grow additional co-variates -(columns) without affecting the other structures. +元数据槽可以增加额外的协变量 +(列)而不会影响其他结构。 -### Creating a SummarizedExperiment +### 创建汇总实验 -In order to create a `SummarizedExperiment`, we will create the -individual components, i.e the count matrix, the sample and gene -metadata from csv files. 
These are typically how RNA-Seq data are -provided (after raw data have been processed). +为了创建一个`SummarizedExperiment`,我们将从 csv 文件中创建 +个单独组件,即计数矩阵、样本和基因 +元数据。 这些通常是 RNA-Seq 数据 +的提供方式(在原始数据被处理之后)。 ```{r, echo=FALSE, message=FALSE} rna <- read_csv("data/rnaseq.csv") -## count matrix +## 计数矩阵 counts <- rna %>% select(gene, sample, expression) %>% pivot_wider(names_from = sample, values_from = expression) -## convert to matrix and set row names +## 转换为矩阵并设置行名称 count_matrix <- counts %>% select(-gene) %>% as.matrix() rownames(count_matrix) <- counts$gene -## sample annotation +## 样本注释 sample_metadata <- rna %>% - select(sample, organism, age, sex, infection, strain, time, tissue, mouse) + 选择(样本、生物体、年龄、性别、感染、菌株、时间、组织、小鼠) -## remove redundancy +## 消除冗余 sample_metadata <- unique(sample_metadata) -## gene annotation +## 基因注释 gene_metadata <- rna %>% - select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, - chromosome_name, gene_biotype, phenotype_description, - hsapiens_homolog_associated_gene_name) + 选择(基因、ENTREZID、产品、ensembl_gene_id、external_synonym、 + chromosome_name、gene_biotype、phenotype_description、 + hsapiens_homolog_associated_gene_name) -# remove redundancy +# 消除冗余 gene_metadata <- unique(gene_metadata) -## write to csv +## 写入到 csv write.csv(count_matrix, file = "data/count_matrix.csv") write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) ``` -- **An expression matrix**: we load the count matrix, specifying that - the first columns contains row/gene names, and convert the - `data.frame` to a `matrix`. You can download it - [here](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv). 
+- **表达矩阵**:我们加载计数矩阵,指定 + 第一列包含行/基因名称,并将 + `data.frame`转换为`matrix`。 您可以从 [这里](https://carpentries-incubator.github.io/bioc-intro/data/count_matrix.csv) 下载它 + 。 ```{r} count_matrix <- read.csv("data/count_matrix.csv", @@ -140,8 +140,8 @@ count_matrix[1:5, ] dim(count_matrix) ``` -- **A table describing the samples**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv). +- **描述样本的表格**,可在 + [此处](https://carpentries-incubator.github.io/bioc-intro/data/sample_metadata.csv) 获取。 ```{r} sample_metadata <- read.csv("data/sample_metadata.csv") @@ -149,8 +149,8 @@ sample_metadata dim(sample_metadata) ``` -- **A table describing the genes**, available - [here](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv). +- **描述基因的表格**,可在 + [此处](https://carpentries-incubator.github.io/bioc-intro/data/gene_metadata.csv) 获取。 ```{r} gene_metadata <- read.csv("data/gene_metadata.csv") @@ -158,27 +158,27 @@ gene_metadata[1:10, 1:4] dim(gene_metadata) ``` -We will create a `SummarizedExperiment` from these tables: +我们将根据以下表格创建一个“SummarizedExperiment”: -- The count matrix that will be used as the **`assay`** +- 将用作\*\*\`分析\*\*的计数矩阵 -- The table describing the samples will be used as the **sample - metadata** slot +- 描述样本的表格将用作**样本 + 元数据**槽 -- The table describing the genes will be used as the **features - metadata** slot +- 描述基因的表格将用作**特征 + 元数据**槽 -To do this we can put the different parts together using the -`SummarizedExperiment` constructor: +为此,我们可以使用 +`SummarizedExperiment` 构造函数将不同的部分放在一起: ```{r, message=FALSE, warning=FALSE} ## BiocManager::install("SummarizedExperiment") library("SummarizedExperiment") ``` -First, we make sure that the samples are in the same order in the -count matrix and the sample annotation, and the same for the genes in -the count matrix and the gene annotation. 
+首先,我们确保 +计数矩阵和样本注释中的样本顺序相同,并且 +计数矩阵和基因注释中的基因顺序相同。 ```{r} stopifnot(rownames(count_matrix) == gene_metadata$gene) @@ -192,63 +192,63 @@ se <- SummarizedExperiment(assays = list(counts = count_matrix), se ``` -### Saving data +### 保存数据 -Exporting data to a spreadsheet, as we did in a previous episode, has -several limitations, such as those described in the first chapter -(possible inconsistencies with `,` and `.` for decimal separators and -lack of variable type definitions). Furthermore, exporting data to a -spreadsheet is only relevant for rectangular data such as dataframes -and matrices. +就像我们在上一集中所做的那样,将数据导出到电子表格有 +几个限制,例如第一章 +中描述的限制(小数分隔符 `,` 和 `.` 可能不一致,以及 +缺少变量类型定义)。 此外,将数据导出到 +电子表格仅与矩形数据(例如数据框 +和矩阵)相关。 -A more general way to save data, that is specific to R and is -guaranteed to work on any operating system, is to use the `saveRDS` -function. Saving objects like this will generate a binary -representation on disk (using the `rds` file extension here), which -can be loaded back into R using the `readRDS` function. +保存数据的更通用的方法是使用 `saveRDS` +函数,这种方法特定于 R,并且 +保证可以在任何操作系统上运行。 像这样保存对象将在磁盘上生成二进制 +表示(此处使用`rds`文件扩展名),可以使用`readRDS`函数将其 +加载回R。 ```{r, eval=FALSE} -saveRDS(se, file = "data_output/se.rds") +saveRDS(se,file = “data_output/se.rds”) rm(se) -se <- readRDS("data_output/se.rds") +se <- readRDS(“data_output/se.rds”) head(se) ``` To conclude, when it comes to saving data from R that will be loaded again in R, saving and loading with `saveRDS` and `readRDS` is the -preferred approach. If tabular data need to be shared with somebody -that is not using R, then exporting to a text-based spreadsheet is a -good alternative. +preferred approach. 
如果需要与不使用 R 的人 +共享表格数据,那么导出到基于文本的电子表格是一个 +不错的选择。 -Using this data structure, we can access the expression matrix with -the `assay` function: +使用这个数据结构,我们可以通过 +`assay` 函数访问表达矩阵: ```{r} -head(assay(se)) -dim(assay(se)) +头部(测定(se)) +暗淡(测定(se)) ``` -We can access the sample metadata using the `colData` function: +我们可以使用“colData”函数访问样本元数据: ```{r} colData(se) -dim(colData(se)) +暗淡(colData(se)) ``` -We can also access the feature metadata using the `rowData` function: +我们还可以使用“rowData”函数访问特征元数据: ```{r} -head(rowData(se)) -dim(rowData(se)) +头(rowData(se)) +dim(rowData(se)) ``` -### Subsetting a SummarizedExperiment +### 对 SummarizedExperiment 进行子集设置 -SummarizedExperiment can be subset just like with data frames, with -numerics or with characters of logicals. +SummarizedExperiment 可以像数据框一样被子集化,具有 +数字或逻辑字符。 -Below, we create a new instance of class SummarizedExperiment that -contains only the 5 first features for the 3 first samples. +下面,我们创建 SummarizedExperiment 类的新实例,其中 +仅包含前 3 个样本的前 5 个特征。 ```{r} se1 <- se[1:5, 1:3] @@ -256,20 +256,20 @@ se1 ``` ```{r} -colData(se1) -rowData(se1) +colData(se1) +rowData(se1) ``` -We can also use the `colData()` function to subset on something from -the sample metadata or the `rowData()` to subset on something from the -feature metadata. For example, here we keep only miRNAs and the non -infected samples: +我们还可以使用 `colData()` 函数从 +样本元数据中对某些内容进行子集化,或者使用 `rowData()` 从 +特征元数据中对某些内容进行子集化。 例如,这里我们只保留 miRNA 和非 +感染的样本: ```{r} se1 <- se[rowData(se)$gene_biotype == "miRNA", colData(se)$infection == "NonInfected"] se1 -assay(se1) +analysis(se1) colData(se1) rowData(se1) ``` @@ -288,16 +288,16 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: challenge +::::::::::::::::::::::::::::::::::::::: 挑战 -## Challenge +## 挑战 -Extract the gene expression levels of the 3 first genes in samples -at time 0 and at time 8. 
+提取时间 0 和时间 8 时样本 +中前 3 个基因的基因表达水平。 ::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=FALSE} assay(se)[1:3, colData(se)$time != 4] @@ -312,58 +312,58 @@ assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ::::::::::::::::::::::::::::::::::::::: challenge -## Challenge +## 挑战 -Verify that you get the same values using the long `rna` table. +验证您是否使用长“rna”表获得相同的值。 ::::::::::::::: solution -## Solution +## 解决方案 ```{r, purl=FALSE} rna |> - filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> - filter(time != 4) |> select(expression) + 过滤器(基因 %in% c("Asl", "Apod", "Cyd2d22")) |> + 过滤器(时间 != 4)|> 选择(表达式) ``` ::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::::: -The long table and the `SummarizedExperiment` contain the same -information, but are simply structured differently. Each approach has its -own advantages: the former is a good fit for the `tidyverse` packages, -while the latter is the preferred structure for many bioinformatics and -statistical processing steps. For example, a typical RNA-Seq analyses using -the `DESeq2` package. +长表和`SummarizedExperiment`包含相同的 +信息,但是结构不同。 每种方法都有其 +自身的优势:前者非常适合`tidyverse`包, +而后者是许多生物信息学和 +统计处理步骤的首选结构。 例如,典型的 RNA-Seq 分析使用 +`DESeq2` 包。 -#### Adding variables to metadata +#### 向元数据添加变量 -We can also add information to the metadata. -Suppose that you want to add the center where the samples were collected... +我们还可以向元数据中添加信息。 +假设您想添加收集样本的中心…… ```{r} -colData(se)$center <- rep("University of Illinois", nrow(colData(se))) +colData(se)$center <- rep("伊利诺伊大学", nrow(colData(se))) colData(se) ``` This illustrates that the metadata slots can grow indefinitely without affecting the other structures! -### tidySummarizedExperiment +### tidy总结实验 -You may be wondering, can we use tidyverse commands to interact with -`SummarizedExperiment` objects? The answer is yes, we can with the -`tidySummarizedExperiment` package. +您可能想知道,我们可以使用 tidyverse 命令与 +`SummarizedExperiment` 对象交互吗? 
答案是肯定的,我们可以使用 +`tidySummarizedExperiment` 包。 -Remember what our SummarizedExperiment object looks like: +记住我们的 SummarizedExperiment 对象是什么样的: ```{r, message=FALSE} -se +塞 ``` -Load `tidySummarizedExperiment` and then take a look at the se object -again. +加载“tidySummarizedExperiment”,然后再次查看 se 对象 +。 ```{r, message=FALSE} #BiocManager::install("tidySummarizedExperiment") @@ -372,63 +372,63 @@ library("tidySummarizedExperiment") se ``` -It's still a `SummarizedExperiment` object, so maintains the efficient -structure, but now we can view it as a tibble. Note the first line of -the output says this, it's a `SummarizedExperiment`\-`tibble` -abstraction. We can also see in the second line of the output the -number of transcripts and samples. +它仍然是一个 `SummarizedExperiment` 对象,因此保持了高效的 +结构,但现在我们可以将其视为一个 tibble。 注意输出的第一行 +说明了这一点,它是一个 `SummarizedExperiment`\-`tibble` +抽象。 我们还可以在输出的第二行看到 +份成绩单和样本的数量。 -If we want to revert to the standard `SummarizedExperiment` view, we -can do that. +如果我们想恢复到标准的“SummarizedExperiment”视图,我们 +可以这样做。 ```{r} -options("restore_SummarizedExperiment_show" = TRUE) +选项(“restore_SummarizedExperiment_show” = TRUE) se ``` -But here we use the tibble view. +但这里我们使用 tibble 视图。 ```{r} -options("restore_SummarizedExperiment_show" = FALSE) +选项(“restore_SummarizedExperiment_show” = FALSE) se ``` -We can now use tidyverse commands to interact with the -`SummarizedExperiment` object. +我们现在可以使用 tidyverse 命令与 +`SummarizedExperiment` 对象交互。 -We can use `filter` to filter for rows using a condition e.g. to view -all rows for one sample. +我们可以使用“过滤器”根据条件过滤行,例如查看 +一个样本的所有行。 ```{r} -se %>% filter(.sample == "GSM2545336") +se %>% 过滤器(.sample == “GSM2545336”) ``` -We can use `select` to specify columns we want to view. +我们可以使用“select”来指定我们想要查看的列。 ```{r} -se %>% select(.sample) +se %>% 选择(.sample) ``` -We can use `mutate` to add metadata info. 
+我们可以使用“mutate”来添加元数据信息。 ```{r} -se %>% mutate(center = "Heidelberg University") +se %>% mutate(center = "海德堡大学") ``` -We can also combine commands with the tidyverse pipe `%>%`. For -example, we could combine `group_by` and `summarise` to get the total -counts for each sample. +我们还可以将命令与 tidyverse 管道“%>%”组合起来。 对于 +示例,我们可以结合 `group_by` 和 `summarise` 来获取每个样本的总 +计数。 ```{r} se %>% group_by(.sample) %>% - summarise(total_counts=sum(counts)) + 汇总(total_counts=sum(counts)) ``` -We can treat the tidy SummarizedExperiment object as a normal tibble -for plotting. +我们可以将整洁的 SummarizedExperiment 对象视为用于绘图的正常 tibble +。 -Here we plot the distribution of counts per sample. +这里我们绘制了每个样本的计数分布。 ```{r tidySE-plot} se %>% @@ -438,26 +438,26 @@ se %>% theme_bw() ``` -For more information on tidySummarizedExperiment, see the package -website -[here](https://stemangiola.github.io/tidySummarizedExperiment/). +有关 tidySummarizedExperiment 的更多信息,请参阅包 +网站 +[此处](https://stemangiola.github.io/tidySummarizedExperiment/)。 -**Take-home message** +**带回家的信息** -- `SummarizedExperiment` represents an efficient way to store and - handle omics data. +- `SummarizedExperiment` 代表了一种存储和 + 处理组学数据的有效方法。 -- They are used in many Bioconductor packages. +- 它们被用于许多 Bioconductor 包中。 -If you follow the next training focused on RNA sequencing analysis, -you will learn to use the Bioconductor `DESeq2` package to do some -differential expression analyses. The whole analysis of the `DESeq2` -package is handled in a `SummarizedExperiment`. +如果您参加下一次以 RNA 测序分析为重点的培训, +您将学习使用 Bioconductor `DESeq2` 包进行一些 +差异表达分析。 `DESeq2` +包的整个分析在 `SummarizedExperiment` 中处理。 :::::::::::::::::::::::::::::::::::::::: keypoints -- Bioconductor is a project provide support and packages for the - comprehension of high high-throughput biology data. +- Bioconductor 是一个为 + 理解高通量生物学数据提供支持和包的项目。 - A `SummarizedExperiment` is a type of object useful to store and manage high-throughput omics data. 
From e7cadaeab47ba6187419b49b76ca49ea59dcf186 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:21:30 +0900
Subject: [PATCH 207/334] New translations 10-data-organisation.md (Spanish)

---
 locale/es/episodes/10-data-organisation.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/es/episodes/10-data-organisation.Rmd b/locale/es/episodes/10-data-organisation.Rmd
index 5d672156a..d04ccd3f9 100644
--- a/locale/es/episodes/10-data-organisation.Rmd
+++ b/locale/es/episodes/10-data-organisation.Rmd
@@ -8,7 +8,7 @@ exercises: 30
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objetivos
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Aprenda sobre las hojas de cálculo, sus fortalezas y debilidades.
 - ¿Cómo damos formato a los datos en hojas de cálculo para un uso eficaz de los datos?

From bfe68f4bfe95eaf873fd99b21db9d8752547c886 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:21:39 +0900
Subject: [PATCH 208/334] New translations 20-r-rstudio.md (Spanish)

---
 locale/es/episodes/20-r-rstudio.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd
index 5edbfded3..3c80fec73 100644
--- a/locale/es/episodes/20-r-rstudio.Rmd
+++ b/locale/es/episodes/20-r-rstudio.Rmd
@@ -8,7 +8,7 @@ exercises: 0
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objetivos
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Describa el propósito de los paneles RStudio Script, Consola, Entorno y Gráficos.
 - Organice archivos y directorios para un conjunto de análisis como un proyecto de R y comprenda el propósito del directorio de trabajo.
From ed2a27baf99adde61cd5690ac12590a1dba2235e Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:21:41 +0900
Subject: [PATCH 209/334] New translations 20-r-rstudio.md (Japanese)

---
 locale/ja/episodes/20-r-rstudio.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/ja/episodes/20-r-rstudio.Rmd b/locale/ja/episodes/20-r-rstudio.Rmd
index 747be52b4..fd1053846 100644
--- a/locale/ja/episodes/20-r-rstudio.Rmd
+++ b/locale/ja/episodes/20-r-rstudio.Rmd
@@ -8,7 +8,7 @@ exercises: 0
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: 目的
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - RStudio スクリプト、コンソール、環境、およびプロットペインの目的について説明します。
 - Rプロジェクトとして一連の分析のためのファイルとディレクトリを整理し、作業ディレクトリの目的を理解する。

From 2f4f7dd5058ca79bc59b7ec776d650fdc13ad4f2 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:21:47 +0900
Subject: [PATCH 210/334] New translations 23-starting-with-r.md (French)

---
 locale/fr/episodes/23-starting-with-r.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd
index 039310451..a74715b27 100644
--- a/locale/fr/episodes/23-starting-with-r.Rmd
+++ b/locale/fr/episodes/23-starting-with-r.Rmd
@@ -8,7 +8,7 @@ exercises: 60
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objectifs
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Définissez les termes suivants relatifs à R : objet, affectation, appel, fonction, arguments, options.
 - Attribuez des valeurs aux objets dans R.
@@ -22,7 +22,7 @@ exercises: 60
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
-:::::::::::::::::::::::::::::::::::::::: des questions
+:::::::::::::::::::::::::::::::::::::::: questions
 
 - Premières commandes dans R
 

From af4df03e6085c315046fbe92277bf4813614af62 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:21:49 +0900
Subject: [PATCH 211/334] New translations 23-starting-with-r.md (Spanish)

---
 locale/es/episodes/23-starting-with-r.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd
index 267cbe47a..67057b323 100644
--- a/locale/es/episodes/23-starting-with-r.Rmd
+++ b/locale/es/episodes/23-starting-with-r.Rmd
@@ -8,7 +8,7 @@ exercises: 60
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objetivos
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Defina los siguientes términos en relación con R: objeto, asignación, llamada, función, argumentos, opciones.
 - Asignar valores a objetos en R.
From fdf17851501db6255fd2a6e5ce7511611c6df1c4 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:21:51 +0900 Subject: [PATCH 212/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 50 +++++++++++------------ 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 29a615048..32fda42fd 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -8,7 +8,7 @@ exercises: 60 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: 目的 +::::::::::::::::::::::::::::::::::::::: objectives - R に関連する次の用語を定義します: オブジェクト、代入、呼び出し、関数、引数、オプション。 - R のオブジェクトに値を割り当てます。 @@ -22,7 +22,7 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 質問 +:::::::::::::::::::::::::::::::::::::::: questions - R の最初のコマンド @@ -137,7 +137,7 @@ R のメモリに「weight_kg」があるので、それを使って算術演算 体重_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -158,7 +158,7 @@ RStudio では、段落のコメントまたはコメント解除が簡単に行 位置にカーソルを置きます (つまり、行全体を選択する必要はありません)。その後 Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd><kbd>押します。 -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ @@ -341,7 +341,7 @@ Weight_g 重要なものは、リスト (`list`)、行列 (`matrix`)、データ フレーム (`data.frame`)、因子 (`factor`)、および配列 (`array`) です。 -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -349,7 +349,7 @@ Weight_g double)、整数、および論理型であることがわかりました。 しかし、これらのタイプを つのベクトルに混在させようとするとどうなるでしょうか? 
-::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -359,7 +359,7 @@ R はそれらをすべて同じ型に暗黙的に変換します。 :::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -373,7 +373,7 @@ char_logical <- c("a", "b", "c", TRUE) ) トリッキー <- c(1, 2, 3, "4") ``` -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -392,13 +392,13 @@ char_logical :::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: なぜそれが起こると思いますか? -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -410,7 +410,7 @@ char_logical :::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -423,7 +423,7 @@ char_logical <- c("a", "b", "c", TRUE) combined_logical <- c(num_logical, char_logical) ``` -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -440,7 +440,7 @@ combined_logical <- c(num_logical, char_logical) :::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -450,7 +450,7 @@ R では、オブジェクトをあるクラスから別のクラスに変換す がどのように強制されるかの階層を表す図を描いてもらえます ? -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -568,13 +568,13 @@ AND) または `|` (少なくとも 1 つの条件が true、OR) を使用して 分子[分子 %in% c("rna", " 「DNA」、「代謝物」、「ペプチド」、「グリセロール」)] ``` -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: なぜ `"four" > "five"` が `TRUE` を返すのか理解できますか? -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -650,7 +650,7 @@ na.omit(heights) の高さ[完全なケース(高さ)] ``` -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -663,7 +663,7 @@ na.omit(heights) 2. 関数 `median()` を使用して、`heights` ベクトルの中央値を計算します。 3. 
R を使用して、セット内の身長が 67 インチを超える人が何人いるかを計算します。 -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -714,13 +714,13 @@ set.seed(1) 文字と論理に対しても同様のコンストラクターがあり、 `character()` と `logical()` という名前が付けられます。 -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: 文字ベクトルと論理ベクトルのデフォルトは何ですか? -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -759,7 +759,7 @@ set.seed(1) rep(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -767,7 +767,7 @@ rep(c(1, 2, 3), 5) 1 を 5 つ、2 を 5 つ、3 を 5 つこの順序で取得した場合はどうなるでしょうか。 可能性は 2 つあります。ヘルプについては `?rep` または `?sort` を参照してください。 -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -836,7 +836,7 @@ seq(from = 1、to = 20、length.out = 3) サンプル(1:5、10、置換 = TRUE) ``` -::::::::::::::::::::::::::::::::::::::: チャレンジ +::::::::::::::::::::::::::::::::::::::: challenge ## チャレンジ: @@ -855,7 +855,7 @@ seq(from = 1、to = 20、length.out = 3) 別のシードを設定して繰り返します。 -::::::::::::::: 解決 +::::::::::::::: solution ## 解決 @@ -915,7 +915,7 @@ rnorm(5, 100, 5) のデータ構造の基本を学習したので、より大きなデータの操作を開始する準備が整い、データ フレームについて します。 -:::::::::::::::::::::::::::::::::::::::: キーポイント +:::::::::::::::::::::::::::::::::::::::: keypoints - Rと対話する方法 From 4248b36600cbdf2f93071ea9fd2d924dd7441878 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:21:56 +0900 Subject: [PATCH 213/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd index 411f2d942..35709d1ee 100644 --- a/locale/fr/episodes/25-starting-with-data.Rmd +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -8,7 +8,7 @@ exercises: 30 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objectifs +::::::::::::::::::::::::::::::::::::::: objectives - 
Décrivez ce qu'est un « data.frame ». - Chargez des données externes à partir d'un fichier .csv dans un bloc de données. From 9e6a4eaa7c0ce270ac0814af86b4d664ffccf92d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:21:58 +0900 Subject: [PATCH 214/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index bc7da42f4..db31bc071 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -8,7 +8,7 @@ exercises: 30 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objetivos +::::::::::::::::::::::::::::::::::::::: objectives - Describir un objeto de tipo `data.frame`. - Cargar datos externos desde un archivo .csv a un objecto `data.frame`. From 44bc3499faaeb2463b847691acc87632dc3781e5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:00 +0900 Subject: [PATCH 215/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index e0cf3ff3a..40fd82861 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -8,7 +8,7 @@ exercises: 30 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: 目的 +::::::::::::::::::::::::::::::::::::::: objectives - `data.frame` が何なのか説明してみましょう。 - .csv ファイルからデータ フレームに外部データを読み込みましょう。 From 1f0d0047836a5651716f0818979eb4ac22d51bed Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:06 +0900 Subject: [PATCH 216/334] New translations 30-dplyr.md (French) --- 
 locale/fr/episodes/30-dplyr.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd
index 3683d6c13..03194ce7b 100644
--- a/locale/fr/episodes/30-dplyr.Rmd
+++ b/locale/fr/episodes/30-dplyr.Rmd
@@ -8,7 +8,7 @@ exercises: 75
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objectifs
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Décrivez l'objectif des packages **`dplyr`** et **`tidyr`**.
 - Décrivez plusieurs de leurs fonctions extrêmement utiles pour
@@ -19,7 +19,7 @@ exercises: 75
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
-:::::::::::::::::::::::::::::::::::::::: des questions
+:::::::::::::::::::::::::::::::::::::::: questions
 
 - Analyse de données dans R à l'aide du méta-paquet Tidyverse
 

From 27131310e80872b9e0e74dd75a947720f7a718e2 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Wed, 14 Aug 2024 18:22:08 +0900
Subject: [PATCH 217/334] New translations 30-dplyr.md (Spanish)

---
 locale/es/episodes/30-dplyr.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd
index fd1d5da6e..0b7c7c173 100644
--- a/locale/es/episodes/30-dplyr.Rmd
+++ b/locale/es/episodes/30-dplyr.Rmd
@@ -8,7 +8,7 @@ exercises: 75
 ```{r, include=FALSE}
 ```
 
-::::::::::::::::::::::::::::::::::::::: objetivos
+::::::::::::::::::::::::::::::::::::::: objectives
 
 - Describe el propósito de los paquetes **`dplyr`** y **`tidyr`**.
- Describe varias de sus funciones que son extremadamente útiles para From ecd17269571fee702267fa79f3a7056379b68cd3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:10 +0900 Subject: [PATCH 218/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index acf596daa..a54cb810c 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -8,7 +8,7 @@ exercises: 75 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: 目的 +::::::::::::::::::::::::::::::::::::::: objectives - dplyr`** と **tidyr`\*\* パッケージの目的を説明する。 - データを操作するのに非常に便利な関数をいくつか説明する。 From d05e8a9c078dae0eff600d775d0af81a18fd6f86 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:16 +0900 Subject: [PATCH 219/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index c7b68831c..eec428127 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -21,7 +21,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai :::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: des questions +:::::::::::::::::::::::::::::::::::::::: questions - Visualisation en R From df0df641ed9dcb6520fd52d2435b6a2ec84b4444 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:18 +0900 Subject: [PATCH 220/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/40-visualization.Rmd 
b/locale/es/episodes/40-visualization.Rmd index 97849eb0b..db7eda723 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -11,7 +11,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai destfile = "datos/rnaseq.csv") ``` -::::::::::::::::::::::::::::::::::::::: objetivos +::::::::::::::::::::::::::::::::::::::: objectives - Produzca diagramas de dispersión, diagramas de caja, diagramas de líneas, etc. utilizando ggplot. - Establezca configuraciones de trama universales. From c4683e62a59d09b889f1cf8020b673164116caf7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:22:26 +0900 Subject: [PATCH 221/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index f190dc41a..9a7c9717f 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -8,7 +8,7 @@ exercises: 45 ```{r, include=FALSE} ``` -::::::::::::::::::::::::::::::::::::::: objetivos +::::::::::::::::::::::::::::::::::::::: objectives - Presentar el proyecto Bioconductor. - Introducir la noción de contenedores de datos. From 71c9c1833d21f0b9a1f5a51f4345a2a3c86b5a84 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:47:24 +0900 Subject: [PATCH 222/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index 40fd82861..4da684c75 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -194,7 +194,7 @@ str(rna)\`の出力に基づいて、以下の - オブジェクト `rna` のクラスは何ですか? 
- このオブジェクトにはいくつの行といくつの列がありますか? -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -281,7 +281,7 @@ RStudio では、オートコンプリート機能を使用して、列の完全 行だけを保持し、`head(rna)\`の 挙動を再現することができる。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -422,7 +422,7 @@ plot(sex) - F "と "M "の名前をそれぞれ "Female "と "Male "に変更する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -451,7 +451,7 @@ animal_data <- data.frame( weight = c(45, 8 1.1, 0.8)) ``` -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -489,7 +489,7 @@ country_climate <- data.frame( ) ``` -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -546,7 +546,7 @@ installed.packages()`という関数を使って、 あなたのコンピューターに現在インストールされているすべてのパッケージの情報を含む `文字\`行列 を作成します。 探検してみよう。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## 解決策 @@ -578,7 +578,7 @@ colnames(ip) 正規分布データ (平均0、標準偏差1)の次元1000×3の行列を作る。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -767,7 +767,7 @@ write.csv(rna, file = "data_output/my_rna.csv") 、列の区切り文字としてカンマを使用しているにもかかわらず、 、Rに正しく読み込むことができます。 -::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント +:::::::::::::::::::::::::::::::::::::::: keypoints - Rでの表形式データ From 42429fcaa825cdb39c15fe9bb4f85b41b485eb7d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:47:28 +0900 Subject: [PATCH 223/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index a54cb810c..a347645ef 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -254,7 +254,7 @@ rna3 、遺伝子の発現が50000より高い雌マウスのオブザベーションを保持するように`rna`データをサブセットし、 `gene`、`sample`、`time`、`expression`、`age`の列のみを保持する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -307,7 +307,7 @@ phenotype_descriptionに関連し、log expressionが5より高い遺伝子の 
**ヒント**:このデータフレームを 、どのようにコマンドを並べるべきか考えてみよう! -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -401,7 +401,7 @@ rna %>% 遺伝子 "Dok3 "のタイムポイントごとの平均発現量を計算する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -489,7 +489,7 @@ rna %>% 3. サンプルを1つ選び、バイオタイプ別に遺伝子数を評価する。 4. DNAメチル化異常」という表現型に関連する遺伝子を特定し、時間0、時間4、時間8における平均発現量(対数)を計算する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -714,7 +714,7 @@ rnaテーブルから始めて、`pivot_wider()`関数を使用して、 、各マウスの遺伝子発現レベルを示すワイドフォーマットのテーブルを作成する。 そして、`pivot_longer()`関数を使って、ロングフォーマットの表を復元する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -746,7 +746,7 @@ knitr::include_graphics("fig/Exercise_pivot_W.png") 整形する前にまとめる必要がある! -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -795,7 +795,7 @@ rna\`データセットを使って、 各行が遺伝子の平均発現量を表し、 各列が異なるタイムポイントを表す発現行列を作成する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -867,7 +867,7 @@ rna %>% を含む新しい列を作成する。 この表を、計算されたフォールド・チェンジを集めたロングフォーマットの表に変換する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -983,7 +983,7 @@ full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) をクリックして `annot3` テーブルをダウンロードし、そのテーブルをあなたの data/ リポジトリに置いてください。 full_join()`関数を使用して、テーブル`rna_mini`と`annot3\` を結合する。 、遺伝子_Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_、_mt-Tl1_はどうなったのか? 
-:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -1024,7 +1024,7 @@ write_csv()\`を使用して、以前に作成したrna_wideテーブルを保 write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント +:::::::::::::::::::::::::::::::::::::::: keypoints - tidyverseメタパッケージを使用したRでの表形式データ From bc4ccac6ec0586438e00673055a41d8e0196162e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 18:47:32 +0900 Subject: [PATCH 224/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 4065fb4a6..9d5439486 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -209,7 +209,7 @@ ggplot(rna, aes(x = expression_log))+ geom_histogram() - `scale_x_log10()` を参照。 前のグラフと比較してみよう。 、警告メッセージが表示されるようになったのはなぜですか? -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -360,7 +360,7 @@ library("hexbin") プロットの相対的な長所と短所は何か? 上記の散布図( )を調べ、作成した六角形のビンプロットと比較する。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -389,7 +389,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + の散布図を作成し、 異なる色で時間を表示する。 このようなデータを表示するのは良い方法ですか? -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -430,7 +430,7 @@ ggplot(data = rna, ボックスプロットレイヤーがジッターレイヤーの前にあることに注目してほしい。 、ボックスプロットをポイントの下に配置するために、コードのどこを変更する必要がありますか? -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -472,7 +472,7 @@ _ヒント:_ `time`のクラスをチェックする。 `time` のクラスをggplotマッピングで整数から因数に直接変更することを検討する。 、Rのグラフの作り方が変わるのはなぜですか? 
-:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -510,7 +510,7 @@ ggplot(data = rna, - geom_violin()`を参照してください。 引数 `fill\` の時間に従ってヴァイオリンにフィルを入れる。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -531,7 +531,7 @@ ggplot(data = rna, - ヴァイオリンのプロットを修正し、ヴァイオリンを `sex` で埋める。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -659,7 +659,7 @@ ggplot(data = mean_exp_by_time_sex, 、各染色体の 平均発現量が感染期間を通じてどのように変化するかをプロットする。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -817,7 +817,7 @@ ggplot(rna, aes(x = expression_log))+ カラーを手動で指定してみてください( [http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\(ggplot2\)/) 参照)。 -:::::::::::::::::::: 解決策 +::::::::::::::: solution ## ソリューション @@ -1086,7 +1086,7 @@ lattice`** パッケージは `ggplot2` と似ているが、 lattice\`パッケージの良いリファレンスは@latticebookだ。 -::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: キーポイント +:::::::::::::::::::::::::::::::::::::::: keypoints - Rによる可視化 From 28fe676a1b0abc289211e843e54a31ae1324599f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:25:21 +0900 Subject: [PATCH 225/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index 3c80fec73..cb18ffda9 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -15,13 +15,13 @@ exercises: 0 - Utilice la interfaz de ayuda integrada de RStudio para buscar más información sobre las funciones de R. - Demuestre cómo proporcionar suficiente información para la resolución de problemas con la comunidad de usuarios de R. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions - ¿Qué son R y RStudio? 
-:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: > Este episodio se basa en la lección _Análisis de datos y > Visualización en R para ecologistas_ de Data Carpentries. @@ -261,7 +261,7 @@ cree una carpeta llamada `data` dentro de su directorio de trabajo recién cread su consola R). Repita estas operaciones para crear una carpeta `data_output/` y `fig_output`. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: Mantendremos el script en la raíz de nuestro directorio de trabajo porque solo usaremos un archivo y facilitará las cosas @@ -665,4 +665,4 @@ De forma predeterminada, `BiocManager::install()` también verificará todos los - Comience a usar R y RStudio -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 7a7efc386415e52fd3b1b185946e070adb1bfb73 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:25:56 +0900 Subject: [PATCH 226/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 26 +++++++++++------------ 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index 67057b323..d8692d96a 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -22,7 +22,7 @@ exercises: 60 :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: preguntas +:::::::::::::::::::::::::::::::::::::::: questions - Primeros comandos en R @@ -137,7 +137,7 @@ y luego cambie `weight_kg` a 100. peso_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -159,7 +159,7 @@ you only want to comment out one line, you can put the cursor at any location of that line (i.e. 
no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -343,7 +343,7 @@ Los vectores son una de las muchas **estructuras de datos** que utiliza R. Otros importantes son listas (`list`), matrices (`matrix`), marcos de datos (`data.frame`), factores (`factor`) y matrices (`array` ). -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -361,7 +361,7 @@ R los convierte implícitamente para que todos sean del mismo tipo :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -394,7 +394,7 @@ tricky :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -412,7 +412,7 @@ no pierda ninguna información. 
:::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -442,7 +442,7 @@ lógico_combinado :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -570,7 +570,7 @@ molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -652,7 +652,7 @@ na.omit(heights) alturas[completos.casos(alturas)] ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -716,7 +716,7 @@ numérico(0) Hay constructores similares para caracteres y lógicos, llamados `character()` y `logic()` respectivamente. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -761,7 +761,7 @@ de longitud 1) y de cualquier tipo. 
Por ejemplo, si queremos repetir los valores representante(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -838,7 +838,7 @@ en `TRUE`: muestra(1:5, 10, reemplazar = VERDADERO) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: From f1b9ee1cb2b405676ac40b711b98b562211ea9d7 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:00 +0900 Subject: [PATCH 227/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 48 +++++++++++------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index a043e2203..4ab0582c9 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -8,7 +8,7 @@ exercises: 60 ```{r, include=FALSE} ``` -:::::::::::::::::::::::::::::::::::::::::: 目标 +::::::::::::::::::::::::::::::::::::::: objectives - 定义与 R 相关的以下术语:对象、分配、调用、函数、参数、选项。 - 为 R 中的对象分配值。 @@ -22,7 +22,7 @@ exercises: 60 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: 问题 +:::::::::::::::::::::::::::::::::::::::: questions - R 中的第一个命令 @@ -138,7 +138,7 @@ weight_kg # 输入对象的名称也会打印任何内容 体重_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -160,7 +160,7 @@ RStudio 可以轻松注释或取消注释一个段落:在 位置(即不需要选择整行),然后 按 <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -344,7 +344,7 @@ the result back into `weight_g`. 
重要的是列表(`list`)、矩阵(`matrix`)、数据框 (`data.frame`)、因子(`factor`)和数组(`array`)。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -352,7 +352,7 @@ the result back into `weight_g`. 双精度型)、整数类型和逻辑类型。 但是如果我们尝试在一个向量中混合 这些类型会发生什么? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -362,7 +362,7 @@ R 隐式地将它们全部转换为同一类型 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -376,7 +376,7 @@ char_logical <- c("a", "b", "c", TRUE) tricky <- c(1, 2, 3, "4") ``` -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -395,13 +395,13 @@ tricky ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: 您认为为什么会发生这种情况? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -413,7 +413,7 @@ tricky ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -426,7 +426,7 @@ char_logical <- c("a", "b", "c", TRUE) combined_logical <- c(num_logical, char_logical) ``` -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -443,7 +443,7 @@ combined_logical <- c(num_logical, char_logical) ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -453,7 +453,7 @@ _强制_。 这些转换根据层次结构进行, you draw a diagram that represents the hierarchy of how these data types are coerced? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -571,13 +571,13 @@ molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol") molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: 你能弄清楚为什么 `"four" > "five"` 返回 `TRUE` 吗? 
-::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -653,7 +653,7 @@ na.omit(heights) heights[complete.cases(heights)] ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -666,7 +666,7 @@ heights[complete.cases(heights)] 2. 使用函数“median()”计算“高度”向量的中值。 3. 使用 R 找出集合中有多少人的身高超过 67 英寸。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -717,13 +717,13 @@ heights_no_na <- na.omit(heights) 字符和逻辑值有类似的构造函数,分别名为 `character()` 和 `logical()`。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: 字符和逻辑向量的默认值是什么? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -762,7 +762,7 @@ logical(2) ## FALSE 代表(c(1,2,3),5) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -770,7 +770,7 @@ logical(2) ## FALSE 却按顺序获得五个 1、五个 2 和五个 3,该怎么办? 有两种 可能性 - 请参阅 `?rep` 或 `?sort` 寻求帮助。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -839,7 +839,7 @@ seq(从 = 1,到 = 20,长度.out = 3) 样本(1:5,10,替换=TRUE) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -858,7 +858,7 @@ before drawing the random sample. 
通过设置不同的种子来重复。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 From 2281168a198addad016444ed06d8ba485101ddc6 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:07 +0900 Subject: [PATCH 228/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 28 ++++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd index 3cfcff2c7..d056df592 100644 --- a/locale/zh/episodes/25-starting-with-data.Rmd +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -8,7 +8,7 @@ exercises: 三十 ```{r, include=FALSE} ``` -:::::::::::::::::::::::::::::::::::::::::: 目标 +::::::::::::::::::::::::::::::::::::::: objectives - 描述什么是“data.frame”。 - 将 .csv 文件中的外部数据加载到数据框中。 @@ -188,7 +188,7 @@ str(RNA) 注意:这些函数大部分都是“通用的”,除了“data.frame”之外,它们还可以用于其他类型的 对象。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -198,7 +198,7 @@ str(RNA) - 对象“rna”的类别是什么? - 这个对象有多少行、多少列? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -259,7 +259,7 @@ rna$gene # 结果是一个向量 In RStudio, you can use the autocompletion feature to get the full and correct names of the columns. -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -286,7 +286,7 @@ correct names of the columns. 
行为,仅保留 rna 数据集的第一到第六个 行。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -427,7 +427,7 @@ sex ## 重新排序后 - 将“F”和“M”分别重命名为“女性”和“男性”。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -440,7 +440,7 @@ sex ## 重新排序后 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -456,7 +456,7 @@ animal_data <- data.frame( weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -468,7 +468,7 @@ animal_data <- data.frame( ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -494,7 +494,7 @@ country_climate <- data.frame( ) ``` -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -543,7 +543,7 @@ m [^ncol]: 行数或列数就足够了,因为另一个可以从值的长度推断出来。 尝试一下如果值和行数/列数不相加会发生什么。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -551,7 +551,7 @@ m ,其中包含有关当前安装在 计算机上的所有包的信息。 探索它。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案: @@ -577,14 +577,14 @@ colnames(ip) 1 的正态分布中抽取的随机 数据,这可以使用 `rnorm()` 函数完成。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: 构建一个维度为 1000、长度为 3 的正态分布数据矩阵 (平均值为 0,标准差为 1) -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 From a1a1ec6d96937c09929a72ad160fbd5f744600e3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:15 +0900 Subject: [PATCH 229/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 40 ++++++++++++++++----------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 6dc64f731..5067411dd 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -8,7 +8,7 @@ exercises: 75 ```{r, 
include=FALSE} ``` -:::::::::::::::::::::::::::::::::::::::::: 目标 +::::::::::::::::::::::::::::::::::::::: objectives - 描述\*\*`dplyr`**和**`tidyr`\*\*包的用途。 - 描述一些对于 @@ -19,7 +19,7 @@ exercises: 75 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: 问题 +:::::::::::::::::::::::::::::::::::::::: questions - 使用 tidyverse 元包在 R 中进行数据分析 @@ -252,7 +252,7 @@ rna3 <- rna %>% rna3 ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -260,7 +260,7 @@ rna3 其中基因表达高于 50000,并且仅保留列 `基因`、`样本`、`时间`、`表达` 和 `年龄`。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -299,7 +299,7 @@ rna %>% 选择(时间,time_hours,time_mn) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -313,7 +313,7 @@ rna %>% **提示**:思考一下应该如何排列命令来生成 这个数据框! -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -402,13 +402,13 @@ rna %>% 中位数表达 = 中位数(表达)) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 按时间点计算基因“Dok3”的平均表达水平。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -487,7 +487,7 @@ rna %>% 排列(desc(n)) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -496,7 +496,7 @@ rna %>% 3. 选择一个样本并根据生物型评估基因的数量。 4. 
识别与“异常 DNA 甲基化”表型描述相关的基因,并计算它们在时间 0、时间 4 和时间 8 的平均表达(以对数表示)。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -715,7 +715,7 @@ wide_with_NA %>% 转向更宽更长的格式可以成为平衡数据集 的有效方法,这样每个重复都有相同的组成。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 问题 @@ -723,7 +723,7 @@ wide_with_NA %>% 一个宽格式表,给出每只小鼠的基因表达水平。 然后使用`pivot_longer()`函数恢复长格式的表。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -741,7 +741,7 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 问题 @@ -756,7 +756,7 @@ knitr::include_graphics(“fig/Exercise_pivot_W.png”) 重塑之前需要先总结一下! -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -798,7 +798,7 @@ rna_1 %>% ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 问题 @@ -806,7 +806,7 @@ rna_1 %>% 代表基因的平均表达水平,每列代表 不同的时间点。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -870,7 +870,7 @@ rna %>% ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 问题 @@ -879,7 +879,7 @@ rna %>% 。 将此表转换为长格式表,收集计算出的倍数变化。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -989,7 +989,7 @@ full_join(rna_mini,annot2,by = c(“基因”=“external_gene_name” 从上可以看出,第一个表的变量名在连接后的表中保留为 。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -999,7 +999,7 @@ full_join(rna_mini,annot2,by = c(“基因”=“external_gene_name” 函数,连接表 `rna_mini` 和 `annot3`。 基因 _Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_ 和 _mt-Tl1_ 发生了什么? 
-::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 From b4e0c1ae80865908cbcc79b3971385f3c9c81d38 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:22 +0900 Subject: [PATCH 230/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 40 ++++++++++++------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index aa94dd913..7e90cfd88 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -21,7 +21,7 @@ exercises: 60 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: 问题 +:::::::::::::::::::::::::::::::::::::::: questions - R 中的可视化 @@ -146,7 +146,7 @@ rna_plot <- ggplot(data = rna, rna_plot + geom_histogram() ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -198,7 +198,7 @@ ggplot(rna,aes(x = expression_log)) + geom_histogram() 从现在开始我们将研究对数转换的表达值。 -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -213,7 +213,7 @@ ggplot(rna,aes(x = expression_log)) + geom_histogram() `scale_x_log10()`。 将其与之前的图表进行比较。 为什么 现在出现警告信息? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -343,7 +343,7 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, 库(“hexbin”) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -364,7 +364,7 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, 图的相对优势和劣势是什么? 
检查上述散点图 并将其与您创建的六边形箱图进行比较。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -385,7 +385,7 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -393,7 +393,7 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + 在“sample”上的散点图,其中时间以 不同的颜色显示。 这是显示此类数据的好方法吗? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -427,14 +427,14 @@ ggplot(数据 = rna, geom_boxplot(alpha = 0) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 请注意箱线图层是如何位于抖动图层前面的? What do you need to change in the code to put the boxplot below the points? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -465,7 +465,7 @@ ggplot(数据 = rna, 主题(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -476,7 +476,7 @@ ggplot(数据 = rna, `时间`类从整数直接更改为 ggplot 映射中的因子。 为什么 会改变 R 绘制图形的方式? -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -502,7 +502,7 @@ ggplot(data = rna, ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -514,7 +514,7 @@ not see it in a boxplot. 
箱线图的替代方法是小提琴 - 用小提琴图代替箱线图;参见“geom_violin()”。 使用参数“fill”根据时间在小提琴中填充 。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -529,13 +529,13 @@ ggplot(数据 = rna, ::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 - 修改小提琴图以按“性别”填充小提琴。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -660,7 +660,7 @@ ggplot(数据 = mean_exp_by_time_sex, 主题(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -668,7 +668,7 @@ ggplot(数据 = mean_exp_by_time_sex, 感染持续期间,每个染色体的 平均表达如何变化。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 @@ -807,7 +807,7 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 @@ -824,7 +824,7 @@ ggplot(rna, aes(x = expression_log)) + 颜色(参见 [http://www.cookbook-r.com/Graphs/Colors\_(ggplot2)/](https://www.cookbook-r.com/Graphs/Colors_\\(ggplot2\\)/))。 -::::::::::::::: 解决方案 +::::::::::::::: solution ## 解决方案 From f5cd9d691a478f87b025809f36cf8802249c186e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:25 +0900 Subject: [PATCH 231/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index 9a7c9717f..773826bc9 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -22,7 +22,7 @@ exercises: 45 - ¿Qué es un "experimento resumido"? - ¿Qué es un bioconductor? 
-:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Próximos pasos @@ -288,7 +288,7 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío From 665aa04156ae0a60ff7d6d59af6736c6b7e603dc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 19:26:28 +0900 Subject: [PATCH 232/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd index 33035a520..59ff8c466 100644 --- a/locale/zh/episodes/60-next-steps.Rmd +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -288,7 +288,7 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: 挑战 +::::::::::::::::::::::::::::::::::::::: challenge ## 挑战 From 64577129194648f5f2cc082992ebb0cde6a5c4fd Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Wed, 14 Aug 2024 20:21:27 +0900 Subject: [PATCH 233/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index cb18ffda9..c2f243954 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -424,7 +424,7 @@ que se adapte a su propósito. podría facilitarle el comienzo. lo que es posible hacer con R. 
```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} -knitr::include_graphics("fig/gatito-prueba-cosas.jpg") +knitr::include_graphics("fig/kitten-try-things.jpg") ``` ## Buscando ayuda @@ -448,13 +448,13 @@ Si necesita ayuda con una función específica, digamos `barplot()`, puede escribir: ```{r, eval=FALSE, purl=TRUE} -diagrama de barras +?barplot ``` Si sólo necesita recordar los nombres de los argumentos, puede utilizar: ```{r, eval=FALSE, purl=TRUE} -argumentos (lm) +args(lm) ``` ### Quiero usar una función que haga X, debe haber una función para ello pero no sé cuál... @@ -540,7 +540,7 @@ issue). Alternativamente, en particular si su pregunta no está relacionada con un marco de datos, puede guardar cualquier objeto R en un archivo[^export]: ```{r, eval=FALSE, purl=FALSE} -guardarRDS(iris, archivo="/tmp/iris.rds") +saveRDS(iris, file="/tmp/iris.rds") ``` Sin embargo, el contenido de este archivo no es legible por humanos y no se puede @@ -550,7 +550,7 @@ se supone que el archivo descargado está en una carpeta `Descargas` en el directorio de inicio del usuario): ```{r, eval=FALSE, purl=FALSE} -algunos_datos <- readRDS(file="~/Downloads/iris.rds") +some_data <- readRDS(file="~/Downloads/iris.rds") ``` Por último, pero no menos importante, **siempre incluya la salida de `sessionInfo()`** @@ -559,7 +559,7 @@ los paquetes que está usando y otra información que puede ser muy útil para comprender su problema. ```{r, results="show", purl=TRUE} -información de sesión() +sessionInfo() ``` ### ¿Dónde pedir ayuda? 
From edbaa8f5b3e61f53820c67ebd331fd996e813319 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 15 Aug 2024 02:20:51 +0900 Subject: [PATCH 234/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 32fda42fd..43a8f4970 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -45,7 +45,7 @@ _オブジェクト_ に割り当てる必要があります。 オブジェク 代入演算子 `<-` と、それに付けたい値を付ける必要があります。 ```{r, purl=TRUE} -体重_kg <- 55 +weight_kg <- 55 ``` `<-` は代入演算子です。 右側の値を左側の From 3859553bc5be5f99e2609aea81d00c3f348e0d54 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 15 Aug 2024 03:25:57 +0900 Subject: [PATCH 235/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 159 +++++++++++----------- 1 file changed, 79 insertions(+), 80 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 43a8f4970..b0c517156 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -104,23 +104,23 @@ Tidyverse は非常に包括的であり、最初は に値を強制的に出力させることができます。 ```{r, purl=TRUE} -Weight_kg <- 55 # 何も出力しません -(weight_kg <- 55) # しかし、呼び出しを括弧で囲むと `weight_kg` の値 -が出力され、オブジェクトの名前を入力しても同様に出力されます +weight_kg <- 55 # 何も出力しません +(weight_kg <- 55) # しかし、呼び出しを括弧で囲むと `weight_kg` の値が出力され、 +weight_kg # オブジェクトの名前を入力しても同様に出力されます ``` R のメモリに「weight_kg」があるので、それを使って算術演算を行うことができます。 、この重量をポンドに変換したい場合があります (ポンドでの重量は kg での重量の 2.2 倍です)。 ```{r, purl=TRUE} -2.2 * 体重_kg +2.2 * weight_kg ``` オブジェクトに新しい値を割り当てることで、オブジェクトの値を変更することもできます。 ```{r, purl=TRUE} -体重kg <- 57.5 -2.2 * 体重kg +weight_kg <- 57.5 +2.2 * weight_kg ``` これは、 @@ -128,13 +128,13 @@ R のメモリに「weight_kg」があるので、それを使って算術演算 オブジェクト `weight_lb` に保存してみましょう。 ```{r, 
purl=TRUE} -体重ポンド <- 2.2 * 体重キログラム +weight_lb <- 2.2 * weight_kg ``` 次に「weight_kg」を 100 に変更します。 ```{r} -体重_kg <- 100 +weight_kg <- 100 ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -165,11 +165,11 @@ Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd><kbd>押します。 次の各ステートメントの後の値は何ですか? ```{r, purl=TRUE} -質量 <- 47.5 # 質量? -年齢 <- 122 # 年齢? -質量 ← 質量 * 2.0 # 質量? -年齢 <- 年齢 - 20 # 年齢? -質量指数 <- 質量/年齢 # 質量指数? +mass <- 47.5 # 質量? +age <- 122 # 年齢? +mass <- mass * 2.0 # 質量? +age <- age - 20 # 年齢? +mass_index <- mass/age # 質量指数? ``` :::::::::::::::::::::::::::::::::::::::::::::: @@ -209,7 +209,7 @@ b <- sqrt(a) 複数の引数を取ることができる関数 `round()` を試してみましょう。 ```{r, results="show", purl=TRUE} -ラウンド(3.14159) +round(3.14159) ``` ここでは、1 つの引数 `3.14159` を指定して `round()` を呼び出しましたが、 @@ -219,31 +219,31 @@ b <- sqrt(a) 関数のヘルプを参照することができます。 ```{r, results="show", purl=TRUE} -引数(丸め) +args(round) ``` ```{r, eval=FALSE, purl=TRUE} -?ラウンド +?round ``` 別の桁数が必要な場合は、`digits=2` または必要な桁数を入力 ことがわかります。 ```{r, results="show", purl=TRUE} -ラウンド(3.14159、桁数 = 2) +round(3.14159, digits = 2) ``` 定義されているのとまったく同じ順序で引数を指定する場合は に名前を付ける必要はありません。 ```{r, results="show", purl=TRUE} -ラウンド(3.14159, 2) +round(3.14159, 2) ``` 引数に名前を付けた場合は、その順序を入れ替えることができます。 ```{r, results="show", purl=TRUE} -Round(桁数 = 2、x = 3.14159) +round(digits = 2, x = 3.14159) ``` 関数呼び出しの最初にオプションではない引数 ( @@ -263,15 +263,15 @@ Round(桁数 = 2、x = 3.14159) に割り当てることができます。 ```{r, purl=TRUE} -体重g <- c(50, 60, 65, 82) -体重g +weight_g <- c(50, 60, 65, 82) +weight_g ``` ベクトルには文字も含めることができます。 ```{r, purl=TRUE} -分子 <- c("dna", "rna", "タンパク質") -分子 +molecules <- c("dna", "rna", "protein") +molecules ``` ここでは「dna」や「rna」などの引用符が重要です。 引用符 @@ -282,8 +282,8 @@ Round(桁数 = 2、x = 3.14159) ベクトルの内容を検査できる関数が多数あります。 `length()` は、特定のベクトルに含まれる要素の数を示します。 ```{r, purl=TRUE} -長さ(重量_g) -長さ(分子) +length(weight_g) +length(molecules) ``` ベクトルの重要な特徴は、すべての要素が @@ -291,8 +291,8 @@ Round(桁数 = 2、x = 3.14159) 型の要素) を示します。 ```{r, purl=TRUE} -クラス(体重g) -クラス(分子) +class(weight_g) +class(molecules) ``` 関数 
`str()` は、 @@ -300,16 +300,16 @@ Round(桁数 = 2、x = 3.14159) て複雑なオブジェクトを扱う場合に便利な関数です。 ```{r, purl=TRUE} -str(体重g) -str(分子) +str(weight_g) +str(molecules) ``` `c()` 関数を使用して、ベクトルに他の要素を追加できます。 ```{r} -Weight_g <- c(weight_g, 90) # ベクトルの最後に追加 -Weight_g <- c(30,weight_g) # ベクトルの先頭に追加 -Weight_g +weight_g <- c(weight_g, 90) # ベクトルの最後に追加 +weight_g <- c(30, weight_g) # ベクトルの先頭に追加 +weight_g ``` 最初の行では、元のベクトル `weight_g` を取得し、その末尾に @@ -369,8 +369,8 @@ R はそれらをすべて同じ型に暗黙的に変換します。 ```{r, eval=TRUE} num_char <- c(1, 2, 3, "a") num_logical <- c(1, 2, 3, TRUE, FALSE) -char_logical <- c("a", "b", "c", TRUE) ) -トリッキー <- c(1, 2, 3, "4") +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") ``` ::::::::::::::: solution @@ -382,10 +382,10 @@ class(num_char) num_char class(num_logical) num_logical -クラス(char_logical) +class(char_logical) char_logical -クラス(トリッキー) -トリッキー +class(tricky) +tricky ``` :::::::::::::::::::::::: @@ -433,7 +433,7 @@ combined_logical <- c(num_logical, char_logical) 「1」に変換される前に、「1」に変換されます。 ```{r} -結合論理 +combined_logical ``` :::::::::::::::::::::::: @@ -461,29 +461,28 @@ R では、オブジェクトをあるクラスから別のクラスに変換す :::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} -## アトミック ベクトルのタイプは、文字、数値、整数、および -## 論理型であることがわかりました。しかし、これらのタイプを単一の -## ベクトルに混在させようとするとどうなるでしょうか? +## アトミックベクトルは文字型、数値型、整数型、 +## 論理型があることを見てきました。しかし、これらの型を1つの +## ベクトルに混在させようとしたらどうなるでしょうか? -## これらのそれぞれの例では何が起こるでしょうか? (ヒント: `class()` を使用して -## オブジェクトのデータ型を確認します) +## それぞれの例で何が起こるか?(ヒント: `class()` を使って +## オブジェクトのデータ型をチェックしてみましょう) num_char <- c(1, 2, 3, "a") -num_logical <- c(1, 2, 3, TRUE) ) +num_logical <- c(1, 2, 3, TRUE) char_logical <- c("a", "b", "c", TRUE) -トリッキー <- c(1, 2, 3, "4") +tricky <- c(1, 2, 3, "4") -## なぜそれが起こると思いますか? +## なぜそうなると思いますか? -## おそらく、異なる型のオブジェクトがベクトル内の -## 単一の共有型に変換されることに気づいたでしょう。 R では、 -## オブジェクトをあるクラスから別のクラスに変換することを -## _強制_と呼びます。これらの変換は階層に従って行われ、 -## これにより、一部の型が優先的に他の型に強制されます。 -## これらのデータ型がどのように強制されるかの階層を表す図を描くことができますか? 
## +## ベクトル内で、異なる型のオブジェクトが単一の共有型に変換されることにお気づきでしょう。Rでは、 +## オブジェクトをあるクラスから別のクラスに変換することを +## _coercion_と呼んでいます。これらの変換は階層に従って行われ、 +## ある型が優先的に他の型に強制されます。 +## これらのデータ型がどのように強制されるのか、その階層を表す図を描けますか? ``` ## ベクトルのサブセット化 @@ -492,16 +491,16 @@ char_logical <- c("a", "b", "c", TRUE) ます。 例えば: ```{r, results="show", purl=TRUE} -分子 <- c("dna", "rna", "ペプチド", "タンパク質") -分子[2] -分子[c(3, 2)] +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] ``` インデックスを繰り返して、元のオブジェクトよりも要素 が多いオブジェクトを作成することもできます。 ```{r, results="show", purl=TRUE} -more_molecules <- 分子[c(1, 2, 3, 2, 1, 4)] +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] more_molecules ``` @@ -526,8 +525,8 @@ Julia、R などのプログラミング言語は が、`FALSE` は選択しません。 ```{r, purl=TRUE} -重み_g <- c(21, 34, 39, 54, 55) -重み_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] +weight_g <- c(21, 34, 39, 54, 55) +weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] ``` 通常、これらの論理ベクトルは手動で入力されるのではなく、他の関数または論理テストの @@ -546,8 +545,8 @@ Weight_g[weight_g > 50] を超える値のみを選択できます。 AND) または `|` (少なくとも 1 つの条件が true、OR) を使用して複数のテストを結合できます。 ```{r, results="show", purl=TRUE} -体重g[体重g < 30 |体重g > 50] -体重g[体重g >= 30 & 体重g == 21] +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] ``` ここで、「<」は「より小さい」、「>」は「より大きい」、「>=」は @@ -579,7 +578,7 @@ AND) または `|` (少なくとも 1 つの条件が true、OR) を使用して ## 解決 ```{r} -「4」 > 「5」 +"four" > "five" ``` 文字列で `>` または `<` を使用すると、R はそれらのアルファベット順を比較します。 @@ -624,11 +623,11 @@ R はデータセットを分析するように設計されているため、欠 として計算できます。 ```{r} -身長 <- c(2, 4, 4, NA, 6) -平均(身長) -最大(身長) -平均(身長、na.rm = TRUE) -最大(身長、na.rm = TRUE) +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) ``` データに欠損値が含まれている場合は、関数 `is.na()`、`na.omit()`、および `complete.cases()` に @@ -674,7 +673,7 @@ heights_no_na <- na.omit(heights) ``` ```{r, purl=TRUE} -中央値(身長、na.rm = TRUE) +median(heights, na.rm = TRUE) ``` ```{r, purl=TRUE} @@ -796,7 +795,7 @@ seq(from = 1, to = 20, by = 2) ```{r, purl=TRUE} seq(1, 5, 1) 
-seq(1, 5) ## デフォルトは +seq(1, 5) ## default by 1:5 ``` @@ -804,7 +803,7 @@ seq(1, 5) ## デフォルトは の 1 から 20 までの一連の数値を生成するには、次のコマンドを使用します。 ```{r, purl=TRUE} -seq(from = 1、to = 20、length.out = 3) +seq(from = 1, to = 20, length.out = 3) ``` ### ランダムなサンプルと順列 @@ -816,7 +815,7 @@ seq(from = 1、to = 20、length.out = 3) まず各生徒に 1 から 10 までの番号を割り当てます (たとえば、名前のアルファベット順に基づきます)。次に次のようにします。 ```{r, purl=TRUE} -サンプル(1:10) +sample(1:10) ``` さらなる引数がなければ、`sample` はベクトルのすべての @@ -825,7 +824,7 @@ seq(from = 1、to = 20、length.out = 3) 文字をサンプリングします。 ```{r, purl=TRUE} -サンプル(文字、5) +sample(letters, 5) ``` 入力ベクトルよりも大きな出力が必要な場合、または一部の要素を複数回 @@ -833,7 +832,7 @@ seq(from = 1、to = 20、length.out = 3) を `TRUE` に設定する必要があります。 ```{r, purl=TRUE} -サンプル(1:5、10、置換 = TRUE) +sample(1:5, 10, replace = TRUE) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -862,26 +861,26 @@ seq(from = 1、to = 20、length.out = 3) さまざまな順列 ```{r, purl=TRUE} -サンプル(1:10) -サンプル(1:10) +sample(1:10) +sample(1:10) ``` シード 123 と同じ順列 ```{r, purl=TRUE} set.seed(123) -サンプル(1:10) +sample(1:10) set.seed(123) -サンプル(1:10) +sample(1:10) ``` 違う種 ```{r, purl=TRUE} set.seed(1) -サンプル(1:10) +sample(1:10) set.seed(1) -サンプル(1:10) +sample(1:10) ``` :::::::::::::::::::::::: @@ -897,8 +896,8 @@ _N(100, 5)_ と表記) を以下に示します。 ```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} par(mfrow = c(1, 2)) -プロット(密度(rnorm(1000)), メイン = "", サブ = "N(0, 1)") -プロット(密度(rnorm(1000, 100, 5) ))、メイン = ""、サブ = "N(100, 5)") +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") ``` 3 つの引数「n」、「mean」、「sd」は、サンプル From 598f7cb53fe32374375cd735615f08d2682688d1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 15 Aug 2024 23:09:33 +0900 Subject: [PATCH 236/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git 
a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index c2f243954..8d4e3348f 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -244,7 +244,7 @@ las necesidades de su proyecto, pero estos deberían formar la columna vertebral . ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/estructura-directorio-de-trabajo.png") +knitr::include_graphics("fig/working-directory-structure.png") ``` Para este curso, necesitaremos una carpeta `data/` para almacenar nuestros datos sin procesar, @@ -270,7 +270,7 @@ porque solo usaremos un archivo y facilitará las cosas Su directorio de trabajo ahora debería verse así: ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics("fig/r-empezando-cómo-debería-verse-como.png") +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` **La gestión de proyectos** también es aplicable a proyectos de bioinformática, From 82683886705ce51e9b06885869e126a67db85c5d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 01:56:00 +0900 Subject: [PATCH 237/334] New translations 60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index 773826bc9..3b291549a 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -27,7 +27,7 @@ exercises: 45 ## Próximos pasos ```{r, echo=FALSE, message=FALSE} -biblioteca("tidyverse") +library("tidyverse") ``` Los datos en bioinformática suelen ser complejos. 
Para solucionar esto, los desarrolladores de @@ -133,11 +133,11 @@ write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) ```{r} count_matrix <- read.csv("data/count_matrix.csv", - fila.nombres = 1) %>% + row.names = 1) %>% as.matrix() count_matrix[1:5, ] -tenue(count_matrix) +dim(count_matrix) ``` - **Una tabla que describe las muestras**, disponible @@ -172,8 +172,8 @@ Para hacer esto podemos juntar las diferentes partes usando el constructor `SummarizedExperiment`: ```{r, message=FALSE, warning=FALSE} -## BiocManager::install("Experimento resumido") -biblioteca("Experimento resumido") +## BiocManager::install("SummarizedExperiment") +library("SummarizedExperiment") ``` Primero, nos aseguramos de que las muestras estén en el mismo orden en la matriz de conteo @@ -359,7 +359,7 @@ objetos `SummarizedExperiment`? La respuesta es sí, podemos hacerlo con el paqu Recuerde cómo se ve nuestro objeto SummarizedExperiment: ```{r, message=FALSE} -sí +se ``` Cargue `tidySummarizedExperiment` y luego eche un vistazo al objeto se @@ -367,7 +367,7 @@ nuevamente. ```{r, message=FALSE} #BiocManager::install("tidySummarizedExperiment") -biblioteca("tidySummarizedExperiment") +library("tidySummarizedExperiment") se ``` @@ -382,14 +382,14 @@ Si queremos volver a la vista estándar `Experimento resumido`, podemos hacerlo. ```{r} -opciones ("restore_SummarizedExperiment_show" = VERDADERO) +options("restore_SummarizedExperiment_show" = TRUE) se ``` Pero aquí usamos la vista tibble. ```{r} -opciones("restore_SummarizedExperiment_show" = FALSO) +options("restore_SummarizedExperiment_show" = FALSE) se ``` @@ -400,19 +400,19 @@ Podemos usar `filter` para filtrar filas usando una condición, por ejemplo, par todas las filas de una muestra. ```{r} -se %>% filtro(.sample == "GSM2545336") +se %>% filter(.sample == "GSM2545336") ``` Podemos usar `select` para especificar las columnas que queremos ver. 
```{r} -se %>% seleccionar(.muestra) +se %>% select(.sample) ``` Podemos usar `mutate` para agregar información de metadatos. ```{r} -se %>% mutate(centro = "Universidad de Heidelberg") +se %>% mutate(center = "Heidelberg University") ``` También podemos combinar comandos con la canalización tidyverse `%>%`. Por ejemplo, @@ -422,7 +422,7 @@ para cada muestra. ```{r} se %>% group_by(.sample) %>% - resumen(total_counts=sum(counts)) + summarise(total_counts=sum(counts)) ``` Podemos tratar el objeto ordenado SummarizedExperiment como un tibble normal From 5e08dfaf0f7d3bf34717e5130f5ab90ab0990c1a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:58:25 +0900 Subject: [PATCH 238/334] New translations 20-r-rstudio.md (Spanish) --- locale/es/episodes/20-r-rstudio.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/es/episodes/20-r-rstudio.Rmd b/locale/es/episodes/20-r-rstudio.Rmd index 8d4e3348f..73ad92224 100644 --- a/locale/es/episodes/20-r-rstudio.Rmd +++ b/locale/es/episodes/20-r-rstudio.Rmd @@ -528,8 +528,8 @@ puedes usar la función `dput()`. Generará código R que se puede usar para recrear exactamente el mismo objeto que el que está en la memoria: ```{r, results="show", purl=TRUE} -## iris es un marco de datos de ejemplo que viene con R y head() es una -## función que devuelve la primera parte del marco de datos +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame dput(head(iris)) ``` @@ -626,7 +626,7 @@ primero debemos cargarlo para poder usarlo. . Esto se hace con la función `library()`. A continuación, cargamos `ggplot2`. ```{r loadp, eval=FALSE, purl=TRUE} -biblioteca("ggplot2") +library("ggplot2") ``` ### Instalación de paquetes @@ -637,7 +637,7 @@ instalar con la función `install.packages()`. Debajo, por ejemplo, instalamos el paquete `dplyr` del que aprenderemos más adelante. 
```{r craninstall, eval=FALSE, purl=TRUE} -instalar.paquetes("dplyr") +install.packages("dplyr") ``` Este comando instalará el paquete `dplyr` así como todas sus @@ -647,7 +647,7 @@ Bioconductor mantiene otro importante repositorio de paquetes R. Los [paquetes d , concretamente `BiocManager`, que se puede instalar desde CRAN con ```{r, eval=FALSE, purl=TRUE} -instalar.paquetes("BiocManager") +install.packages("BiocManager") ``` Paquetes individuales como `SummarizedExperiment` (lo usaremos @@ -655,7 +655,7 @@ más adelante), `DESeq2` (para análisis RNA-Seq) y cualquier otro de Bioconduct con ` BiocManager::instalar`. ```{r, eval=FALSE, purl=TRUE} -BiocManager::install("Experimento resumido") +BiocManager::install("SummarizedExperiment") BiocManager::install("DESeq2") ``` From 8103f5a71e9e343faeba498a5c37b53437cc9130 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:58:33 +0900 Subject: [PATCH 239/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index d8692d96a..9ba73cd6f 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -20,13 +20,13 @@ exercises: 60 - Subconjunto y extracción de valores de vectores. - Analizar vectores con datos faltantes. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions - Primeros comandos en R -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: > Este episodio se basa en la lección _Análisis de datos y > Visualización en R para ecologistas_ de Data Carpentries. @@ -45,7 +45,7 @@ _objetos_. 
Para crear un objeto, debemos darle un nombre seguido del operador de `<-` y el valor que queremos darle: ```{r, purl=TRUE} -peso_kg <- 55 +weight_kg <- 55 ``` `<-` es el operador de asignación. Asigna valores a la derecha a @@ -104,23 +104,23 @@ puede forzar a R a imprimir el valor usando paréntesis o escribiendo el nombre : ```{r, purl=TRUE} -peso_kg <- 55 # no imprime nada -(peso_kg <- 55) # pero al poner paréntesis alrededor de la llamada imprime el valor de `weight_kg` -peso_kg # y también lo hace al escribir el nombre del objeto +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object ``` Ahora que R tiene `weight_kg` en la memoria, podemos hacer aritmética con él. Por ejemplo , es posible que deseemos convertir este peso en libras (el peso en libras es 2,2 veces el peso en kg): ```{r, purl=TRUE} -2.2 * peso_kg +2.2 * weight_kg ``` También podemos cambiar el valor de un objeto asignándole uno nuevo: ```{r, purl=TRUE} -peso_kg <- 57,5 -2,2 * peso_kg +weight_kg <- 57,5 +2,2 * weight_kg ``` Esto significa que asignar un valor a un objeto no cambia los valores de @@ -128,13 +128,13 @@ otros objetos. Por ejemplo, almacenemos el peso del animal en libras en un nuevo , `weight_lb`: ```{r, purl=TRUE} -peso_lb <- 2,2 * peso_kg +weight_lb <- 2,2 * weight_kg ``` y luego cambie `weight_kg` a 100. 
```{r} -peso_kg <- 100 +weight_kg <- 100 ``` ::::::::::::::::::::::::::::::::::::::: challenge From 20c99c38b4806b743d1e96c3342564858d11e491 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:58:41 +0900 Subject: [PATCH 240/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 138 +++++++++---------- 1 file changed, 69 insertions(+), 69 deletions(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index db31bc071..cb1a19ab3 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -25,7 +25,7 @@ exercises: 30 - Primer análisis de datos en R -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: > Este episodio se basa en la lección _Análisis de datos y > Visualización en R para ecologistas_ de Data Carpentries. @@ -74,13 +74,13 @@ preexistente llamada`"data"\`. ```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv" ) + destfile = "data/rnaseq.csv") ``` Ahora está listo para cargar los datos: ```{r, eval=TRUE, purl=TRUE} -arn <- read.csv("datos/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` Esta declaración no produce ningún resultado porque, como @@ -89,7 +89,7 @@ nuestros datos han sido cargados, podemos ver el contenido del marco de datos escribiendo su nombre: ```{r, eval=FALSE} -arn +rna ``` Guau... eso fue mucho resultado. Al menos significa que los datos se cargaron @@ -98,8 +98,8 @@ usando la función `head()`: ```{r, purl=TRUE} head(rna) -## Prueba también -## Ver(rna) +## Try also +## View(rna) ``` **Nota** @@ -119,9 +119,9 @@ los datos anteriores también podrían haberse cargado usando `read.table()` con el argumento de separación como `,`. 
El código es el siguiente: ```{r, eval=TRUE, purl=TRUE} -rna <- read.table(archivo = "data/rnaseq.csv", +rna <- read.table(file = "data/rnaseq.csv", sep = ",", - encabezado = VERDADERO) + header = TRUE) ``` El argumento del encabezado debe establecerse en VERDADERO para poder leer los encabezados @@ -150,7 +150,7 @@ Podemos ver esto al inspeccionar la estructura <b>str</b>de un marco de datos con la función `str()`: ```{r} -cadena (arn) +str(rna) ``` ## Inspeccionando objetos `data.frame` @@ -188,7 +188,7 @@ contenido/estructura de los datos. ¡Probémoslos! Nota: la mayoría de estas funciones son "genéricas", se pueden usar en otros tipos de objetos además de `data.frame`. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -198,16 +198,16 @@ preguntas? - ¿Cuál es la clase del objeto `rna`? - ¿Cuántas filas y cuántas columnas hay en este objeto? -::::::::::::::: solución +::::::::::::::: solution ## Solución - clase: marco de datos - cuantas filas: 66465, cuantas columnas: 11 -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Indexación y subconjunto de marcos de datos @@ -218,21 +218,21 @@ números de columna. Sin embargo, tenga en cuenta que diferentes formas de espec conducen a resultados con diferentes clases. 
```{r, eval=FALSE, purl=TRUE} -# primer elemento en la primera columna del marco de datos (como vector) +# first element in the first column of the data frame (as a vector) rna[1, 1] -# primer elemento en la sexta columna (como vector) -rna [1, 6] -# primera columna del marco de datos (como un vector) +# first element in the 6th column (as a vector) +rna[1, 6] +# first column of the data frame (as a vector) rna[, 1] -# primera columna del marco de datos (como un data.frame ) +# first column of the data frame (as a data.frame) rna[1] -# primeros tres elementos en la séptima columna (como un vector) +# first three elements in the 7th column (as a vector) rna[1:3, 7] -# la tercera fila del marco de datos (como un data.frame) +# the 3rd row of the data frame (as a data.frame) rna[3, ] -# equivalente a head_rna <- head(rna) +# equivalent to head_rna <- head(rna) head_rna <- rna[1:6, ] -cabeza_rna +head_rna ``` `:` es una función especial que crea vectores numéricos de números enteros en @@ -242,8 +242,8 @@ orden creciente o decreciente, pruebe `1:10` y `10:1` para la instancia También puedes excluir ciertos índices de un marco de datos usando el signo "`-`": ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## Todo el marco de datos, excepto la primera columna -rna[-c(7:66465), ] ## Equivalente a head(rna) +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) ``` Los marcos de datos se pueden subconjuntos llamando a índices (como se mostró anteriormente), @@ -259,7 +259,7 @@ rna$gene # Result is a vector En RStudio, puede utilizar la función de autocompletar para obtener los nombres completos y correctos de las columnas. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -286,7 +286,7 @@ correctos de las columnas. de `head(rna)`, manteniendo solo la primera a la sexta filas del conjunto de datos de rna. 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -294,18 +294,18 @@ correctos de las columnas. ## 1. rna_200 <- rna[200, ] ## 2. -## Guardando `n_rows` para mejorar la legibilidad y reducir la duplicación -n_rows < - nrow(rna) +## Saving `n_rows` to improve readability and reduce duplication +n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. rna_middle <- rna[n_rows / 2, ] -## 4 +## 4. rna_head <- rna[-(7:n_rows), ] ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Factores @@ -320,7 +320,7 @@ conocidos como _niveles_. De forma predeterminada, R siempre ordena los niveles . Por ejemplo, si tienes un factor con 2 niveles: ```{r, purl=TRUE} -sexo <- factor(c("masculino", "femenino", "femenino", "masculino", "femenino")) +sex <- factor(c("male", "female", "female", "male", "female")) ``` R will assign `1` to the level `"female"` and `2` to the level @@ -329,8 +329,8 @@ in this vector is `"male"`). Puedes ver esto usando la función `levels()` y puedes encontrar el número de niveles usando `nlevels()`: ```{r, purl=TRUE} -niveles(sexo) -nniveles(sexo) +levels(sex) +nlevels(sex) ``` A veces, el orden de los factores no importa, otras veces @@ -421,13 +421,13 @@ sexo trama(sexo) ``` -:::::::::::::::::::::::::::::::::::::: desafío +:::::::::::::::::::::::::::::::::::::: challenge ## Desafío: - Cambie el nombre de "F" y "M" a "Mujer" y "Masculino" respectivamente. 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -436,11 +436,11 @@ niveles(sexo) niveles(sexo) <- c("Hombre", "Mujer") ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -456,7 +456,7 @@ animal_data <- data.frame( weight = c(45, 8 1.1, 0.8)) ``` -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -464,11 +464,11 @@ animal_data <- data.frame( - falta una entrada en la columna "sensación" (probablemente para uno de los animales peludos) - falta una coma en la columna de peso -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -494,7 +494,7 @@ country_climate <- data.frame( ) ``` -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -509,9 +509,9 @@ country_climate <- data.frame( str(clima_país) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: La conversión automática de tipos de datos es a veces una bendición, a veces una molestia. Tenga en cuenta que existe, aprenda las reglas y verifique que los datos @@ -537,13 +537,13 @@ columnas[^ncol]. Los valores se ordenan a lo largo de las columnas, como se ilus a continuación. ```{r mat1, purl=TRUE} -m <- matriz(1:9, ncol = 3, nrow = 3) +m <- matrix(1:9, ncol = 3, nrow = 3) m ``` [^ncol]: O el número de filas o columnas es suficiente, ya que el otro se puede deducir de la longitud de los valores. Pruebe qué sucede si los valores y el número de filas/columnas no cuadran. 
-::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -551,7 +551,7 @@ Usando la función `installed.packages()`, cree una matriz de `caracteres` que contenga la información sobre todos los paquetes actualmente instalados en su computadora. Explorarlo. -::::::::::::::: solución +::::::::::::::: solution ## Solución: @@ -568,23 +568,23 @@ rownames(ip) nombres de columna (ip) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: A menudo resulta útil crear grandes matrices de datos aleatorios como datos de prueba . El siguiente ejercicio le pide que cree dicha matriz con datos aleatorios extraídos de una distribución normal de media 0 y desviación estándar 1, lo cual se puede hacer con la función `rnorm()`. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: Construya una matriz de dimensión 1000 por 3 de datos distribuidos normalmente (media 0, desviación estándar 1) -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -595,9 +595,9 @@ dim(m) cabeza(m) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Formato de fechas @@ -702,7 +702,7 @@ Asegurémonos de que todo funcionó correctamente. Una forma de inspeccionar la nueva columna es usar `summary()`: ```{r, purl=TRUE} -resumen(x$date) +summary(x$date) ``` Tenga en cuenta que `ymd()` espera tener el año, mes y día, en ese orden @@ -710,7 +710,7 @@ Tenga en cuenta que `ymd()` espera tener el año, mes y día, en ese orden `dmy()`. ```{r, purl=TRUE} -dmy(pegar(x$day, x$month, x$year, sep = "-")) +dmy(paste(x$day, x$month, x$year, sep = "-")) ``` `lubdridate` tiene muchas funciones para abordar todas las variaciones de fechas. 
@@ -737,12 +737,12 @@ A continuación, creemos una lista que contiene un vector de números, caractere una matriz, un marco de datos y otra lista: ```{r list0, purl=TRUE} -l <- lista (1:10, ## numérico - letras, ## carácter - paquetes.instalados(), ## una matriz - autos, ## un marco.de.datos - lista(1, 2, 3)) ## una lista -longitud(l) +l <- list(1:10, ## numeric + letters, ## character + installed.packages(), ## a matrix + cars, ## a data.frame + list(1, 2, 3)) ## a list +length(l) str(l) ``` @@ -751,9 +751,9 @@ para extraer un solo elemento de esa lista (usando índices o nombres, si la lista es nombrado). ```{r, purl=TRUE} -l[[1]] ## primer elemento -l[1:2] ## una lista de longitud 2 -l[1] ## una lista de longitud 1 +l[[1]] ## first element +l[1:2] ## a list of length 2 +l[1] ## a list of length 1 ``` ## Exportar y guardar datos tabulares {#sec:exportandsave} @@ -767,7 +767,7 @@ y el archivo al que se exportará. Por ejemplo, para exportar los datos , ejecutaríamos: ```{r, eval=FALSE, purl=TRUE} -write.csv(rna, archivo = "data_output/my_rna.csv") +write.csv(rna, file = "data_output/my_rna.csv") ``` This new csv file can now be shared with other collaborators who @@ -777,8 +777,8 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. 
-:::::::::::::::::::::::::::::::::::::::: puntos clave +:::::::::::::::::::::::::::::::::::::::: keypoints - Datos tabulares en R -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From e06310fc27189d4749d46b734722397bf10f1b75 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:58:49 +0900 Subject: [PATCH 241/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index 0b7c7c173..bba8a46e6 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -17,9 +17,9 @@ exercises: 75 cómo remodelar un marco de datos de un formato a otro. - Demuestre cómo unir tablas. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: preguntas +:::::::::::::::::::::::::::::::::::::::: questions - Análisis de datos en R utilizando el metapaquete tidyverse @@ -252,7 +252,7 @@ rna3 <- rna %>% rna3 ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: @@ -299,7 +299,7 @@ rna %>% seleccionar(tiempo, tiempo_horas, tiempo_mn) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -402,7 +402,7 @@ rna %>% expresión_mediana = mediana (expresión)) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -487,7 +487,7 @@ rna %>% organizar(desc(n)) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -715,7 +715,7 @@ wide_with_NA %>% Pasar a formatos más amplios y largos puede ser una forma útil de equilibrar un conjunto de datos para que cada réplica 
tenga la misma composición. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Pregunta @@ -741,7 +741,7 @@ pivot_longer(names_to = "mouse_id", valores_to = "cuentas", -gene) :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Pregunta @@ -798,7 +798,7 @@ rna_1 %>% :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Pregunta @@ -870,7 +870,7 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Pregunta @@ -989,7 +989,7 @@ full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) Como se puede ver arriba, el nombre de la variable de la primera tabla se conserva en la unida. -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío: From 11e446107a261a615b2677d6adbda8c65db055fd Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:58:57 +0900 Subject: [PATCH 242/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 48 ++++++++++++------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd index db7eda723..e454a13f0 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -7,8 +7,8 @@ exercises: 60 ```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/ datos/rnaseq.csv", - destfile = "datos/rnaseq.csv") +download.file(url = 
"https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") ``` ::::::::::::::::::::::::::::::::::::::: objectives @@ -19,16 +19,16 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai - Modifique la estética de un gráfico ggplot existente (incluidas las etiquetas de los ejes y el color). - Cree gráficos complejos y personalizados a partir de datos en un marco de datos. -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: preguntas - Visualización en R -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ```{r vis_setup, echo=FALSE} -arn <- read.csv("datos/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` > Este episodio se basa en la lección _Análisis de datos y @@ -40,14 +40,14 @@ Comenzamos cargando los paquetes requeridos. **`ggplot2`** está incluido en el paquete **`tidyverse`**. ```{r load-package, message=FALSE, purl=TRUE} -biblioteca("tidyverse") +library("tidyverse") ``` Si aún no está en el espacio de trabajo, cargue los datos que guardamos en la lección anterior. 
```{r load-data, eval=FALSE, purl=TRUE} -arn <- read.csv("datos/rnaseq.csv") +rna <- read.csv("data/rnaseq.csv") ``` La Hoja de trucos de visualización de datos @@ -94,14 +94,14 @@ Para construir un ggplot, usaremos la siguiente plantilla básica que se puede usar para diferentes tipos de gráficos: ``` -ggplot(datos = <DATA>, mapeo = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() ``` - use la función `ggplot()` y vincule el gráfico a un \*\*marco de datos \*\* específico usando el argumento `data` ```{r, eval=FALSE} -ggplot(datos = arn) +ggplot(data = rna) ``` - defina un **mapeo** (usando la función estética (`aes`)), seleccionando @@ -110,7 +110,7 @@ ggplot(datos = arn) tamaño, forma, color, etc. ```{r, eval=FALSE} -ggplot(datos = rna, mapeo = aes(x = expresión)) +ggplot(data = rna, mapping = aes(x = expression)) ``` - agregue '**geoms**': geometrías o representaciones gráficas de los datos @@ -128,7 +128,7 @@ Para agregar una geometría (etry) al gráfico, use el operador `+`. 
Usemos `geom_histogram()` primero: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(datos = rna, mapeo = aes(x = expresión)) + +ggplot(data = rna, mapping = aes(x = expression)) + geom_histogram() ``` @@ -140,13 +140,13 @@ configurar fácilmente plantillas de gráficos y explorar cómodamente diferente ```{r, eval=FALSE, purl=TRUE} # Asignar gráfico a una variable rna_plot <- ggplot(data = rna, - mapeo = aes(x = expresión)) + mapping = aes(x = expression)) # Dibujar el gráfico -rna_plot + geom_histograma() +rna_plot + geom_histogram() ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -154,7 +154,7 @@ Probablemente hayas notado un mensaje automático que aparece cuando dibuja el histograma: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna, aes(x = expresión)) + +ggplot(rna, aes(x = expression)) + geom_histogram() ``` @@ -198,7 +198,7 @@ ggplot(rna, aes(x = expresión_log)) + geom_histogram() De ahora en adelante trabajaremos en los valores de expresión transformados logarítmicamente. 
-::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -343,7 +343,7 @@ ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0, biblioteca("hexbin") ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -385,7 +385,7 @@ ggplot(data = rna_fc, mapeo = aes(x = time_4_vs_0, y = time_8_vs_0)) + :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -427,7 +427,7 @@ ggplot(datos = rna, geom_boxplot( alfa = 0) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -465,7 +465,7 @@ ggplot(datos = rna, tema(axis.text.x = element_text(ángulo = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -502,7 +502,7 @@ ggplot(data = rna, :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -529,7 +529,7 @@ ggplot(data = rna, :::::::::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -660,7 +660,7 @@ ggplot(data = mean_exp_by_time_sex, tema(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío @@ -809,7 +809,7 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: desafío +::::::::::::::::::::::::::::::::::::::: challenge ## Desafío From e50eefe4d6d41d818506f6431baaba70e992f1b3 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 02:59:04 +0900 Subject: [PATCH 243/334] New translations 
60-next-steps.md (Spanish) --- locale/es/episodes/60-next-steps.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/locale/es/episodes/60-next-steps.Rmd b/locale/es/episodes/60-next-steps.Rmd index 3b291549a..56fc00cde 100644 --- a/locale/es/episodes/60-next-steps.Rmd +++ b/locale/es/episodes/60-next-steps.Rmd @@ -186,9 +186,9 @@ stopifnot(colnames(count_matrix) == sample_metadata$sample) ``` ```{r} -se <- Experimento resumido (ensayos = lista (recuentos = matriz_conteo), - colData = muestra_metadatos, - filaData = gene_metadata) +se <- SummarizedExperiment(assays = list(counts = count_matrix), + colData = sample_metadata, + rowData = gene_metadata) se ``` @@ -224,8 +224,8 @@ Usando esta estructura de datos, podemos acceder a la matriz de expresión con la función `ensayo`: ```{r} -head(ensayo(se)) -dim(ensayo(se)) +head(assay(se)) +dim(assay(se)) ``` Podemos acceder a los metadatos de muestra usando la función `colData`: @@ -238,8 +238,8 @@ dim(colData(se)) También podemos acceder a los metadatos de la característica usando la función `rowData`: ```{r} -head(filaData(se)) -dim(filaData(se)) +head(rowData(se)) +dim(rowData(se)) ``` ### Subconjunto de un experimento resumido @@ -257,7 +257,7 @@ se1 ```{r} colData(se1) -filaData(se1) +rowData(se1) ``` También podemos usar la función `colData()` para crear un subconjunto de algo de @@ -322,8 +322,8 @@ Verifique que obtenga los mismos valores usando la tabla larga `rna`. ```{r, purl=FALSE} rna |> - filtro(gen %in% c("Asl", "Apod", "Cyd2d22")) |> - filtro(tiempo!= 4) |> seleccionar(expresión ) + filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> + filter(time != 4) |> select(expression) ``` ::::::::::::::::::::::::: @@ -343,7 +343,7 @@ También podemos agregar información a los metadatos. Supongamos que desea agregar el centro donde se recolectaron las muestras... 
```{r} -colData(se)$center <- rep("Universidad de Illinois", nrow(colData(se))) +colData(se)$center <- rep("University of Illinois", nrow(colData(se))) colData(se) ``` From a608488f0998ddb16e7053a97033007ba55e992e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 04:07:16 +0900 Subject: [PATCH 244/334] New translations 25-starting-with-data.md (Spanish) --- locale/es/episodes/25-starting-with-data.Rmd | 90 ++++++++++---------- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/locale/es/episodes/25-starting-with-data.Rmd b/locale/es/episodes/25-starting-with-data.Rmd index cb1a19ab3..95e676fcc 100644 --- a/locale/es/episodes/25-starting-with-data.Rmd +++ b/locale/es/episodes/25-starting-with-data.Rmd @@ -340,9 +340,9 @@ por un tipo particular de análisis. Aquí, una forma de reordenar nuestros nive en el vector `sex` sería: ```{r, purl=TRUE} -sexo ## orden actual -sexo <- factor(sexo, niveles = c("masculino", "femenino")) -sexo ## después de reordenar +sex ## current order +sex <- factor(sex, levels = c("male", "female")) +sex ## after re-ordering ``` En la memoria de R, estos factores están representados por números enteros (1, 2, 3), @@ -359,7 +359,7 @@ representadas por cada nivel de factor. Veamos la cantidad de hombres y mujeres en nuestros datos. ```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -trama (sexo) +plot(sex) ``` ### Convertirse en personaje @@ -368,7 +368,7 @@ Si necesita convertir un factor en un vector de caracteres, utilice `as.character(x)`. 
```{r, purl=TRUE} -como.personaje(sexo) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -415,10 +415,10 @@ Si queremos cambiar el nombre de estos factores, basta con cambiar sus niveles: ```{r, purl=TRUE} -niveles(sexo) -niveles(sexo) <- c("M", "F") -sexo -trama(sexo) +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) ``` :::::::::::::::::::::::::::::::::::::: challenge @@ -432,8 +432,8 @@ trama(sexo) ## Solución ```{r, eval=TRUE, purl=TRUE} -niveles(sexo) -niveles(sexo) <- c("Hombre", "Mujer") +levels(sex) +levels(sex) <- c("Male", "Female") ``` ::::::::::::::::::::::::: @@ -486,12 +486,12 @@ Comprueba tus conjeturas usando `str(country_climate)`: ```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( - country = c("Canadá", "Panamá", "Sudáfrica", "Australia"), - clima = c("frío", "caliente" , "templado", "caliente/templado"), - temperatura = c(10, 30, 18, "15"), - hemisferio_norte = c(VERDADERO, VERDADERO, FALSO, "FALSO" ), - has_kangaroo = c(FALSO, FALSO, FALSO, 1) -) + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) + ) ``` ::::::::::::::: solution @@ -500,13 +500,13 @@ country_climate <- data.frame( ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( - country = c("Canadá", "Panamá", "Sudáfrica", "Australia"), - clima = c("frío", "caliente" , "templado", "caliente/templado"), - temperatura = c(10, 30, 18, "15"), - hemisferio_norte = c(VERDADERO, VERDADERO, FALSO, "FALSO" ), - has_kangaroo = c(FALSO, FALSO, FALSO, 1) + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), + northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), + has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) -str(clima_país) +str(country_climate) ``` 
::::::::::::::::::::::::: @@ -556,16 +556,16 @@ su computadora. Explorarlo. ## Solución: ```{r pkg_sln, eval=FALSE, purl=TRUE} -## crea la matriz -ip <- install.packages() +## create the matrix +ip <- installed.packages() head(ip) -## prueba también View(ip) -## número de paquete +## try also View(ip) +## number of package nrow(ip) -## nombres de todos los paquetes instalados +## names of all installed packages rownames(ip) -## tipo de información que tenemos sobre cada paquete -nombres de columna (ip) +## type of information we have about each package +colnames(ip) ``` ::::::::::::::::::::::::: @@ -590,9 +590,9 @@ Construya una matriz de dimensión 1000 por 3 de datos distribuidos normalmente ```{r rnormmat_sln, purl=TRUE} set.seed(123) -m <- matriz(rnorm(3000), ncol = 3) +m <- matrix(rnorm(3000), ncol = 3) dim(m) -cabeza(m) +head(m) ``` ::::::::::::::::::::::::: @@ -641,7 +641,7 @@ explícitamente con `library(lubridate)`. Comience cargando el paquete requerido: ```{r loadlibridate, message=FALSE, purl=TRUE} -biblioteca("lubricar") +library("lubridate") ``` `ymd()` toma un vector que representa año, mes y día, y lo convierte @@ -653,16 +653,16 @@ con el formato "AAAA-MM-DD". Creemos un objeto de fecha e inspeccionemos la estructura: ```{r, purl=TRUE} -mi_fecha <- ymd("2015-01-01") -str(mi_fecha) +my_date <- ymd("2015-01-01") +str(my_date) ``` Ahora peguemos el año, el mes y el día por separado; obtenemos el mismo resultado: ```{r, purl=TRUE} -# sep indica el carácter a utilizar para separar cada componente +# sep indicates the character to use to separate each component my_date <- ymd(paste("2015", "1", "1", sep = "-")) -str(my_date ) +str(my_date) ``` Familiaricémonos ahora con una canalización típica de manipulación de fechas @@ -670,10 +670,10 @@ Familiaricémonos ahora con una canalización típica de manipulación de fechas "mes" y "día". 
```{r, purl=TRUE} -x <- data.frame(año = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - mes = c(2, 3, 3, 10, 1 , 8, 3, 4, 5, 5), - día = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - valor = c (4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) +x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` @@ -682,20 +682,20 @@ a partir de las columnas `año`, `mes` y `día` de `x` usando `paste()`: ```{r, purl=TRUE} -pegar(x$year, x$month, x$day, sep = "-") +paste(x$year, x$month, x$day, sep = "-") ``` Este vector de caracteres se puede utilizar como argumento para `ymd()`: ```{r, purl=TRUE} -ymd(pegar(x$year, x$month, x$day, sep = "-")) +ymd(paste(x$year, x$month, x$day, sep = "-")) ``` El vector "Fecha" resultante se puede agregar a "x" como una nueva columna llamada "fecha": ```{r, purl=TRUE} -x$date <- ymd(pegar(x$year, x$month, x$day, sep = "-")) -str(x) # observe la nueva columna, con 'fecha' como clase +x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) +str(x) # notice the new column, with 'date' as the class ``` Asegurémonos de que todo funcionó correctamente. 
Una forma de inspeccionar la From ddffe0116adcd1ccc7ebf3f775a69a96809fcb2c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Fri, 16 Aug 2024 04:07:25 +0900 Subject: [PATCH 245/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 100 ++++++++++++++++---------------- 1 file changed, 50 insertions(+), 50 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index bba8a46e6..9a12a01ea 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -23,12 +23,12 @@ exercises: 75 - Análisis de datos en R utilizando el metapaquete tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/ datos/rnaseq.csv", - destfile = "datos/rnaseq.csv") +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") ``` > Este episodio se basa en la lección _Análisis de datos y @@ -73,15 +73,15 @@ Si realizó la configuración, ya debería haber instalado el paquete tidyverse. Comprueba si lo tienes intentando cargarlo desde la biblioteca: ```{r, message=FALSE, purl=TRUE} -## cargar los paquetes tidyverse, incl. dplyr -biblioteca("tidyverse") +## load the tidyverse packages, incl. dplyr +library("tidyverse") ``` Si recibió un mensaje de error `no hay ningún paquete llamado 'tidyverse'` entonces aún no ha instalado el paquete para esta versión de R. Para instalar el tipo de paquete **`tidyverse`**: ```{r, eval=FALSE, purl=TRUE} -BiocManager::instalar("tidyverse") +BiocManager::install("tidyverse") ``` Si tuvo que instalar el paquete **`tidyverse`**, ¡no olvide cargarlo en esta sesión de R usando el comando `library()` arriba! 
@@ -95,7 +95,7 @@ function (notice the `_` instead of the `.`), from the tidyverse package ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## ver los datos +## view the data rna ``` @@ -128,14 +128,14 @@ de esta función es el marco de datos (`rna`), y los argumentos siguientes son las columnas que se deben conservar. ```{r, purl=TRUE} -seleccionar (arn, gen, muestra, tejido, expresión) +select(rna, gene, sample, tissue, expression) ``` Para seleccionar todas las columnas _excepto_ algunas, coloque un "-" delante de la variable para excluirla. ```{r, purl=TRUE} -seleccionar (arn, -tejido, -organismo) +select(rna, -tissue, -organism) ``` Esto seleccionará todas las variables en `rna` excepto `tejido` @@ -144,8 +144,8 @@ y `organismo`. Para elegir filas según un criterio específico, utilice `filtro()`: ```{r, purl=TRUE} -filter(rna, sexo == "Masculino") -filter(rna, sex == "Masculino" & infección == "No infectado") +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") ``` Ahora imaginemos que estamos interesados en los homólogos humanos de los genes @@ -165,7 +165,7 @@ Algunos genes de ratón no tienen homólogos humanos. Estos se pueden recuperar algo es un `NA`. ```{r, purl=TRUE} -filtro (genes, is.na (hsapiens_homolog_associated_gene_name)) +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` Si queremos conservar sólo genes de ratón que tienen un homólogo humano, podemos @@ -174,7 +174,7 @@ cada fila donde hsapiens\_homolog\_associated\_gene\_name _no es_ un `NA`. 
```{r, purl=TRUE} -filtro(genes, !is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` ## Tubería @@ -186,8 +186,8 @@ Con pasos intermedios, crea un marco de datos temporal y lo usa como entrada para la siguiente función, como esta: ```{r, purl=TRUE} -rna2 <- filter(rna, sexo == "Masculino") -rna3 <- select(rna2, gen, muestra, tejido, expresión) +rna2 <- filter(rna, sex == "Male") +rna3 <- select(rna2, gene, sample, tissue, expression) rna3 ``` @@ -199,7 +199,7 @@ También puedes anidar funciones (es decir, una función dentro de otra), así: ```{r, purl=TRUE} -rna3 <- select(filtro(rna, sexo == "Masculino"), gen, muestra, tejido, expresión) +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) rna3 ``` @@ -228,8 +228,8 @@ incluir explícitamente el marco de datos como un argumento para las funciones ` ```{r, purl=TRUE} rna %>% - filtro(sexo == "Masculino") %>% - seleccionar(gen, muestra, tejido, expresión) + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) ``` A algunos les puede resultar útil leer la tubería como la palabra "entonces". Por ejemplo, @@ -246,8 +246,8 @@ podemos asignarle un nuevo nombre: ```{r, purl=TRUE} rna3 <- rna %>% - filter(sexo == "Masculino") %>% - select(gen, muestra, tejido, expresión) + filter(sex == "Male") %>% + select(gene, sample, tissue, expression) rna3 ``` @@ -260,7 +260,7 @@ Usando tuberías, subconjunto de datos de `rna` para mantener las observaciones donde el gen tiene una expresión superior a 50000, y retenga solo las columnas `gene`, `sample `, `tiempo`, `expresión` y `edad`. 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -272,9 +272,9 @@ rna %>% seleccionar(gen, muestra , tiempo, expresión, edad) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Mudar @@ -313,7 +313,7 @@ fenotipo\_descripción y con una expresión logarítmica superior a 5. **Sugerencia**: piense en cómo se deben ordenar los comandos para producir este marco de datos. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -408,7 +408,7 @@ rna %>% Calcule el nivel de expresión medio del gen "Dok3" por puntos de tiempo. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -496,7 +496,7 @@ rna %>% 3. Elija una muestra y evalúe la cantidad de genes por biotipo. 4. Identifique los genes asociados con la descripción del fenotipo de "metilación anormal del ADN" y calcule su expresión media (en log) en el tiempo 0, el tiempo 4 y el tiempo 8. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -723,7 +723,7 @@ A partir de la tabla de ARN, utilice la función `pivot_wider()` para crear una tabla de formato amplio que proporcione los niveles de expresión genética en cada ratón. Luego use la función `pivot_longer()` para restaurar una tabla de formato largo. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -756,7 +756,7 @@ knitr::include_graphics("fig/Exercise_pivot_W.png") ¡Necesitará resumir antes de remodelar! 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -794,9 +794,9 @@ rna_1 %>% ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -806,7 +806,7 @@ Utilice el conjunto de datos `rna` para crear una matriz de expresión donde cad represente los niveles de expresión medios de genes y las columnas representen los diferentes puntos de tiempo. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -814,8 +814,8 @@ Primero calculemos la expresión media por gen y por tiempo. ```{r} rna %>% - group_by(gen, tiempo) %>% - resumen(exp_media = media(expresión)) + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) ``` antes de usar la función pivot\_wider() @@ -866,9 +866,9 @@ rna %>% select(gene, time4) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -879,21 +879,21 @@ una nueva columna que contenga los cambios de pliegue entre el punto de tiempo 8 entre el punto de tiempo 8 y el punto de tiempo 4. Convierta esta tabla en una tabla de formato largo que recopile los cambios de pliegue calculados. 
-::::::::::::::: solución +::::::::::::::: solution ## Solución A partir del tibble rna\_time: ```{r} -tiempo_rna +rna_time ``` Calcular cambios de pliegue: ```{r} rna_time %>% - mutar(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` Y use la función pivot\_longer(): @@ -906,9 +906,9 @@ rna_time %>% time_8_vs_0:time_8_vs_4) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Unir mesas @@ -935,7 +935,7 @@ columnas y 10 líneas. ```{r} rna_mini <- rna %>% - select(gen, muestra, expresión) %>% + select(gene, sample, expression) %>% head(10) rna_mini ``` @@ -949,7 +949,7 @@ puedes usar el código R a continuación para descargarlo directamente a la carp ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", destfile = "data/annot1.csv") -annot1 <- read_csv(archivo = "datos/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") annot1 ``` @@ -961,7 +961,7 @@ común. Estas variables se denominan claves. Las claves se utilizan para hacer c observaciones en diferentes tablas. ```{r} -unión_completa(rna_mini, annot1) +full_join(rna_mini, annot1) ``` En la vida real, las anotaciones genéticas a veces se etiquetan de manera diferente. @@ -974,7 +974,7 @@ tú mismo y muévelo a `data/ `o use el código R a continuación. ```{r, message=FALSE} download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", destfile = "data/annot2.csv") -annot2 <- read_csv(archivo = "datos/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") annot2 ``` @@ -999,7 +999,7 @@ y coloque la tabla en su repositorio de datos. Usando la función `full_join()` , une las tablas `rna_mini` y `annot3`. ¿Qué ha sucedido con los genes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_ y _mt-Tl1_? 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -1012,9 +1012,9 @@ Los genes _Klk6_ solo están presentes en `rna_mini`, mientras que los genes _mt _mt-Rnr2_ y _mt-Tl1_ están solo está presente en la tabla `annot3`. Sus valores respectivos para las variables de la tabla se han codificado como faltantes. -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Exportar datos @@ -1037,11 +1037,11 @@ volver a generarlos. Usemos `write_csv()` para guardar la tabla rna\_wide que hemos creado anteriormente. ```{r, purl=TRUE, eval=FALSE} -write_csv(rna_wide, archivo = "data_output/rna_wide.csv") +write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: puntos clave +:::::::::::::::::::::::::::::::::::::::: keypoints - Datos tabulares en R usando el metapaquete tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From ffcbd932169591ffc46ea3dd31f7c3c808c23170 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 00:48:00 +0900 Subject: [PATCH 246/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 24 ++++++++++----------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd index 49797bf78..bc4cf8c17 100644 --- a/locale/zh/episodes/10-data-organisation.Rmd +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -739,11 +739,11 @@ most likely incorrectly display the data in columns. 
这是因为 例如,我们的数据可能如下所示: ``` -species_id、属、种类、分类群 -AB、Amphispiza、bilineata、鸟类 -AH、Ammospermophilus、harrisi、啮齿动物、未普查 -AS、Ammodramus、savannarum、鸟类 -BA、Baiomys、taylori、啮齿动物 +species_id,genus,species,taxa +AB,Amphispiza,bilineata,Bird +AH,Ammospermophilus,harrisi,Rodent, not censused +AS,Ammodramus,savannarum,Bird +BA,Baiomys,taylori,Rodent ``` 在记录“AH,Ammospermophilus,harrisi,Rodent, not censused”中,“taxa”的 @@ -752,7 +752,7 @@ BA、Baiomys、taylori、啮齿动物 得到如下内容: ```{r, results="markup", fig.cap="The risks of having commas inside comma-separated data.", echo=FALSE, purl=FALSE, out.width="80%", fig.align="center"} -knitr::include_graphics(“fig/csv-mistake.png”) +knitr::include_graphics("fig/csv-mistake.png") ``` “taxa” 的值被分成两列(而不是被放在“D”列中的 @@ -768,11 +768,11 @@ knitr::include_graphics(“fig/csv-mistake.png”) 数据可能如下所示: ``` -species_id、属、种类、分类群 -“AB”、“Amphispiza”、“bilineata”、“鸟类” -“AH”、“Ammospermophilus”、“harrisi”、“啮齿类,未经普查” -“AS”、“Ammodramus”、“savannarum”、“鸟类” -“BA”、“Baiomys”、“taylori”、“啮齿类” +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" ``` 现在在 Excel 中将此文件作为 `csv` 打开不会导致出现多余的 @@ -829,4 +829,4 @@ knitr::include_graphics("fig/analysis.png") - 良好的数据组织是任何研究项目的基础。 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 4cf8a069a71eb54d3c4a4d016866934aa9c1b6de Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 01:46:40 +0900 Subject: [PATCH 247/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index 4ab0582c9..fc16b5509 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd 
@@ -45,7 +45,7 @@ _对象_分配\*值。 要创建一个对象,我们需要给它一个名字, 赋值运算符`<-`,以及我们想要赋予它的值: ```{r, purl=TRUE} -体重_kg <- 55 +weight_kg <- 55 ``` `<-` 是赋值运算符。 它将右侧的值分配给左侧的 @@ -105,16 +105,16 @@ first. 您可以安装 对象名称来强制 R 打印该值: ```{r, purl=TRUE} -weight_kg <- 55 # 不打印任何内容 -(weight_kg <- 55) # 但是在调用周围加上括号会打印 `weight_kg` 的值 -weight_kg # 输入对象的名称也会打印任何内容 +weight_kg <- 55 # doesn't print anything +(weight_kg <- 55) # but putting parenthesis around the call prints the value of `weight_kg` +weight_kg # and so does typing the name of the object ``` 现在 R 内存中有了“weight_kg”,我们可以用它进行算术运算。 例如,对于 来说,我们可能希望将这个重量转换为磅(磅重量是公斤重量的 2.2 倍): ```{r, purl=TRUE} -2.2 * 体重_公斤 +2.2 * weight_kg ``` 我们还可以通过分配新值来更改对象的值: From cc76eac51ea0d7b9d214cc37d38831a5d9a79d4d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 01:46:42 +0900 Subject: [PATCH 248/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index eec428127..4d4b94120 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -138,12 +138,12 @@ facilement configurer des modèles de tracé et explorer facilement différents tracés, de sorte que le tracé ci-dessus peut également être généré avec un code comme celui-ci : ```{r, eval=FALSE, purl=TRUE} -# Attribuer un tracé à une variable +# Assign plot to a variable rna_plot <- ggplot(data = rna, mapping = aes(x = expression)) -# Dessiner le tracé -rna_plot + geom_histogramme() +# Draw the plot +rna_plot + geom_histogram() ``` ::::::::::::::::::::::::::::::::::::::: défi From b7fc03d1bf3420a54e1222028beef36197d8bcfa Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 01:46:44 +0900 Subject: [PATCH 249/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 2 +- 1 
file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd index e454a13f0..428b1b86f 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -21,7 +21,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai :::::::::::::::::::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::: preguntas +:::::::::::::::::::::::::::::::::::::::: questions - Visualización en R From a7d158416240b22600e9b1bcab06b8553e44cf71 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 01:46:50 +0900 Subject: [PATCH 250/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 74 ++++++++++++------------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index 7e90cfd88..af17a4f86 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -28,7 +28,7 @@ exercises: 60 ::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r vis_setup, echo=FALSE} -rna <- read.csv(“数据/rnaseq.csv”) +rna <- read.csv("data/rnaseq.csv") ``` > 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 @@ -40,14 +40,14 @@ rna <- read.csv(“数据/rnaseq.csv”) **`tidyverse`** 包中。 ```{r load-package, message=FALSE, purl=TRUE} -图书馆(“tidyverse”) +library("tidyverse") ``` 如果还不在工作区中,请加载我们在上一节 课中保存的数据。 ```{r load-data, eval=FALSE, purl=TRUE} -rna <- read.csv(“数据/rnaseq.csv”) +rna <- read.csv("data/rnaseq.csv") ``` 数据可视化秘籍 @@ -138,11 +138,11 @@ ggplot(数据 = rna,映射 = aes(x = 表达式)) + 绘图,因此上述绘图也可以使用如下代码生成: ```{r, eval=FALSE, purl=TRUE} -# 将图分配给变量 +# Assign plot to a variable rna_plot <- ggplot(data = rna, - map = aes(x = expression)) + mapping = aes(x = expression)) -# 绘制图 +# Draw the plot rna_plot + geom_histogram() ``` @@ -218,7 +218,7 @@ 
ggplot(rna,aes(x = expression_log)) + geom_histogram() ## 解决方案 ```{r, eval=TRUE, purl=TRUE, echo=TRUE} -ggplot(数据 = rna,映射 = aes(x = 表达式))+ +ggplot(data = rna,mapping = aes(x = expression))+ geom_histogram() + scale_x_log10() ``` @@ -279,7 +279,7 @@ rna_fc <- rna %>% 选择(基因,时间, 几何对象: ```{r create-ggplot-object, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point() ``` @@ -287,15 +287,15 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + 例如,我们可以添加透明度(“alpha”)以避免过度绘图: ```{r adding-transparency, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_point(alpha = 0.3) ``` 我们还可以为所有点添加颜色: ```{r adding-colors, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + - geom_point(alpha = 0.3,颜色 = “蓝色”) +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, color = "blue") ``` 或者为了给图中每个基因赋予不同的颜色,你可以使用一个向量作为 @@ -304,8 +304,8 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + 示例,我们用 `gene_biotype` 进行着色: ```{r color-by-gene_biotype1, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + - geom_point(alpha = 0.3,aes(颜色 = gene_biotype)) +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + + geom_point(alpha = 0.3, aes(color = gene_biotype)) ``` @@ -314,7 +314,7 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + 映射将由 `aes()` 中设置的 x 轴和 y 轴决定。 ```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, color = gene_biotype)) + geom_point(alpha = 0.3) ``` @@ -323,20 +323,20 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, 函数添加对角线: 
```{r adding-diag, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, - 颜色 = gene_biotype)) + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + geom_point(alpha = 0.3) + - geom_abline(截距 = 0) + geom_abline(intercept = 0) ``` 请注意,我们可以将 geom 层从 `geom_point` 更改为 `geom_jitter`,颜色仍由 `gene_biotype` 决定。 ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, - 颜色 = gene_biotype)) + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, + color = gene_biotype)) + geom_jitter(alpha = 0.3) + - geom_abline(截距 = 0) + geom_abline(intercept = 0) ``` ```{r, echo=FALSE, message=FALSE} @@ -373,11 +373,11 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0, ``` ```{r, purl=TRUE} -库(“hexbin”) +library("hexbin") -ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + +ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_hex() + - geom_abline(截距 = 0) + geom_abline(intercept = 0) ``` @@ -398,8 +398,8 @@ ggplot(数据 = rna_fc,映射 = aes(x = time_4_vs_0,y = time_8_vs_0)) + ## 解决方案 ```{r, eval=TRUE, purl=TRUE} -ggplot(数据 = rna,映射 = aes(y = expression_log,x = 样本)) + - geom_point(aes(颜色 = 时间)) +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + + geom_point(aes(color = time)) ``` ::::::::::::::::::::::::: @@ -412,8 +412,8 @@ ggplot(数据 = rna,映射 = aes(y = expression_log,x = 样本)) + : ```{r boxplot, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log, x = 样本)) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + geom_boxplot() ``` @@ -421,9 +421,9 @@ ggplot(数据 = rna, 测量的数量及其分布: ```{r boxplot-with-points, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log,x = 样本)) + - geom_jitter(alpha = 0.2,颜色 = “番茄”) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + 
geom_boxplot(alpha = 0) ``` @@ -441,10 +441,10 @@ you need to change in the code to put the boxplot below the points? 我们应该交换这两个几何对象的顺序: ```{r boxplot-with-points2, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log,x = 样本)) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + geom_boxplot(alpha = 0) + - geom_jitter(alpha = 0.2,颜色 = “番茄”) + geom_jitter(alpha = 0.2, color = "tomato") ``` ::::::::::::::::::::::::: @@ -458,11 +458,11 @@ ggplot(数据 = rna, 对角线标签的角度: ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log,x = 样本)) + - geom_jitter(alpha = 0.2,颜色 = “tomato”) + +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) + - 主题(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::::::::::::::::: challenge From 0537a6d3aa3eeeb8a305d34ec710afcabc5e1af5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 07:59:25 +0900 Subject: [PATCH 251/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 40 ++++++++++++------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index af17a4f86..257a69cfb 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -481,21 +481,21 @@ ggplot(data = rna, ## 解决方案 ```{r boxplot-color-time, cache=FALSE, purl=TRUE} -# 时间作为整数 +# time as integer ggplot(data = rna, - map = aes(y = expression_log, + mapping = aes(y = expression_log, x = sample)) + geom_jitter(alpha = 0.2, aes(color = time)) + geom_boxplot(alpha = 0) + - theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle 
= 90, hjust = 0.5, vjust = 0.5)) -# 时间作为因子 +# time as factor ggplot(data = rna, - map = aes(y = expression_log, + mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, aes(颜色 = as.因子(时间))) + + geom_jitter(alpha = 0.2, aes(color = as.factor(time))) + geom_boxplot(alpha = 0) + - 主题(轴.文本.x = element_text(角度 = 90, hjust = 0.5, vjust = 0.5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -519,10 +519,10 @@ not see it in a boxplot. 箱线图的替代方法是小提琴 ## 解决方案 ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log,x = 样本)) + - geom_violin(aes(填充 = as.factor(时间))) + - 主题(轴.文本.x = element_text(角度 = 90,hjust = 0.5,vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = as.factor(time))) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -540,10 +540,10 @@ ggplot(数据 = rna, ## 解决方案 ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(数据 = rna, - 映射 = aes(y = expression_log,x = 样本)) + - geom_violin(aes(填充 = 性别)) + - 主题(axis.text.x = element_text(角度 = 90,hjust = 0.5,vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -577,7 +577,7 @@ mean_exp_by_time ,y 轴为平均表达量: ```{r first-time-series, purl=TRUE} -ggplot(数据 = mean_exp_by_time,映射 = aes(x = 时间,y = mean_exp)) + +ggplot(data = mean_exp_by_time, mapping = aes(x = time, y = mean_exp)) + geom_line() ``` @@ -586,8 +586,8 @@ ggplot(数据 = mean_exp_by_time,映射 = aes(x = 时间,y = mean_exp)) + 修改美学函数以包含 `group = gene`: ```{r time-series-by-gene, purl=TRUE} -ggplot(数据 = mean_exp_by_time, - 映射 = aes(x = 时间, y = mean_exp, 组 = 基因)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, group = gene)) + geom_line() ``` @@ 
-595,8 +595,8 @@ ggplot(数据 = mean_exp_by_time, `color` 也会自动对数据进行分组): ```{r time-series-with-colors, purl=TRUE} -ggplot(数据 = mean_exp_by_time, - 映射 = aes(x = 时间, y = mean_exp, 颜色 = 基因)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() ``` From 9d950966481054e56cea15ac19bbf76f5fe8bba0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 08:59:17 +0900 Subject: [PATCH 252/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 32 ++++++++++++++--------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd index d154be2fb..b4721abfa 100644 --- a/locale/zh/episodes/20-r-rstudio.Rmd +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -95,7 +95,7 @@ R 具有用于图像分析、GIS、时间序列、人口 例如生物信息学数据分析。 ```{r, fig.cap="Exponential increase of the number of packages available on [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. 
From the R Journal, Volume 10/2, December 2018.", echo=FALSE, message=FALSE} -针织::包括_图形(“图/cran.png”) +knitr::include_graphics("fig/cran.png") ``` ### R 可处理各种形状和大小的数据 @@ -243,7 +243,7 @@ knitr::include_graphics(“fig/utf8.png”) 目录的骨干。 ```{r, results="markup", fig.cap="Example of a working directory structure.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics(“图/工作目录结构.png”) +knitr::include_graphics("fig/working-directory-structure.png") ``` 对于本课程,我们将需要一个 `data/` 文件夹来存储我们的原始数据 @@ -269,7 +269,7 @@ CSV 文件时,我们将使用 `data_output/`,以及 `fig_output/` 文件夹 您的工作目录现在应如下所示: ```{r, results="markup", fig.cap="How it should look like at the beginning of this lesson", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics(“fig/r-starting-how-it-should-look-like.png”) +knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") ``` **项目管理**也适用于生物信息学项目, @@ -424,7 +424,7 @@ address your actual research questions. 对于初学者来说,从头开始编 的丰富内容的表面。 ```{r kitten, results="markup", echo=FALSE, purl=FALSE, out.width="400px", fig.align="center"} -knitr::include_graphics(“fig/kitten-try-things.jpg”) +knitr::include_graphics("fig/kitten-try-things.jpg") ``` ## 寻求帮助 @@ -432,7 +432,7 @@ knitr::include_graphics(“fig/kitten-try-things.jpg”) ### 使用内置的 RStudio 帮助界面搜索有关 R 函数的更多信息 ```{r rstudiohelp, fig.cap="RStudio help interface.", results="markup", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics(“fig/rstudiohelp.png”) +knitr::include_graphics("fig/rstudiohelp.png") ``` 获得帮助的最快方法之一是使用 RStudio 帮助 @@ -448,13 +448,13 @@ knitr::include_graphics(“fig/rstudiohelp.png”) 可以输入: ```{r, eval=FALSE, purl=TRUE} -?条形图 +?barplot ``` 如果您只需要提醒自己参数的名称,您可以使用: ```{r, eval=FALSE, purl=TRUE} -参数(lm) +args(lm) ``` ### 我想使用一个执行 X 的函数,一定有一个函数可以执行该操作,但我不知道是哪一个...... 
@@ -465,7 +465,7 @@ However, this only looks through the installed packages for help pages with a match to your search request ```{r, eval=FALSE, purl=TRUE} -??克鲁斯卡尔 +??kruskal ``` 如果您找不到所需内容,您可以使用 @@ -528,8 +528,8 @@ Wickham 的这篇文章。 重新创建与内存中完全相同的对象: ```{r, results="show", purl=TRUE} -## iris 是 R 附带的一个示例数据框,head() 是一个 -## 函数,返回数据框的第一部分 +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame dput(head(iris)) ``` @@ -540,7 +540,7 @@ dput(head(iris)) to a data frame, you can save any R object to a file[^export]: ```{r, eval=FALSE, purl=FALSE} -保存RDS(iris,文件=“/tmp/iris.rds”) +saveRDS(iris, file="/tmp/iris.rds") ``` 但是,该文件的内容不是人类可读的,并且无法 @@ -550,7 +550,7 @@ to a data frame, you can save any R object to a file[^export]: 用户主目录中的 `Downloads` 文件夹中): ```{r, eval=FALSE, purl=FALSE} -some_data <- readRDS(file="~/Downloads/iris.rds") +some_data <- readRDS(file="~/Downloads/iris.rds") ``` 最后,但同样重要的一点是,**始终包含 `sessionInfo()`** @@ -559,7 +559,7 @@ some_data <- readRDS(file="~/Downloads/iris.rds") 理解您的问题非常有帮助的信息。 ```{r, results="show", purl=TRUE} -会话信息() +sessionInfo() ``` ### 去哪里寻求帮助? @@ -625,7 +625,7 @@ some_data <- readRDS(file="~/Downloads/iris.rds") `library()` 函数完成的。 下面,我们加载“ggplot2”。 ```{r loadp, eval=FALSE, purl=TRUE} -库(“ggplot2”) +library("ggplot2") ``` ### 安装软件包 @@ -636,7 +636,7 @@ installed with the `install.packages()` function. 下面,例如, 我们安装稍后将了解的 `dplyr` 包。 ```{r craninstall, eval=FALSE, purl=TRUE} -安装.包(“dplyr”) +install.packages("dplyr") ``` 此命令将安装“dplyr”包及其所有 @@ -646,7 +646,7 @@ installed with the `install.packages()` function. 
下面,例如, 即 `BiocManager` 进行管理和安装,可以使用以下命令从 CRAN 安装: ```{r, eval=FALSE, purl=TRUE} -安装.软件包(“BiocManager”) +install.packages("BiocManager") ``` Individual packages such as `SummarizedExperiment` (we will use it From f194fbad166d8c5fe1abc9325289a1e91227d8f6 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 08:59:26 +0900 Subject: [PATCH 253/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 132 +++++++++++----------- 1 file changed, 66 insertions(+), 66 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index fc16b5509..4a49e5189 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -120,8 +120,8 @@ weight_kg # and so does typing the name of the object 我们还可以通过分配新值来更改对象的值: ```{r, purl=TRUE} -体重_kg <- 57.5 -2.2 * 体重_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` 这意味着为一个对象分配一个值不会改变 @@ -129,13 +129,13 @@ weight_kg # and so does typing the name of the object 对象`weight_lb`中: ```{r, purl=TRUE} -体重磅 <- 2.2 * 体重公斤 +weight_lb <- 2.2 * weight_kg ``` 然后将“weight_kg”改为100。 ```{r} -体重_kg <- 100 +weight_kg <- 100 ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -145,7 +145,7 @@ weight_kg # and so does typing the name of the object 您认为对象“weight_lb”的当前内容是什么? 126\.5 还是 220? -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 评论 @@ -167,14 +167,14 @@ RStudio 可以轻松注释或取消注释一个段落:在 以下每个语句后面的值是什么? ```{r, purl=TRUE} -mass <- 47.5 # 质量? -age <- 122 # 年龄? -mass <- mass * 2.0 # 质量? -age <- age - 20 # 年龄? -mass_index <- mass/age # 质量指数? +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? 
``` -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 函数及其参数 @@ -188,7 +188,7 @@ made available by importing R _packages_ (more on that later). 函数 被称为_调用_该函数。 函数调用的一个示例是: ```{r, eval=FALSE, purl=FALSE} -b <- sqrt (a) +b <- sqrt(a) ``` 这里,将 `a` 的值传递给 `sqrt()` 函数,`sqrt()` 函数 @@ -211,7 +211,7 @@ b <- sqrt (a) 让我们尝试一个可以接受多个参数的函数:“round()”。 ```{r, results="show", purl=TRUE} -圆形(3.14159) +round(3.14159) ``` 在这里,我们仅用一个参数“3.14159”调用了“round()”,并且它 @@ -221,31 +221,31 @@ b <- sqrt (a) 帮助。 ```{r, results="show", purl=TRUE} -参数(圆形) +args(round) ``` ```{r, eval=FALSE, purl=TRUE} -?圆形的 +?round ``` 我们看到,如果我们想要不同数量的数字,我们可以 输入“digits=2”或任意我们想要的数字。 ```{r, results="show", purl=TRUE} -四舍五入(3.14159,数字 = 2) +round(3.14159, digits = 2) ``` 如果您按照定义参数的完全相同的顺序提供参数,则 不必命名它们: ```{r, results="show", purl=TRUE} -圆形(3.14159,2) +round(3.14159, 2) ``` 如果你确实命名了参数,你可以切换它们的顺序: ```{r, results="show", purl=TRUE} -四舍五入(数字 = 2,x = 3.14159) +round(digits = 2, x = 3.14159) ``` 很好的做法是,在函数调用中将非可选参数(比如 @@ -265,15 +265,15 @@ R 的主力。向量由一系列值组成,例如 分配给一个新的对象“weight_g”: ```{r, purl=TRUE} -权重_g <- c(50, 60, 65, 82) -权重_g +weight_g <- c(50, 60, 65, 82) +weight_g ``` 向量也可以包含字符: ```{r, purl=TRUE} -分子 <- c("dna", "rna", "蛋白质") -分子 +molecules <- c("dna", "rna", "protein") +molecules ``` 这里“dna”、“rna”等周围的引号至关重要。 如果没有 @@ -285,8 +285,8 @@ R 的主力。向量由一系列值组成,例如 向量的内容。 `length()` 告诉你特定向量中有多少个元素: ```{r, purl=TRUE} -长度(重量_g) -长度(分子) +length(weight_g) +length(molecules) ``` 向量的一个重要特征是,所有元素都是 @@ -294,8 +294,8 @@ R 的主力。向量由一系列值组成,例如 类型): ```{r, purl=TRUE} -类别(权重_g) -类别(分子) +class(weight_g) +class(molecules) ``` 函数“str()”概述了 @@ -303,8 +303,8 @@ R 的主力。向量由一系列值组成,例如 大型复杂对象时,它是一个很有用的函数: ```{r, purl=TRUE} -str(重量_g) -str(分子) +str(weight_g) +str(molecules) ``` 您可以使用 `c()` 函数将其他元素添加到向量中: @@ -360,7 +360,7 @@ R 隐式地将它们全部转换为同一类型 ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: 
::::::::::::::::::::::::::::::::::::::: challenge @@ -381,19 +381,19 @@ tricky <- c(1, 2, 3, "4") ## 解决方案 ```{r, purl=TRUE} -类(num_char) +class(num_char) num_char -类(num_logical) +class(num_logical) num_logical -类(char_logical) +class(char_logical) char_logical -类(tricky) +class(tricky) tricky ``` ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -411,7 +411,7 @@ tricky ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -436,12 +436,12 @@ combined_logical <- c(num_logical, char_logical) 转换为 `"1"` 之前,会先转换为 `1`。 ```{r} -组合逻辑 +combined_logical ``` ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -461,7 +461,7 @@ types are coerced? 
::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} ## We've seen that atomic vectors can be of type character, numeric, integer, and @@ -495,17 +495,17 @@ tricky <- c(1, 2, 3, "4") 在方括号中提供一个或多个索引。 例如: ```{r, results="show", purl=TRUE} -分子 <- c("dna", "rna", "肽", "蛋白质") -分子[2] -分子[c(3, 2)] +molecules <- c("dna", "rna", "peptide", "protein") +molecules[2] +molecules[c(3, 2)] ``` 我们还可以重复索引来创建一个比原始对象具有更多元素 的对象: ```{r, results="show", purl=TRUE} -更多分子 <- 分子[c(1, 2, 3, 2, 1, 4)] -更多分子 +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] +more_molecules ``` R 索引从 1 开始。 Fortran、MATLAB、 @@ -517,10 +517,10 @@ Julia 和 R 等编程语言从 1 开始计数,因为这是人类 的所有元素,除了一些指定元素: ```{r} -分子 ## 所有分子 -分子[-1] ## 除第一个之外的所有分子 -分子[-c(1, 3)] ## 除第 1/3 个之外的所有分子 -分子[c(-1, -3)] ## 除第 1/3 个之外的所有分子 +molecules ## all molecules +molecules[-1] ## all but the first one +molecules[-c(1, 3)] ## all but 1st/3rd ones +molecules[c(-1, -3)] ## all but 1st/3rd ones ``` ## 条件子集 @@ -538,10 +538,10 @@ weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] 只想选择 50 以上的值: ```{r, purl=TRUE} -## 将返回满足 -条件的索引的逻辑值为 TRUE ## 条件 +## will return logicals with TRUE for the indices that meet +## the condition weight_g > 50 -## 因此我们可以使用它来仅选择高于 50 的值 +## so we can use this to select only the values above 50 weight_g[weight_g > 50] ``` @@ -549,8 +549,8 @@ weight_g[weight_g > 50] AND)或 `|`(至少有一个条件为真,OR)组合多个测试: ```{r, results="show", purl=TRUE} -权重_g[权重_g < 30 | 权重_g > 50] -权重_g[权重_g >= 30 & 权重_g == 21] +weight_g[weight_g < 30 | weight_g > 50] +weight_g[weight_g >= 30 & weight_g == 21] ``` 这里,`<` 代表“小于”,`>` 代表“大于”,`>=` 代表 @@ -566,7 +566,7 @@ AND)或 `|`(至少有一个条件为真,OR)组合多个测试: ```{r, purl=TRUE} molecules <- c("dna", "rna", "protein", "peptide") -molecules[molecules == "rna" |molecules == "dna"] # 返回 rna 和 dna +molecules[molecules == "rna" | molecules == "dna"] # returns both rna and dna molecules %in% c("rna", "dna", "metabolite", 
"peptide", "glycerol") molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ``` @@ -582,7 +582,7 @@ molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ## 解决方案 ```{r} -“四” > “五” +"four" > "five" ``` 在字符串上使用 `>` 或 `<` 时,R 会比较它们的字母顺序。 @@ -591,7 +591,7 @@ molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 名字 @@ -601,9 +601,9 @@ molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ```{r} x <- c(1, 5, 3, 5, 10) -names(x) ## 没有名字 +names(x) ## no names names(x) <- c("A", "B", "C", "D", "E") -names(x) ## 现在我们有名字了 +names(x) ## now we have names ``` 当向量具有名称时,除了索引之外,还可以通过其 @@ -627,11 +627,11 @@ the data you are working with include missing values. 此功能 结果,同时忽略缺失值。 ```{r} -高度 <- c(2, 4, 4, NA, 6) -平均值(高度) -最大值(高度) -平均值(高度, na.rm = TRUE) -最大值(高度, na.rm = TRUE) +heights <- c(2, 4, 4, NA, 6) +mean(heights) +max(heights) +mean(heights, na.rm = TRUE) +max(heights, na.rm = TRUE) ``` If your data include missing values, you may want to become familiar @@ -872,10 +872,10 @@ before drawing the random sample. 
与种子 123 相同的排列 ```{r, purl=TRUE} -设置.种子(123) -样本(1:10) -设置.种子(123) -样本(1:10) +set.seed(123) +sample(1:10) +set.seed(123) +sample(1:10) ``` 不同的种子 From caccceac0227f02be1e581dcfba98266a6fc074c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 08:59:34 +0900 Subject: [PATCH 254/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 102 ++++++++++++------------ 1 file changed, 51 insertions(+), 51 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index 257a69cfb..44564a699 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -6,9 +6,9 @@ exercises: 60 --- ```{r loaddata_vis, echo=FALSE, purl=FALSE, message=FALSE} -如果(!file.exists(“data/rnaseq.csv”)) -下载.file(url = “https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv”, - 目标文件 = “data/rnaseq.csv”) +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") ``` ::::::::::::::::::::::::::::::::::::::: objectives @@ -94,14 +94,14 @@ ggplot 图形是通过添加新元素一步步构建的。 以这种方式添加 用于不同类型的绘图: ``` -ggplot(数据 = <DATA>, 映射 = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() +ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() ``` - 使用 `ggplot()` 函数并使用 `data` 参数将图绑定到特定的 **data 框架** ```{r, eval=FALSE} -ggplot(数据 = rna) +ggplot(data = rna) ``` - 定义一个**映射**(使用美学(`aes`)函数),通过 @@ -110,7 +110,7 @@ ggplot(数据 = rna) 大小、形状、颜色等。 ```{r, eval=FALSE} -ggplot(数据 = rna,映射 = aes(x = 表达式)) +ggplot(data = rna, mapping = aes(x = expression)) ``` - 添加'**geoms**' - 几何图形,或图中 @@ -128,7 +128,7 @@ ggplot(数据 = rna,映射 = aes(x = 表达式)) `geom_histogram()`: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(数据 = rna,映射 = aes(x = 表达式)) + +ggplot(data = rna, mapping = aes(x = expression)) + geom_histogram() ``` @@ -609,9 
+609,9 @@ ggplot(data = mean_exp_by_time, 使用它为每个基因绘制一条跨时间的线图: ```{r first-facet, purl=TRUE} -ggplot(数据 = mean_exp_by_time, - 映射 = aes(x = 时间, y = mean_exp)) + geom_line() + - facet_wrap(~ 基因) +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + + facet_wrap(~ gene) ``` 这里,所有子图的 x 轴和 y 轴具有相同的比例。 您 @@ -619,10 +619,10 @@ ggplot(数据 = mean_exp_by_time, y 轴自由缩放: ```{r first-facet-scales, purl=TRUE} -ggplot(数据 = mean_exp_by_time, - 映射 = aes(x = 时间,y = mean_exp)) + +ggplot(data = mean_exp_by_time, + mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~基因,scales = "free_y") + facet_wrap(~ gene, scales = "free_y") ``` 现在我们想根据小鼠的性别来分割每个图中的线。 @@ -641,10 +641,10 @@ mean_exp_by_time_sex `color`(在单个图内)按性别进一步划分来制作分面图: ```{r facet-by-gene-and-sex, cache=FALSE, purl=TRUE} -ggplot(数据 = mean_exp_by_time_sex, - 映射 = aes(x = 时间, y = mean_exp, 颜色 = 性别)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ 基因, scales = "free_y") + facet_wrap(~ gene, scales = "free_y") ``` 通常,带有白色背景的图表在打印时看起来更易读。 我们 @@ -652,12 +652,12 @@ ggplot(数据 = mean_exp_by_time_sex, 此外,我们可以删除网格: ```{r facet-by-gene-and-sex-white-bg, cache=FALSE, purl=TRUE} -ggplot(数据 = mean_exp_by_time_sex, - 映射 = aes(x = 时间,y = mean_exp,颜色 = 性别)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ 基因,scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + - 主题(panel.grid = element_blank()) + theme(panel.grid = element_blank()) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -674,13 +674,13 @@ ggplot(数据 = mean_exp_by_time_sex, ```{r mean-exp-chromosome-time-series, purl=TRUE} mean_exp_by_chromosome <- rna %>% - group_by(chromosome_name, time) %>% - 总结(mean_exp = mean(expression_log)) + group_by(chromosome_name, time) %>% + summarize(mean_exp = mean(expression_log)) -ggplot(数据 = mean_exp_by_chromosome, 映射 = aes(x = 
时间, +ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, y = mean_exp)) + geom_line() + - facet_wrap(~ chromosome_name, scales = "free_y") + facet_wrap(~ chromosome_name, scales = "free_y") ``` ::::::::::::::::::::::::: @@ -831,12 +831,12 @@ ggplot(rna, aes(x = expression_log)) + 例如,基于此图: ```{r, purl=TRUE} -ggplot(数据 = mean_exp_by_time_sex, - 映射 = aes(x = 时间,y = mean_exp,颜色 = 性别)) + +ggplot(data = mean_exp_by_time_sex, + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~ 基因,scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + - 主题(panel.grid = element_blank()) + theme(panel.grid = element_blank()) ``` 我们可以通过以下方式定制它: @@ -920,8 +920,8 @@ count_gene_chromosome exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), color=sex)) + geom_boxplot(alpha = 0) + - labs(y = "平均基因 exp", - x = "时间") + theme(legend.position = "无") + labs(y = "Mean gene exp", + x = "time") + theme(legend.position = "none") exp_boxplot_sex ``` @@ -937,9 +937,9 @@ exp_boxplot_sex ``` ```{r patchworkplot1, purl=TRUE} -库(“patchwork”) +library("patchwork") count_gene_chromosome + exp_boxplot_sex -## 或 count_gene_chromosome | exp_boxplot_sex +## or count_gene_chromosome | exp_boxplot_sex ``` ```{r patchwork2, purl=TRUE} @@ -980,8 +980,8 @@ count_gene_chromosome / ``` ```{r gridarrange-example, message=FALSE, fig.width=10, purl=TRUE} -库(“gridExtra”) -grid.arrange(count_gene_chromosome,exp_boxplot_sex,ncol = 2) +library("gridExtra") +grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) ``` 除了用于进行简单 @@ -1005,13 +1005,13 @@ grid.arrange(count_gene_chromosome,exp_boxplot_sex,ncol = 2) my_plot <- ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~gene, scales = "free_y") + - labs(title = "按感染持续时间划分的平均基因表达", - x = "感染持续时间(天)", - y = "平均基因表达") + - guides(color=guide_legend(title="性别")) + + facet_wrap(~ gene, scales = "free_y") + + labs(title = "Mean gene 
expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + theme_bw() + - theme(axis.text.x = element_text(colour = "royalblue4",size = 12), + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), axis.text.y = element_text(colour = "royalblue4", size = 12), text = element_text(size = 16), panel.grid = element_line(colour="lightsteelblue1"), @@ -1019,11 +1019,11 @@ my_plot <- ggplot(data = mean_exp_by_time_sex, ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, height = 10) -# 这也适用于 grid.arrange() 图 +# This also works for grid.arrange() plots combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, - ncol = 2, widths = c(4, 6)) -ggsave(“fig_output/combo_plot_chromosome_sex.png”,combo_plot, - width = 10,dpi = 300) + ncol = 2, widths = c(4, 6)) +ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, + width = 10, dpi = 300) ``` 注意:参数“width”和“height”也决定了保存的图中的字体大小 @@ -1056,12 +1056,12 @@ graphics_ ,简单而快速。 它基于_画家或画布 ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "第一层,用 plot(1:20) 制作") +plot(1:20, main = "First layer, produced with plot(1:20)") -plot(1:20, main = "一条水平红线,用 abline(h = 10) 添加") +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") abline(h = 10, col = "red") -plot(1:20, main = "一个矩形,用 rect(5, 5, 15, 15) 添加") +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") abline(h = 10, col = "red") rect(5, 5, 15, 15, lwd = 3) ``` @@ -1075,9 +1075,9 @@ rect(5, 5, 15, 15, lwd = 3) ```{r plotmethod, fig.width=8, fig.height=8, fig.cap="Plotting boxplots (top) and histograms (bottom) vectors (left) or a matrices (right)."} par(mfrow = c(2, 2)) boxplot(rnorm(100), - main = "rnorm(100) 的箱线图") + main = "Boxplot of rnorm(100)") boxplot(matrix(rnorm(100), ncol = 10), - main = "matrix(rnorm(100), 
ncol = 10) 的箱线图") + main = "Boxplot of matrix(rnorm(100), ncol = 10)") hist(rnorm(100)) hist(matrix(rnorm(100), ncol = 10)) ``` @@ -1098,8 +1098,8 @@ convoluted interface. `lattice` 包的一个很好的参考是@latticebook。 -:::::::::::::::::::::::::::::::::::::::: 关键点 +:::::::::::::::::::::::::::::::::::::::: keypoints - R 中的可视化 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From ca257645e9afcc4c56b6d38ad9cc8d04e8df197b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 09:24:19 +0900 Subject: [PATCH 255/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index 4a49e5189..e7cfe0f99 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -918,7 +918,7 @@ rnorm(5, 100, 5) 数据结构的基础知识,我们已经准备好开始处理更大的数据,并且 了解数据框。 -:::::::::::::::::::::::::::::::::::::::: 关键点 +:::::::::::::::::::::::::::::::::::::::: keypoints - 如何与 R 交互 From e1d31e500f194ae6a9ee39b9ff38175acc4c6be5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 09:24:26 +0900 Subject: [PATCH 256/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 130 +++++++++---------- 1 file changed, 65 insertions(+), 65 deletions(-) diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd index d056df592..e8c1b4ec4 100644 --- a/locale/zh/episodes/25-starting-with-data.Rmd +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -80,7 +80,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai 您现在可以加载数据了: ```{r, eval=TRUE, purl=TRUE} -rna <- read.csv(“数据/rnaseq.csv”) +rna <- read.csv("data/rnaseq.csv") ``` 
此语句不会产生任何输出,因为您可能 @@ -89,7 +89,7 @@ rna <- read.csv(“数据/rnaseq.csv”) 输入其名称来查看数据框的内容: ```{r, eval=FALSE} -核糖核酸 +rna ``` 哇... 那是大量的输出。 至少这意味着数据已正确加载 @@ -98,7 +98,7 @@ rna <- read.csv(“数据/rnaseq.csv”) ```{r, purl=TRUE} head(rna) -## 也尝试 +## Try also ## View(rna) ``` @@ -119,9 +119,9 @@ head(rna) 以分隔参数 `,` 来加载。 代码如下: ```{r, eval=TRUE, purl=TRUE} -rna <- read.table(file = “data/rnaseq.csv”, - sep = “,”, - header = TRUE) +rna <- read.table(file = "data/rnaseq.csv", + sep = ",", + header = TRUE) ``` 必须将 header 参数设置为 TRUE 才能读取 @@ -150,7 +150,7 @@ words, when importing spreadsheets from your hard drive (or the web). 的 <b>str</b>结构时,我们可以看到这一点: ```{r} -str(RNA) +str(rna) ``` ## 检查 `data.frame` 对象 @@ -218,19 +218,19 @@ str(RNA) 坐标的不同方式会导致不同类别的结果。 ```{r, eval=FALSE, purl=TRUE} -# 数据框第一列的第一个元素(作为向量) +# first element in the first column of the data frame (as a vector) rna[1, 1] -# 第六列的第一个元素(作为向量) +# first element in the 6th column (as a vector) rna[1, 6] -# 数据框的第一列(作为向量) +# first column of the data frame (as a vector) rna[, 1] -# 数据框的第一列(作为数据框) +# first column of the data frame (as a data.frame) rna[1] -# 第七列的前三个元素(作为向量) +# first three elements in the 7th column (as a vector) rna[1:3, 7] -# 数据框的第 3 行(作为数据框) +# the 3rd row of the data frame (as a data.frame) rna[3, ] -# 等同于 head_rna <- head(rna) +# equivalent to head_rna <- head(rna) head_rna <- rna[1:6, ] head_rna ``` @@ -242,18 +242,18 @@ head_rna 您还可以使用“-”符号排除数据框的某些索引: ```{r, eval=FALSE, purl=TRUE} -rna[, -1] ## 整个数据框,除了第一列 -rna[-c(7:66465), ] ## 等同于 head(rna) +rna[, -1] ## The whole data frame, except the first column +rna[-c(7:66465), ] ## Equivalent to head(rna) ``` 数据框可以通过调用索引(如前所示) 进行子集化,也可以通过直接调用其列名进行子集化: ```{r, eval=FALSE, purl=TRUE} -rna["gene"] # 结果是一个数据框 -rna[, "gene"] # 结果是一个向量 -rna[["gene"]] # 结果是一个向量 -rna$gene # 结果是一个向量 +rna["gene"] # Result is a data.frame +rna[, "gene"] # Result is a vector +rna[["gene"]] # Result is a vector +rna$gene # Result is a vector ``` In RStudio, you can use the autocompletion feature to 
get the full and @@ -294,7 +294,7 @@ correct names of the columns. ## 1. rna_200 <- rna[200, ] ## 2. -## 保存 `n_rows` 以提高可读性并减少重复 +## Saving `n_rows` to improve readability and reduce duplication n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. @@ -320,7 +320,7 @@ careful when treating them as strings. 对级别进行排序。 例如,如果你有一个具有 2 个级别的因子: ```{r, purl=TRUE} -性别 <- 因子(c(“男性”,“女性”,“女性”,“男性”,“女性”)) +sex <- factor(c("male", "female", "female", "male", "female")) ``` R 将为级别“女性”分配“1”,为级别 @@ -329,8 +329,8 @@ R 将为级别“女性”分配“1”,为级别 `levels()` 来看到这一点,并且可以使用 `nlevels()` 来找到级别的数量: ```{r, purl=TRUE} -水平(性别) -n水平(性别) +levels(sex) +nlevels(sex) ``` 有时,因素的顺序并不重要,有时你 @@ -340,9 +340,9 @@ n水平(性别) 的一种方法是: ```{r, purl=TRUE} -sex ## 当前顺序 +sex ## current order sex <- factor(sex, levels = c("male", "female")) -sex ## 重新排序后 +sex ## after re-ordering ``` 在 R 的内存中,这些因素由整数 (1, 2, 3) 表示, @@ -359,7 +359,7 @@ sex ## 重新排序后 和女性的数量。 ```{r firstfactorplot, fig.cap="Bar plot of the number of females and males.", purl=TRUE} -情节(性别) +plot(sex) ``` ### 转换为字符 @@ -368,7 +368,7 @@ sex ## 重新排序后 `as.character(x)`。 ```{r, purl=TRUE} -作为角色(性别) +as.character(sex) ``` <!-- ### Numeric factors --> @@ -415,13 +415,13 @@ sex ## 重新排序后 级别即可: ```{r, purl=TRUE} -水平(性别) -水平(性别)<- c("M", "F") -性别 -情节(性别) +levels(sex) +levels(sex) <- c("M", "F") +sex +plot(sex) ``` -:::::::::::::::::::::::::::::::::::::: 挑战 +:::::::::::::::::::::::::::::::::::::: challenge ## 挑战: @@ -432,8 +432,8 @@ sex ## 重新排序后 ## 解决方案 ```{r, eval=TRUE, purl=TRUE} -水平(性别) -水平(性别)<-c(“男”,“女”) +levels(sex) +levels(sex) <- c("Male", "Female") ``` ::::::::::::::::::::::::: @@ -451,8 +451,8 @@ sex ## 重新排序后 ```{r, eval=FALSE} animal_data <- data.frame( - animal = c(狗, 猫, 海参, 海胆), - feel = c("毛茸茸的", "柔软的", "多刺的"), + animal = c(dog, cat, sea cucumber, sea urchin), + feel = c("furry", "squishy", "spiny"), weight = c(45, 8 1.1, 0.8)) ``` @@ -486,12 +486,12 @@ animal_data <- data.frame( ```{r, eval=FALSE, purl=TRUE} country_climate <- data.frame( - country = c("加拿大", 
"巴拿马", "南非", "澳大利亚"), - Climate = c("冷", "热", "温和", "热/温和"), - Temperature = c(10, 30, 18, "15"), + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) -) + ) ``` ::::::::::::::: solution @@ -500,9 +500,9 @@ country_climate <- data.frame( ```{r, eval=TRUE, purl=TRUE} country_climate <- data.frame( - country = c("加拿大", "巴拿马", "南非", "澳大利亚"), - Climate = c("冷", "热", "温和", "热/温和"), - Temperature = c(10, 30, 18, "15"), + country = c("Canada", "Panama", "South Africa", "Australia"), + climate = c("cold", "hot", "temperate", "hot/temperate"), + temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) @@ -537,7 +537,7 @@ annoyance. 请注意它的存在,学习规则,并仔细检查您在 R 中导 。 ```{r mat1, purl=TRUE} -m <- 矩阵(1:9,ncol = 3,nrow = 3) +m <- matrix(1:9, ncol = 3, nrow = 3) m ``` @@ -556,15 +556,15 @@ m ## 解决方案: ```{r pkg_sln, eval=FALSE, purl=TRUE} -## 创建矩阵 -ip <- mounted.packages() +## create the matrix +ip <- installed.packages() head(ip) -## 也尝试 View(ip) -## 包的数量 +## try also View(ip) +## number of package nrow(ip) -## 所有已安装包的名称 +## names of all installed packages rownames(ip) -## 关于每个包我们拥有的信息类型 +## type of information we have about each package colnames(ip) ``` @@ -589,10 +589,10 @@ colnames(ip) ## 解决方案 ```{r rnormmat_sln, purl=TRUE} -设置.种子(123) -m <- 矩阵(rnorm(3000),ncol = 3) -dim(m) -head(m) +set.seed(123) +m <- matrix(rnorm(3000), ncol = 3) +dim(m) +head(m) ``` ::::::::::::::::::::::::: @@ -641,7 +641,7 @@ OCT4 等名称或标识符。 因此,如果您总体上避免使用日期格 首先加载所需的包: ```{r loadlibridate, message=FALSE, purl=TRUE} -图书馆(“lubridate”) +library("lubridate") ``` `ymd()` 采用代表年、月、日的向量,并将 @@ -660,7 +660,7 @@ str(my_date) 现在让我们分别粘贴年份、月份和日期——我们得到相同的结果: ```{r, purl=TRUE} -# sep 表示用于分隔每个组件的字符 +# sep indicates the character to use to separate each 
component my_date <- ymd(paste("2015", "1", "1", sep = "-")) str(my_date) ``` @@ -671,9 +671,9 @@ str(my_date) ```{r, purl=TRUE} x <- data.frame(year = c(1996, 1992, 1987, 1986, 2000, 1990, 2002, 1994, 1997, 1985), - month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), - day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), - value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) + month = c(2, 3, 3, 10, 1, 8, 3, 4, 5, 5), + day = c(24, 8, 1, 5, 8, 17, 13, 10, 11, 24), + value = c(4, 5, 1, 9, 3, 8, 10, 2, 6, 7)) x ``` @@ -682,27 +682,27 @@ x 字符向量: ```{r, purl=TRUE} -粘贴(x$year,x$month,x$day,sep =“-”) +paste(x$year, x$month, x$day, sep = "-") ``` 该字符向量可用作 `ymd()` 的参数: ```{r, purl=TRUE} -ymd(粘贴(x$year, x$month, x$day, sep = "-")) +ymd(paste(x$year, x$month, x$day, sep = "-")) ``` 生成的 `Date` 向量可以添加到 `x` 作为名为 `date` 的新列: ```{r, purl=TRUE} x$date <- ymd(paste(x$year, x$month, x$day, sep = "-")) -str(x) # 注意新列,以 'date' 为类 +str(x) # notice the new column, with 'date' as the class ``` 让我们确保一切正常。 检查 新列的一种方法是使用 `summary()`: ```{r, purl=TRUE} -摘要(x$date) +summary(x$date) ``` 请注意,`ymd()` 需要按 @@ -777,7 +777,7 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. 
-:::::::::::::::::::::::::::::::::::::::: 关键点 +:::::::::::::::::::::::::::::::::::::::: keypoints - R 中的表格数据 From c373415ce4d652b8f1c98d1e9034132a49bdfa1f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 09:24:34 +0900 Subject: [PATCH 257/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 136 ++++++++++++++++---------------- 1 file changed, 68 insertions(+), 68 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 5067411dd..755a8b0c1 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -26,9 +26,9 @@ exercises: 75 ::::::::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} -如果(!file.exists(“data/rnaseq.csv”)) -下载.file(url = “https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv”, - 目标文件 = “data/rnaseq.csv”) +if (!file.exists("data/rnaseq.csv")) +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", + destfile = "data/rnaseq.csv") ``` > 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 @@ -73,7 +73,7 @@ R 会话中导入它。 尝试从库中加载以检查您是否拥有它: ```{r, message=FALSE, purl=TRUE} -## 加载 tidyverse 包,包括 dplyr +## load the tidyverse packages, incl. 
dplyr library("tidyverse") ``` @@ -95,7 +95,7 @@ BiocManager::install(“tidyverse”) ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## 查看数据 +## view the data rna ``` @@ -128,14 +128,14 @@ Tibbles 调整了我们之前在 参数是需要保留的列。 ```{r, purl=TRUE} -选择(rna、基因、样本、组织、表达) +select(rna, gene, sample, tissue, expression) ``` 要选择除某些列之外的所有列,请在变量 前面放置“-”以将其排除。 ```{r, purl=TRUE} -选择(rna,-组织,-生物体) +select(rna, -tissue, -organism) ``` 这将选择“rna”中除“tissue” @@ -144,8 +144,8 @@ Tibbles 调整了我们之前在 要根据特定标准选择行,请使用“filter()”: ```{r, purl=TRUE} -过滤器(rna,性别 == “男性”) -过滤器(rna,性别 == “男性” & 感染 == “未感染”) +filter(rna, sex == "Male") +filter(rna, sex == "Male" & infection == "NonInfected") ``` 现在让我们假设我们对该数据集中分析的小鼠 @@ -156,8 +156,8 @@ Tibbles 调整了我们之前在 “hsapiens_homolog_associated_gene_name”。 ```{r} -基因 <- 选择(rna,基因,hsapiens_homolog_associated_gene_name) -基因 +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes ``` 一些小鼠基因没有人类同源物。 可以使用 @@ -165,7 +165,7 @@ Tibbles 调整了我们之前在 某物是否为 `NA`。 ```{r, purl=TRUE} -过滤器(基因,is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` 如果我们只想保留具有人类同源物的小鼠基因,我们可以在 @@ -174,7 +174,7 @@ Tibbles 调整了我们之前在 。 ```{r, purl=TRUE} -过滤器(基因,!is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` ## 管道 @@ -187,7 +187,7 @@ Tibbles 调整了我们之前在 ```{r, purl=TRUE} rna2 <- filter(rna, sex == "Male") -rna3 <- select(rna2, 基因, 样本, 组织, 表达) +rna3 <- select(rna2, gene, sample, tissue, expression) rna3 ``` @@ -339,7 +339,7 @@ _split-apply-combine_ 范式来完成:将数据分成组,对每组应用一 ```{r} rna %>% - group_by(基因) + group_by(gene) ``` `group_by()` 函数不执行任何数据处理,它 @@ -371,8 +371,8 @@ collapses each group into a single-row summary of that group. 
```{r} rna %>% - group_by(基因) %>% - 总结(平均表达 = 平均(表达)) + group_by(gene) %>% + summarise(mean_expression = mean(expression)) ``` 我们还可能想计算每个样本中所有基因的平均表达水平: @@ -387,8 +387,8 @@ rna %>% ```{r} rna %>% - group_by(基因、感染、时间) %>% - 总结(平均表达 = 平均(表达)) + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) ``` 一旦数据被分组,您还可以在同一 @@ -397,9 +397,9 @@ rna %>% ```{r, purl=TRUE} rna %>% - group_by(基因, 感染, 时间) %>% - 总结(平均表达 = 平均(表达), - 中位数表达 = 中位数(表达)) + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -554,9 +554,9 @@ rna %>% ```{r, echo=FALSE} rna %>% - 选择(基因,样本,表达)%>% - pivot_wider(names_from = 样本, - values_from = 表达) + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) ``` 为了将基因表达值从“rna”转换为宽格式 @@ -583,7 +583,7 @@ rna %>% ```{r, purl=TRUE} rna_exp <- rna %>% - 选择(基因,样本,表达) + select(gene, sample, expression) rna_exp ``` @@ -615,11 +615,11 @@ rna_wide ```{r, purl=TRUE} rna_with_missing_values <- rna %>% - 选择(基因,样本,表达)%>% - 过滤器(基因 %in% c(“Asl”,“Apod”,“Cyp2d22”))%>% - 过滤器(样本 %in% c(“GSM2545336”,“GSM2545337”,“GSM2545338”))%>% - 安排(样本)%>% - 过滤器(!(基因 == “Cyp2d22” & 样本 != “GSM2545338”)) + select(gene, sample, expression) %>% + filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% + filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% + arrange(sample) %>% + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) rna_with_missing_values ``` @@ -729,17 +729,17 @@ wide_with_NA %>% ```{r, answer=TRUE, purl=TRUE} rna1 <- rna %>% -选择(基因,小鼠,表达)%>% -pivot_wider(names_from = 小鼠,values_from = 表达) +select(gene, mouse, expression) %>% +pivot_wider(names_from = mouse, values_from = expression) rna1 rna1 %>% -pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) +pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ``` ::::::::::::::::::::::::: 
-::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -751,7 +751,7 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) 为值,将数据框扩展到以下表: ```{r, echo=FALSE, message=FALSE} -knitr::include_graphics(“fig/Exercise_pivot_W.png”) +knitr::include_graphics("fig/Exercise_pivot_W.png") ``` 重塑之前需要先总结一下! @@ -765,20 +765,20 @@ knitr::include_graphics(“fig/Exercise_pivot_W.png”) ```{r} rna %>% - 过滤器(chromosome_name == “Y” | chromosome_name == “X”)%>% - group_by(性别,chromosome_name)%>% - 总结(平均值 = 平均值(表达式)) + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) ``` 并将表格旋转至宽格式 ```{r, answer=TRUE, purl=TRUE} rna_1 <- rna %>% - 过滤器(chromosome_name == "Y" | chromosome_name == "X") %>% - 分组(性别,chromosome_name) %>% - 总结(平均值 = 平均值(表达式)) %>% - 枢轴_宽(names_from = 性别, - values_from = 平均值) + filter(chromosome_name == "Y" | chromosome_name == "X") %>% + group_by(sex, chromosome_name) %>% + summarise(mean = mean(expression)) %>% + pivot_wider(names_from = sex, + values_from = mean) rna_1 ``` @@ -788,8 +788,8 @@ rna_1 ```{r, answer=TRUE, purl=TRUE} rna_1 %>% - pivot_longer(names_to = "性别", - values_to = "平均值", + pivot_longer(names_to = "gender", + values_to = "mean", -chromosome_name) ``` @@ -814,18 +814,18 @@ rna_1 %>% ```{r} rna %>% - group_by(基因,时间) %>% - 总结(mean_exp = mean(表达式)) + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) ``` 在使用pivot_wider()函数之前 ```{r} rna_time <- rna %>% - group_by(基因,时间) %>% - 总结(平均值表达式 = 平均值(表达式)) %>% - pivot_wider(names_from = 时间, - values_from = 平均值表达式) + group_by(gene, time) %>% + summarise(mean_exp = mean(expression)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) rna_time ``` @@ -886,24 +886,24 @@ rna %>% 从 rna\_time tibble 开始: ```{r} -RNA时间 +rna_time ``` 计算倍数变化: ```{r} rna_time %>% - 突变(time_8_vs_0 = `8` / `0`,time_8_vs_4 = `8` / 
`4`) + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) ``` 并使用pivot\_longer()函数: ```{r} rna_time %>% - 突变(time_8_vs_0 = `8` / `0`,time_8_vs_4 = `8` / `4`)%>% - pivot_longer(names_to = "comparisons", + mutate(time_8_vs_0 = `8` / `0`, time_8_vs_4 = `8` / `4`) %>% + pivot_longer(names_to = "comparisons", values_to = "Fold_changes", - time_8_vs_0:time_8_vs_4) + time_8_vs_0:time_8_vs_4) ``` ::::::::::::::::::::::::: @@ -935,8 +935,8 @@ collected from different sources. ```{r} rna_mini <- rna %>% - 选择(基因,样本,表达) %>% - 头部(10) + select(gene, sample, expression) %>% + head(10) rna_mini ``` @@ -947,9 +947,9 @@ gene\_description。 您可以通过单击链接然后将其移动到“data/” 您可以使用下面的 R 代码将其直接下载到文件夹。 ```{r, message=FALSE} -下载文件(url = “https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv”, - destfile = “data/annot1.csv”) -annot1 <- read_csv(file = “data/annot1.csv”) +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv", + destfile = "data/annot1.csv") +annot1 <- read_csv(file = "data/annot1.csv") annot1 ``` @@ -961,7 +961,7 @@ annot1 观察结果。 ```{r} -全连接(rna_mini,annot1) +full_join(rna_mini, annot1) ``` 在现实生活中,基因注释有时会被标记不同。 @@ -972,9 +972,9 @@ annot1 并将其移动到 `data/`,要么使用下面的 R 代码。 ```{r, message=FALSE} -下载文件(url = “https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv”, - 目标文件 = “data/annot2.csv”) -annot2 <- read_csv(file = “data/annot2.csv”) +download.file(url = "https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv", + destfile = "data/annot2.csv") +annot2 <- read_csv(file = "data/annot2.csv") annot2 ``` @@ -983,7 +983,7 @@ annot2 `by` 参数设置这些变量,如下面 `rna_mini` 和 `annot2` 表所示。 ```{r} -full_join(rna_mini,annot2,by = c(“基因”=“external_gene_name”)) +full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) ``` 从上可以看出,第一个表的变量名在连接后的表中保留为 @@ -1040,7 +1040,7 @@ _mt-Rnr2_ 和 _mt-Tl1_ 仅存在于 `annot3` 表中。 表中 write_csv(rna_wide,文件 = “data_output/rna_wide.csv”) ``` -:::::::::::::::::::::::::::::::::::::::: 关键点 
+:::::::::::::::::::::::::::::::::::::::: keypoints - 使用 tidyverse 元包在 R 中生成表格数据 From 9f44302cd37b508b469b0a227f19ed35a089cd31 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 09:24:42 +0900 Subject: [PATCH 258/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 130 ++++++++++++------------ 1 file changed, 65 insertions(+), 65 deletions(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index 44564a699..4eeb4614e 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -166,18 +166,18 @@ ggplot(rna,aes(x = 表达式)) + ## 解决方案 ```{r, purl=TRUE} -# 更改箱体 +# change bins ggplot(rna, aes(x = expression)) + geom_histogram(bins = 15) -# 更改箱宽 +# change binwidth ggplot(rna, aes(x = expression)) + geom_histogram(binwidth = 2000) ``` ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: 我们可以在这里观察到数据向右倾斜。 我们可以应用 log2 变换来获得更加对称的分布。 请注意,我们 @@ -186,14 +186,14 @@ log2 变换来获得更加对称的分布。 请注意,我们 ```{r log-transfo, cache=FALSE, purl=TRUE} rna <- rna %>% - 突变(expression_log = log2(expression + 1)) + mutate(expression_log = log2(expression + 1)) ``` 如果我们现在绘制 log2 变换表达式的直方图, 分布确实更接近正态分布。 ```{r second-ggplot, cache=FALSE, purl=TRUE} -ggplot(rna,aes(x = expression_log)) + geom_histogram() +ggplot(rna, aes(x = expression_log)) + geom_histogram() ``` 从现在开始我们将研究对数转换的表达值。 @@ -225,7 +225,7 @@ ggplot(data = rna,mapping = aes(x = expression))+ ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: **笔记** @@ -241,11 +241,11 @@ ggplot(data = rna,mapping = aes(x = expression))+ 消息。 ```{r, eval=FALSE} -# 这是添加层的正确语法 +# This is the correct syntax for adding layers rna_plot + geom_histogram() -# 这不会添加新层并将返回错误消息 +# This will not add the new layer and will return an error 
message rna_plot + geom_histogram() ``` @@ -263,13 +263,13 @@ rna_plot 倍数变化保存在名为“rna_fc”的新数据框中。 ```{r rna_fc, cache=FALSE, purl=TRUE} -rna_fc <- rna %>% 选择(基因,时间, - 基因生物型,表达日志)%>% - group_by(基因,时间,基因生物型)%>% - 总结(平均值表达式 = 平均值(表达日志))%>% - pivot_wider(names_from = 时间, - values_from = 平均值表达式)%>% - 突变(time_8_vs_0 = `8` - `0`,time_4_vs_0 = `4` - `0`) +rna_fc <- rna %>% select(gene, time, + gene_biotype, expression_log) %>% + group_by(gene, time, gene_biotype) %>% + summarize(mean_exp = mean(expression_log)) %>% + pivot_wider(names_from = time, + values_from = mean_exp) %>% + mutate(time_8_vs_0 = `8` - `0`, time_4_vs_0 = `4` - `0`) ``` @@ -340,7 +340,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, ``` ```{r, echo=FALSE, message=FALSE} -库(“hexbin”) +library("hexbin") ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -369,7 +369,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, ## 解决方案 ```{r, eval=FALSE, purl=TRUE} -安装.包(“hexbin”) +install.packages("hexbin") ``` ```{r, purl=TRUE} @@ -383,7 +383,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -404,7 +404,7 @@ ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 箱形图 @@ -449,7 +449,7 @@ ggplot(data = rna, ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: 您可能会注意到 x 轴上的值仍然无法正确读取 。 让我们改变标签的方向并垂直和水平调整它们 @@ -500,7 +500,7 @@ ggplot(data = rna, ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -527,7 
+527,7 @@ ggplot(data = rna, ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: challenge @@ -548,7 +548,7 @@ ggplot(data = rna, ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 线图 @@ -559,16 +559,16 @@ ggplot(data = rna, 并计算每个组内的平均基因表达: ```{r, purl=TRUE} -rna_fc <- rna_fc %>% 排列(desc(time_8_vs_0)) +rna_fc <- rna_fc %>% arrange(desc(time_8_vs_0)) genes_selected <- rna_fc$gene[1:10] sub_rna <- rna %>% - 过滤(基因 %in% genes_selected) + filter(gene %in% genes_selected) mean_exp_by_time <- sub_rna %>% - group_by(基因,时间) %>% - 总结(mean_exp = mean(expression_log)) + group_by(gene,time) %>% + summarize(mean_exp = mean(expression_log)) mean_exp_by_time ``` @@ -632,7 +632,7 @@ ggplot(data = mean_exp_by_time, ```{r data-facet-by-gene-and-sex, purl=TRUE} mean_exp_by_time_sex <- sub_rna %>% group_by(gene, time, sex) %>% - 总结(mean_exp = mean(expression_log)) + summarize(mean_exp = mean(expression_log)) mean_exp_by_time_sex ``` @@ -685,7 +685,7 @@ ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: `facet_wrap` 几何将图提取到任意数量的 维度中,以使它们能够整齐地放在一页上。 另一方面, @@ -697,17 +697,17 @@ ggplot(data = mean_exp_by_chromosome, mapping = aes(x = time, 随时间的变化: ```{r mean-exp-time-facet-sex-rows, purl=TRUE} -# 一列,按行细分 +# One column, facet by rows ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = gene)) + + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + facet_grid(sex ~ .) 
``` ```{r mean-exp-time-facet-sex-columns, purl=TRUE} -# 一行,逐列 +# One row, facet by column ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = gene)) + + mapping = aes(x = time, y = mean_exp, color = gene)) + geom_line() + facet_grid(. ~ sex) ``` @@ -742,12 +742,12 @@ ggplot(data = mean_exp_by_time_sex, ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "按感染持续时间划分的平均基因表达", - x = "感染持续时间(天)", - y = "平均基因表达") + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") ``` 轴具有更多信息名称,但可以通过增加字体大小来提高其可读性: @@ -756,12 +756,12 @@ ggplot(data = mean_exp_by_time_sex, ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "按感染持续时间划分的平均基因表达", - x = "感染持续时间(天)", - y = "平均基因表达") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + theme(text = element_text(size = 16)) ``` @@ -776,17 +776,17 @@ ggplot(data = mean_exp_by_time_sex, ggplot(data = mean_exp_by_time_sex, mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + - facet_wrap(~gene, scales = "free_y") + + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - labs(title = "按感染持续时间划分的平均基因表达", - x = "感染持续时间(天)", - y = "平均基因表达") + + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + theme(text = element_text(size = 16), - axis.text.x = element_text(colour = “royalblue4”, size = 12), - 
axis.text.y = element_text(colour = “royalblue4”, size = 12), + axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), panel.grid = element_line(colour="lightsteelblue1"), - legend.position = “top”) + legend.position = "top") ``` 如果您比默认主题更喜欢您所做的更改,您可以 @@ -842,41 +842,41 @@ ggplot(data = mean_exp_by_time_sex, 我们可以通过以下方式定制它: ```{r, purl=TRUE} -# 更改线条的粗细 +# change the thickness of the lines ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line(size=1.5) + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) -# 更改图例和标签的名称 +# change the name of the legend and the labels ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + scale_color_discrete(name = "Gender", labels = c("F", "M")) -# 使用不同的调色板 +# using a different color palette ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") -# 手动指定颜色 +# manually specifying the colors ggplot(data = mean_exp_by_time_sex, - map = aes(x = time, y = mean_exp, color = sex)) + + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_manual(name = "性别", 标签 = c("F", "M"), - 值 = c("royalblue", "deeppink")) + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) ``` @@ -900,14 +900,14 @@ ggplot(data = 
mean_exp_by_time_sex, log10 比例,以提高可读性。 ```{r sub1, purl=TRUE} -rna$chromosome_name <- 因子 (rna$chromosome_name, - 水平 = c (1:19,"X","Y")) +rna$chromosome_name <- factor(rna$chromosome_name, + levels = c(1:19,"X","Y")) -count_gene_chromosome <- rna %>% 选择 (chromosome_name, 基因) %>% +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% distinct() %>% ggplot() + - geom_bar(aes(x = chromosome_name), 填充 = "seagreen", - 位置 = "dodge", stat = "count") + - 实验室 (y = "log10(n 基因)", x = "染色体") + + geom_bar(aes(x = chromosome_name), fill = "seagreen", + position = "dodge", stat = "count") + + labs(y = "log10(n genes)", x = "chromosome") + scale_y_log10() count_gene_chromosome @@ -933,7 +933,7 @@ exp_boxplot_sex 。 ```{r install-patchwork, message=FALSE, eval=FALSE, purl=TRUE} -安装.packages(“patchwork”) +install.packages("patchwork") ``` ```{r patchworkplot1, purl=TRUE} From 315fb65b60309a2c26b417535fc3c84b06593361 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 09:24:48 +0900 Subject: [PATCH 259/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 60 ++++++++++++++-------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd index 59ff8c466..0f2424881 100644 --- a/locale/zh/episodes/60-next-steps.Rmd +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -27,7 +27,7 @@ exercises: 四十五 ## 下一步 ```{r, echo=FALSE, message=FALSE} -图书馆(“tidyverse”) +library("tidyverse") ``` 生物信息学中的数据通常很复杂。 为了解决这个问题, @@ -94,33 +94,33 @@ SummarizedExperiment 类的对象包含: ```{r, echo=FALSE, message=FALSE} rna <- read_csv("data/rnaseq.csv") -## 计数矩阵 +## count matrix counts <- rna %>% select(gene, sample, expression) %>% pivot_wider(names_from = sample, values_from = expression) -## 转换为矩阵并设置行名称 +## convert to matrix and set row names count_matrix <- counts %>% select(-gene) %>% as.matrix() rownames(count_matrix) <- counts$gene 
-## 样本注释 +## sample annotation sample_metadata <- rna %>% - 选择(样本、生物体、年龄、性别、感染、菌株、时间、组织、小鼠) + select(sample, organism, age, sex, infection, strain, time, tissue, mouse) -## 消除冗余 +## remove redundancy sample_metadata <- unique(sample_metadata) -## 基因注释 +## gene annotation gene_metadata <- rna %>% - 选择(基因、ENTREZID、产品、ensembl_gene_id、external_synonym、 - chromosome_name、gene_biotype、phenotype_description、 - hsapiens_homolog_associated_gene_name) + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, + chromosome_name, gene_biotype, phenotype_description, + hsapiens_homolog_associated_gene_name) -# 消除冗余 +# remove redundancy gene_metadata <- unique(gene_metadata) -## 写入到 csv +## write to csv write.csv(count_matrix, file = "data/count_matrix.csv") write.csv(gene_metadata, file = "data/gene_metadata.csv", row.names = FALSE) write.csv(sample_metadata, file = "data/sample_metadata.csv", row.names = FALSE) @@ -224,22 +224,22 @@ preferred approach. 如果需要与不使用 R 的人 `assay` 函数访问表达矩阵: ```{r} -头部(测定(se)) -暗淡(测定(se)) +head(assay(se)) +dim(assay(se)) ``` 我们可以使用“colData”函数访问样本元数据: ```{r} colData(se) -暗淡(colData(se)) +dim(colData(se)) ``` 我们还可以使用“rowData”函数访问特征元数据: ```{r} -头(rowData(se)) -dim(rowData(se)) +head(rowData(se)) +dim(rowData(se)) ``` ### 对 SummarizedExperiment 进行子集设置 @@ -256,8 +256,8 @@ se1 ``` ```{r} -colData(se1) -rowData(se1) +colData(se1) +rowData(se1) ``` 我们还可以使用 `colData()` 函数从 @@ -269,7 +269,7 @@ rowData(se1) se1 <- se[rowData(se)$gene_biotype == "miRNA", colData(se)$infection == "NonInfected"] se1 -analysis(se1) +assay(se1) colData(se1) rowData(se1) ``` @@ -322,8 +322,8 @@ assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ```{r, purl=FALSE} rna |> - 过滤器(基因 %in% c("Asl", "Apod", "Cyd2d22")) |> - 过滤器(时间 != 4)|> 选择(表达式) + filter(gene %in% c("Asl", "Apod", "Cyd2d22")) |> + filter(time != 4) |> select(expression) ``` ::::::::::::::::::::::::: @@ -343,7 +343,7 @@ rna |> 假设您想添加收集样本的中心…… ```{r} -colData(se)$center <- rep("伊利诺伊大学", nrow(colData(se))) 
+colData(se)$center <- rep("University of Illinois", nrow(colData(se))) colData(se) ``` @@ -359,7 +359,7 @@ affecting the other structures! 记住我们的 SummarizedExperiment 对象是什么样的: ```{r, message=FALSE} -塞 +se ``` 加载“tidySummarizedExperiment”,然后再次查看 se 对象 @@ -382,14 +382,14 @@ se 可以这样做。 ```{r} -选项(“restore_SummarizedExperiment_show” = TRUE) +options("restore_SummarizedExperiment_show" = TRUE) se ``` 但这里我们使用 tibble 视图。 ```{r} -选项(“restore_SummarizedExperiment_show” = FALSE) +options("restore_SummarizedExperiment_show" = FALSE) se ``` @@ -400,19 +400,19 @@ se 一个样本的所有行。 ```{r} -se %>% 过滤器(.sample == “GSM2545336”) +se %>% filter(.sample == "GSM2545336") ``` 我们可以使用“select”来指定我们想要查看的列。 ```{r} -se %>% 选择(.sample) +se %>% select(.sample) ``` 我们可以使用“mutate”来添加元数据信息。 ```{r} -se %>% mutate(center = "海德堡大学") +se %>% mutate(center = "Heidelberg University") ``` 我们还可以将命令与 tidyverse 管道“%>%”组合起来。 对于 @@ -422,7 +422,7 @@ se %>% mutate(center = "海德堡大学") ```{r} se %>% group_by(.sample) %>% - 汇总(total_counts=sum(counts)) + summarise(total_counts=sum(counts)) ``` 我们可以将整洁的 SummarizedExperiment 对象视为用于绘图的正常 tibble From 7909ff8984533e4b936e02207790466d46ce0e12 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:18:07 +0900 Subject: [PATCH 260/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 755a8b0c1..4a7d1f2f6 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -432,7 +432,7 @@ rna %>% ```{r, purl=TRUE} rna %>% - 计数(感染) + count(infection) ``` `count()` 函数是我们已经见过的功能的简写:按变量分组,并通过计算该组中的观察次数进行汇总。 换句话说,`rna %>% count(infection)` 等同于: @@ -450,15 +450,15 @@ rna %>% ```{r, purl=TRUE} rna %>% - 计数(感染,时间) + count(infection, time) ``` 这相当于: ```{r, purl=TRUE} rna %>% - group_by(感染,时间) %>% - 总结(n = n()) + group_by(infection, time) 
%>% + summarise(n = n()) ``` 有时对结果进行排序以方便比较是很有用的。 @@ -467,24 +467,24 @@ rna %>% ```{r, purl=TRUE} rna %>% - 计数(感染,时间)%>% - 安排(时间) + count(infection, time) %>% + arrange(time) ``` 或按计数: ```{r, purl=TRUE} rna %>% - 计数(感染,时间)%>% - 安排(n) + count(infection, time) %>% + arrange(n) ``` 为了按降序排序,我们需要添加 `desc()` 函数: ```{r, purl=TRUE} rna %>% - count(感染,时间) %>% - 排列(desc(n)) + count(infection, time) %>% + arrange(desc(n)) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -537,7 +537,7 @@ the sample (organism, age, sex, ...) 或基因(gene_biotype、ENTREZ_ID、prod ```{r} rna %>% - 排列(基因) + arrange(gene) ``` 这种结构称为“长格式”,因为一列包含所有值, From 98369a2bef5b495725b9309b40796e6b1afb7b2d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:12 +0900 Subject: [PATCH 261/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd index bc4cf8c17..3080147ee 100644 --- a/locale/zh/episodes/10-data-organisation.Rmd +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -712,7 +712,7 @@ the file export. ! 
```{r, results="markup", fig.cap="Saving an Excel file to CSV.", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics(“fig/excel-to-csv.png”) +knitr::include_graphics("fig/excel-to-csv.png") ``` **关于 R 和 `xls`** 的注释:有一些 R 包可以读取 `xls` From ccc34593e2814263bead46fe859ec27d238813fb Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:20 +0900 Subject: [PATCH 262/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd index b4721abfa..e6c576f75 100644 --- a/locale/zh/episodes/20-r-rstudio.Rmd +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -149,7 +149,7 @@ RStudio IDE 还提供商业许可和 Posit, Inc. 的 我们不会在研讨会期间介绍这些事项。 ```{r, results="markup", fig.cap="RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.", echo=FALSE, purl=FALSE, out.width="100%", fig.align="center"} -knitr::include_graphics(“fig/rstudio-screenshot.png”) +knitr::include_graphics("fig/rstudio-screenshot.png") ``` RStudio 窗口分为 4 个“窗格”: @@ -203,7 +203,7 @@ RStudio 的默认首选项通常运行良好,但将工作区保存到 以将“工作区保存到.RData”。 ```{r, results="markup", fig.cap="Set 'Save workspace to .RData on exit' to 'Never'", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics(“fig/rstudio-preferences.png”) +knitr::include_graphics("fig/rstudio-preferences.png") ``` 为了避免 Windows 与其他操作系统 @@ -211,7 +211,7 @@ knitr::include_graphics(“fig/rstudio-preferences.png”) 将默认设置 UTF-8: ```{r, results="markup", fig.cap="Set the default text encoding to UTF-8 to save us headache in the coming future. 
(Figure from the link above).", echo=FALSE, purl=FALSE, out.width="70%", fig.align="center"} -knitr::include_graphics(“fig/utf8.png”) +knitr::include_graphics("fig/utf8.png") ``` ### 组织你的工作目录 @@ -293,7 +293,7 @@ knitr::include_graphics("fig/r-starting-how-it-should-look-like.png") > 脚本调用。 ```{r bioinfoproj, fig.cap="Directory structure for a sample bioinformatics project.", out.width="100%", echo=FALSE} -knitr::include_graphics(“fig/noble-bioinfo-project.png”) +knitr::include_graphics("fig/noble-bioinfo-project.png") ``` 定义明确、记录良好的 From 2fff77a90892a0aead6df3b07a552093ec8cf809 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:28 +0900 Subject: [PATCH 263/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 42 +++++++++++------------ 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index e7cfe0f99..77004d698 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -677,7 +677,7 @@ heights_no_na <- na.omit(heights) ``` ```{r, purl=TRUE} -中位数(高度,na.rm = TRUE) +median(heights, na.rm = TRUE) ``` ```{r, purl=TRUE} @@ -692,7 +692,7 @@ heights_no_na <- na.omit(heights) ## 生成向量 {#sec:genvec} ```{r, echo=FALSE} -设置种子(1) +set.seed(1) ``` ### 构造函数 @@ -703,15 +703,15 @@ heights_no_na <- na.omit(heights) 参数。 这些值将被初始化为 0。 ```{r, purl=TRUE} -数字(3) -数字(10) +numeric(3) +numeric(10) ``` 请注意,如果我们要求长度为 0 的数字向量,我们将获得 : ```{r, purl=TRUE} -数字(0) +numeric(0) ``` 字符和逻辑值有类似的构造函数,分别名为 @@ -743,7 +743,7 @@ logical(2) ## FALSE 值 -1 来初始化一个长度为 5 的数字向量,我们可以执行以下操作: ```{r, purl=TRUE} -代表(-1,5) +rep(-1, 5) ``` 类似地,要生成一个填充了缺失值的向量, @@ -751,7 +751,7 @@ logical(2) ## FALSE 设定假设: ```{r, purl=TRUE} -代表(NA,5) +rep(NA, 5) ``` `rep` 可以将任意长度的向量作为输入(上面,我们使用了长度为 1 的向量 @@ -759,7 +759,7 @@ logical(2) ## FALSE 值 1、2 和 3 五次,我们可以执行以下操作: ```{r, purl=TRUE} 
-代表(c(1,2,3),5) +rep(c(1, 2, 3), 5) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -790,7 +790,7 @@ sort(rep(c(1, 2, 3), 5)) 的整数序列,可以使用: ```{r, purl=TRUE} -seq(从 = 1,到 = 20,按 = 2) +seq(from = 1, to = 20, by = 2) ``` `by` 的默认值为 1,并且鉴于经常使用以 1 为步长生成一个值到另一个值的 @@ -807,7 +807,7 @@ seq(1, 5) ## 默认为 的从 1 到 20 的数字序列,可以使用: ```{r, purl=TRUE} -seq(从 = 1,到 = 20,长度.out = 3) +seq(from = 1, to = 20, length.out = 3) ``` ### 随机样本和排列 @@ -819,7 +819,7 @@ seq(从 = 1,到 = 20,长度.out = 3) 根据他们姓名的字母顺序排列),然后: ```{r, purl=TRUE} -样品(1:10) +sample(1:10) ``` 如果没有进一步的参数,“sample”将返回向量中所有 @@ -828,7 +828,7 @@ seq(从 = 1,到 = 20,长度.out = 3) 个字母: ```{r, purl=TRUE} -样本(字母,5) +sample(letters, 5) ``` 如果我想要一个大于输入向量的输出,或者能够 @@ -836,7 +836,7 @@ seq(从 = 1,到 = 20,长度.out = 3) 参数设置为`TRUE`: ```{r, purl=TRUE} -样本(1:5,10,替换=TRUE) +sample(1:5, 10, replace = TRUE) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -865,8 +865,8 @@ before drawing the random sample. 不同的排列 ```{r, purl=TRUE} -样品(1:10) -样品(1:10) +sample(1:10) +sample(1:10) ``` 与种子 123 相同的排列 @@ -881,10 +881,10 @@ sample(1:10) 不同的种子 ```{r, purl=TRUE} -设置.种子(1) -样本(1:10) -设置.种子(1) -样本(1:10) +set.seed(1) +sample(1:10) +set.seed(1) +sample(1:10) ``` ::::::::::::::::::::::::: @@ -900,8 +900,8 @@ _N(100, 5)_。 ```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."} par(mfrow = c(1, 2)) -图(密度(rnorm(1000)),main = "", sub = "N(0, 1)") -图(密度(rnorm(1000, 100, 5)),main = "", sub = "N(100, 5)") +plot(density(rnorm(1000)), main = "", sub = "N(0, 1)") +plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)") ``` 三个参数 `n`、`mean` 和 `sd` 定义了 From 2050372dac650f529b3b6e386fc16b600a735514 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:37 +0900 Subject: [PATCH 264/334] New translations 25-starting-with-data.md (Chinese Simplified) --- locale/zh/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 
deletion(-) diff --git a/locale/zh/episodes/25-starting-with-data.Rmd b/locale/zh/episodes/25-starting-with-data.Rmd index e8c1b4ec4..e4a1705c7 100644 --- a/locale/zh/episodes/25-starting-with-data.Rmd +++ b/locale/zh/episodes/25-starting-with-data.Rmd @@ -74,7 +74,7 @@ CSV 文件,并使用 ```{r, eval=TRUE} download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", - destfile = "data/rnaseq.csv" ) + destfile = "data/rnaseq.csv") ``` 您现在可以加载数据了: From 6806262656f2843d857a751cec0198b38fc59322 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:41 +0900 Subject: [PATCH 265/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 36 ++++++++++++++++----------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 4a7d1f2f6..ec9464796 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -229,7 +229,7 @@ R 中的管道看起来像 `%>%`(通过\*\*`magrittr`\*\* ```{r, purl=TRUE} rna %>% filter(sex == "Male") %>% - select(基因,样本,组织,表达) + select(gene, sample, tissue, expression) ``` 有些人可能会发现将管道读成“then”这个词很有帮助。 例如,在上面的例子中, @@ -266,10 +266,10 @@ rna3 ```{r} rna %>% - 过滤器(表达 > 50000、 - 性别 == “女性”、 - 时间 == 0 )%>% - 选择(基因、样本、时间、表达、年龄) + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) ``` ::::::::::::::::::::::::: @@ -286,17 +286,17 @@ rna %>% ```{r, purl=TRUE} rna %>% - 突变(time_hours = time * 24)%>% - 选择(时间,time_hours) + mutate(time_hours = time * 24) %>% + select(time, time_hours) ``` 您还可以在同一个 `mutate()` 调用中根据第一个新列创建第二个新列: ```{r, purl=TRUE} rna %>% - 突变(time_hours = time * 24, - time_mn = time_hours * 60)%>% - 选择(时间,time_hours,time_mn) + mutate(time_hours = time * 24, + time_mn = time_hours * 60) %>% + select(time, time_hours, time_mn) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -319,11 
+319,11 @@ rna %>% ```{r, eval=TRUE, purl=TRUE} rna %>% - 突变(表达式 = log(表达式))%>% - 选择(基因、染色体名称、表型描述、样本、表达)%>% - 过滤器(染色体名称 == “X” | 染色体名称 == “Y”)%>% - 过滤器(!is.na(表型描述))%>% - 过滤器(表达式 > 5) + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) ``` ::::::::::::::::::::::::: @@ -414,9 +414,9 @@ rna %>% ```{r, purl=TRUE} rna %>% - 过滤器(基因 == “Dok3”)%>% - group_by(时间)%>% - 总结(平均值 = 平均值(表达)) + filter(gene == "Dok3") %>% + group_by(time) %>% + summarise(mean = mean(expression)) ``` ::::::::::::::::::::::::: From be658b217484635cc7fa71212bcf3e10c983d39e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:30:50 +0900 Subject: [PATCH 266/334] New translations 40-visualization.md (Chinese Simplified) --- locale/zh/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/40-visualization.Rmd b/locale/zh/episodes/40-visualization.Rmd index 4eeb4614e..67376333e 100644 --- a/locale/zh/episodes/40-visualization.Rmd +++ b/locale/zh/episodes/40-visualization.Rmd @@ -154,7 +154,7 @@ rna_plot + geom_histogram() 绘制直方图时会出现一条自动消息: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna,aes(x = 表达式)) + +ggplot(rna, aes(x = expression)) + geom_histogram() ``` From bbda52972e2ed8ca0aee0a767813b874760c92b9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:48:29 +0900 Subject: [PATCH 267/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index b0c517156..1d8e0ec4b 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ 
b/locale/ja/episodes/23-starting-with-r.Rmd @@ -534,11 +534,11 @@ weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] のようにします。 ```{r, purl=TRUE} -## は、 -を満たすインデックスに対して TRUE の論理値を返します。 ## 条件 -Weight_g > 50 -## したがって、これを使用して 50 -Weight_g[weight_g > 50] を超える値のみを選択できます。 +## will return logicals with TRUE for the indices that meet +## the condition +weight_g > 50 +## so we can use this to select only the values above 50 +weight_g[weight_g > 50] ``` `&` (両方の条件が true、 From 039632b68721d39d00caaabf391a861d5d5e7f64 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:48:33 +0900 Subject: [PATCH 268/334] New translations 25-starting-with-data.md (Japanese) --- locale/ja/episodes/25-starting-with-data.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/ja/episodes/25-starting-with-data.Rmd b/locale/ja/episodes/25-starting-with-data.Rmd index 4da684c75..45eb1f894 100644 --- a/locale/ja/episodes/25-starting-with-data.Rmd +++ b/locale/ja/episodes/25-starting-with-data.Rmd @@ -286,16 +286,16 @@ RStudio では、オートコンプリート機能を使用して、列の完全 ## ソリューション ```{r, purl=TRUE} -## +## 1. rna_200 <- rna[200, ] ## 2. -## +## Saving `n_rows` to improve readability and reduce duplication n_rows <- nrow(rna) rna_last <- rna[n_rows, ] ## 3. rna_middle <- rna[n_rows / 2, ] ## 4. 
-rna_head <- rna[-(7:n_rows), ]。 +rna_head <- rna[-(7:n_rows), ] ``` ::::::::::::::::::::::::: @@ -497,7 +497,7 @@ country_climate <- data.frame( country_climate <- data.frame( country = c("Canada", "Panama", "South Africa", "Australia"), climate = c("cold", "hot", "temperate", "hot/temperate"), - temperature = c(10、30, 18, "15"), + temperature = c(10, 30, 18, "15"), northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo = c(FALSE, FALSE, FALSE, 1) ) @@ -584,7 +584,7 @@ colnames(ip) ```{r rnormmat_sln, purl=TRUE} set.seed(123) -m <- matrix(rnorm(3000, ncol = 3) +m <- matrix(rnorm(3000), ncol = 3) dim(m) head(m) ``` @@ -734,7 +734,7 @@ l <- list(1:10, ## numeric letters, ## character installed.packages(), ## a matrix cars, ## a data.frame - list(1, 2, 3)).## リスト + list(1, 2, 3)) ## a list length(l) str(l) ``` From a812301e01ca9353afe3077b05fd91e37a6af385 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:48:38 +0900 Subject: [PATCH 269/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index a347645ef..e81d6a514 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -443,14 +443,14 @@ rna %>% ```{r, purl=TRUE} rna %>% - count(感染、時間) + count(infection, time) ``` これと等価である: ```{r, purl=TRUE} rna %>% - group_by(感染、時間) %>% + group_by(infection, time) %>% summarise(n = n()) ``` @@ -589,14 +589,14 @@ pivot_wider\`は主に3つの引数を取る: \`\`{r, fig.cap="`rna`データのワイドピボット。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -``` +```` -``{r, purl=TRUE}. 
-rna_wide<- rna_exp %>% +```{r, purl=TRUE} +rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -``` +```` デフォルトでは、`pivot_wider()` 関数は欠損値に対して `NA` を追加することに注意してください。 @@ -605,12 +605,12 @@ rna_wide 。 ```{r, purl=TRUE} -rna_with_missing_values<- rna %>% +rna_with_missing_values <- rna %>% select(gene, sample, expression) %>% filter(gene %in% c("Asl", "Apod", "Cyp2d22")) %>% filter(sample %in% c("GSM2545336", "GSM2545337", "GSM2545338")) %>% arrange(sample) %>% - filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) + filter(!(gene == "Cyp2d22" & sample != "GSM2545338")) rna_with_missing_values ``` @@ -677,7 +677,7 @@ rna_long rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", - cols = starts_with("GSM") + cols = starts_with("GSM")) rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", From 55263dace979497c1461398b225e756820e749e2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:48:43 +0900 Subject: [PATCH 270/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 88 ++++++++++++------------- 1 file changed, 44 insertions(+), 44 deletions(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index 9d5439486..b7523c920 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -126,7 +126,7 @@ ggplot(data = rna, mapping = aes(x = expression)) `geom_histogram()` をまず使ってみよう: ```{r first-ggplot, cache=FALSE, purl=TRUE} -ggplot(data = rna, mapping = aes(x = expression))+ +ggplot(data = rna, mapping = aes(x = expression)) + geom_histogram() ``` @@ -136,11 +136,11 @@ ggplot2`パッケージの`+`は特に便利で、 、さまざまなタイプのプロットを便利に調べることができる: ```{r, eval=FALSE, purl=TRUE} -# プロットを変数に代入 +# Assign plot to a variable rna_plot <- ggplot(data = rna, mapping = aes(x = expression)) -# プロットを描く +# Draw the plot rna_plot + geom_histogram() ``` @@ -151,7 
+151,7 @@ rna_plot + geom_histogram() ヒストグラムを描画するときに表示される自動メッセージにお気づきでしょう: ```{r, echo=FALSE, fig.show="hide"} -ggplot(rna, aes(x = expression))+ +ggplot(rna, aes(x = expression)) + geom_histogram() ``` @@ -163,12 +163,12 @@ geom_histogram()`の引数`bins`または`binwidth\` を変更して、 ## ソリューション ```{r, purl=TRUE} -# ビンを変更する -ggplot(rna, aes(x = expression))+ +# change bins +ggplot(rna, aes(x = expression)) + geom_histogram(bins = 15) -# binwidth を変更 -ggplot(rna, aes(x = expression))+ +# change binwidth +ggplot(rna, aes(x = expression)) + geom_histogram(binwidth = 2000) ``` @@ -190,7 +190,7 @@ rna<- rna %>% の分布は確かに正規分布に近くなっている。 ```{r second-ggplot, cache=FALSE, purl=TRUE} -ggplot(rna, aes(x = expression_log))+ geom_histogram() +ggplot(rna, aes(x = expression_log)) + geom_histogram() ``` これからは対数変換した発現値を扱うことにする。 @@ -259,9 +259,9 @@ rna_plot フォールドの変化を `rna_fc.` という新しいデータフレームに保存する。 ```{r rna_fc, cache=FALSE, purl=TRUE} -rna_fc<- rna %>% select(gene, time, +rna_fc <- rna %>% select(gene, time, gene_biotype, expression_log) %>% - group_by(gene, time、gene_biotype) %>% + group_by(gene, time, gene_biotype) %>% summarize(mean_exp = mean(expression_log)) %>% pivot_wider(names_from = time, values_from = mean_exp) %>% @@ -311,7 +311,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ```{r color-by-gene_biotype2, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype))+ + color = gene_biotype)) + geom_point(alpha = 0.3) ``` @@ -320,7 +320,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, ```{r adding-diag, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype))+ + color = gene_biotype)) + geom_point(alpha = 0.3) + geom_abline(intercept = 0) ``` @@ -330,7 +330,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, ```{r color-by-gene_biotype3, cache=FALSE, purl=TRUE} ggplot(data = rna_fc, mapping = 
aes(x = time_4_vs_0, y = time_8_vs_0, - color = gene_biotype))+ + color = gene_biotype)) + geom_jitter(alpha = 0.3) + geom_abline(intercept = 0) ``` @@ -394,7 +394,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + ## ソリューション ```{r, eval=TRUE, purl=TRUE} -ggplot(data = rna, mapping = aes(y = expression_log, x = sample))+ +ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_point(aes(color = time)) ``` @@ -409,7 +409,7 @@ ggplot(data = rna, mapping = aes(y = expression_log, x = sample))+ ```{r boxplot, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample))+ + mapping = aes(y = expression_log, x = sample)) + geom_boxplot() ``` @@ -418,7 +418,7 @@ boxplotにポイントを追加することで、 ```{r boxplot-with-points, cache=FALSE, purl=TRUE} ggplot(data = rna, - mapping = aes(y = expression_log, x = sample))+ + mapping = aes(y = expression_log, x = sample)) + geom_jitter(alpha = 0.2, color = "tomato") + geom_boxplot(alpha = 0) ``` @@ -835,41 +835,41 @@ ggplot(data = mean_exp_by_time_sex, 以下のようなカスタマイズが可能です: ```{r, purl=TRUE} -# 線の太さを変更する +# change the thickness of the lines ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex))+ + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line(size=1.5) + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) -# 凡例とラベルの名前を変更する +# change the name of the legend and the labels ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex)).+ + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_discrete(name = "Gender", labels = c("F", "M")). 
+ scale_color_discrete(name = "Gender", labels = c("F", "M")) # using a different color palette ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex))+ + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2"). + scale_color_brewer(name = "Gender", labels = c("F", "M"), palette = "Dark2") -# 手動で色を指定 +# manually specifying the colors ggplot(data = mean_exp_by_time_sex, - mapping = aes(x = time, y = mean_exp, color = sex))+ + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + theme_bw() + theme(panel.grid = element_blank()) + - scale_color_manual(name = "Gender", labels = c("F", "M"), - values = c("royalblue", "deeppink") + scale_color_manual(name = "Gender", labels = c("F", "M"), + values = c("royalblue", "deeppink")) ``` @@ -893,11 +893,11 @@ log10スケールに変更した。 ```{r sub1, purl=TRUE} rna$chromosome_name <- factor(rna$chromosome_name, - levels = c(1:19, "X", "Y") + levels = c(1:19,"X","Y")) -count_gene_chromosome<- rna %>% select(chromosome_name, gene) %>% +count_gene_chromosome <- rna %>% select(chromosome_name, gene) %>% distinct() %>% ggplot() + - geom_bar(aes(x = chromosome_name), fill = "seagreen"、 + geom_bar(aes(x = chromosome_name), fill = "seagreen", position = "dodge", stat = "count") + labs(y = "log10(n genes)", x = "chromosome") + scale_y_log10() @@ -910,7 +910,7 @@ count_gene_chromosome ```{r sub2, purl=TRUE} exp_boxplot_sex <- ggplot(rna, aes(y=expression_log, x = as.factor(time), - color=sex)).+ + color=sex)) + geom_boxplot(alpha = 0) + labs(y = "Mean gene exp", x = "time") + theme(legend.position = "none") @@ -994,23 +994,23 @@ grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2) ```{r ggsave-example, eval=FALSE, purl=TRUE} my_plot <- ggplot(data = mean_exp_by_time_sex, - 
mapping = aes(x = time, y = mean_exp, color = sex)).+ + mapping = aes(x = time, y = mean_exp, color = sex)) + geom_line() + facet_wrap(~ gene, scales = "free_y") + - labs(title = "感染期間別の平均遺伝子発現", - x = "感染期間(日)", - y = "平均遺伝子発現") + - guides(color=guide_legend(title="Gender"))+ + labs(title = "Mean gene expression by duration of the infection", + x = "Duration of the infection (in days)", + y = "Mean gene expression") + + guides(color=guide_legend(title="Gender")) + theme_bw() + - theme(axis.text.x = element_text(color = "royalblue4", size = 12), - axis.text.y = element_text(color = "royalblue4", size = 12), + theme(axis.text.x = element_text(colour = "royalblue4", size = 12), + axis.text.y = element_text(colour = "royalblue4", size = 12), text = element_text(size = 16), - panel.grid = element_line(color="lightsteelblue1"), + panel.grid = element_line(colour="lightsteelblue1"), legend.position = "top") ggsave("fig_output/mean_exp_by_time_sex.png", my_plot, width = 15, height = 10) -# これは grid.arrange() プロットでも動作します +# This also works for grid.arrange() plots combo_plot <- grid.arrange(count_gene_chromosome, exp_boxplot_sex, ncol = 2, widths = c(4, 6)) ggsave("fig_output/combo_plot_chromosome_sex.png", combo_plot, @@ -1047,12 +1047,12 @@ Rに付属するデフォルトのグラフィックス・システムは、し ```{r paintermodel, fig.width=12, fig.height=4, fig.cap="Successive layers added on top of each other."} par(mfrow = c(1, 3)) -plot(1:20, main = "最初のレイヤー、plot(1:20)で作成") +plot(1:20, main = "First layer, produced with plot(1:20)") -plot(1:20, main = "赤の水平線、abline(h = 10)で追加") +plot(1:20, main = "A horizontal red line, added with abline(h = 10)") abline(h = 10, col = "red") -plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") +plot(1:20, main = "A rectangle, added with rect(5, 5, 15, 15)") abline(h = 10, col = "red") rect(5, 5, 15, 15, lwd = 3) ``` From 23a3ffb8c3194fb140c3fabc202b837e538713c9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 15:48:46 
+0900 Subject: [PATCH 271/334] New translations 60-next-steps.md (Japanese) --- locale/ja/episodes/60-next-steps.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/locale/ja/episodes/60-next-steps.Rmd b/locale/ja/episodes/60-next-steps.Rmd index eb68df4d6..da0ea2fa8 100644 --- a/locale/ja/episodes/60-next-steps.Rmd +++ b/locale/ja/episodes/60-next-steps.Rmd @@ -92,25 +92,25 @@ SummarizedExperiment\`を作成するために、 rna <- read_csv("data/rnaseq.csv") ## count matrix -counts<- rna %>% +counts <- rna %>% select(gene, sample, expression) %>% pivot_wider(names_from = sample, values_from = expression) -## matrix に変換して行名を設定 -count_matrix<- counts %>% select(-gene) %>% as.matrix() +## convert to matrix and set row names +count_matrix <- counts %>% select(-gene) %>% as.matrix() rownames(count_matrix) <- counts$gene ## sample annotation -sample_metadata<- rna %>% +sample_metadata <- rna %>% select(sample, organism, age, sex, infection, strain, time, tissue, mouse) ## remove redundancy sample_metadata <- unique(sample_metadata) ## gene annotation -gene_metadata<- rna %>% - select(gene、ENTREZID, product, ensembl_gene_id, external_synonym, +gene_metadata <- rna %>% + select(gene, ENTREZID, product, ensembl_gene_id, external_synonym, chromosome_name, gene_biotype, phenotype_description, hsapiens_homolog_associated_gene_name) @@ -299,8 +299,8 @@ function.--> ```{r, purl=FALSE} assay(se)[1:3, colData(se)$time != 4] -# -と等価 assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8]. 
+# Equivalent to +assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8] ``` ::::::::::::::::::::::::: From 7f7ebb3b603dd2b373f2d2a1cb964be8f9d0228b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:09:49 +0900 Subject: [PATCH 272/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index 9a12a01ea..b497761d9 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -318,12 +318,12 @@ este marco de datos. ## Solución ```{r, eval=TRUE, purl=TRUE} -arn %>% - mutar(expresión = log(expresión)) %>% - seleccionar(gen, nombre_cromosoma, descripción_fenotipo, muestra, expresión) %>% - filtrar(nombre_cromosoma = = "X" | nombre_cromosoma == "Y") %>% - filtro(!is.na(descripción_fenotipo)) %>% - filtro(expresión > 5) +rna %>% + mutate(expression = log(expression)) %>% + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) ``` :::::::::::::::::::::::::::: From d5a714314624dc833fe275f666c95404a5b46e05 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:24:47 +0900 Subject: [PATCH 273/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 58 +++++++++++------------ 1 file changed, 29 insertions(+), 29 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index a74715b27..734e781d0 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -113,14 +113,14 @@ Maintenant que R a « weight_kg » en mémoire, nous pouvons faire de l'arithm exemple, nous pouvons vouloir convertir ce poids en livres 
(le poids en livres est 2,2 fois le poids en kg) : ```{r, purl=TRUE} -2,2 * poids_kg +2.2 * weight_kg ``` On peut également changer la valeur d'un objet en lui attribuant une nouvelle : ```{r, purl=TRUE} -poids_kg <- 57,5 -2,2 * poids_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` Cela signifie que l'attribution d'une valeur à un objet ne modifie pas les valeurs de @@ -128,7 +128,7 @@ autres objets. Par exemple, stockons le poids de l'animal en livres dans un nouv , `weight_lb` : ```{r, purl=TRUE} -poids_lb <- 2,2 * poids_kg +weight_lb <- 2.2 * weight_kg ``` puis remplacez « weight_kg » par 100. @@ -244,7 +244,7 @@ rond(3.14159, 2) Et si vous nommez les arguments, vous pouvez changer leur ordre : ```{r, results="show", purl=TRUE} -rond(chiffres = 2, x = 3,14159) +round(digits = 2, x = 3.14159) ``` Il est recommandé de placer les arguments non facultatifs (comme le nombre que vous arrondissez @@ -371,8 +371,8 @@ vérifier le type de données de vos objets et tapez leurs noms pour voir ce qui ```{r, eval=TRUE} num_char <- c(1, 2, 3, "a") num_logical <- c(1, 2, 3, TRUE, FALSE) -char_logical <- c("a", " b", "c", VRAI) -délicat <- c(1, 2, 3, "4") +char_logical <- c("a", "b", "c", TRUE) +tricky <- c(1, 2, 3, "4") ``` ::::::::::::::: solution @@ -422,7 +422,7 @@ dans l'exemple suivant : ```{r, eval=TRUE} num_logical <- c(1, 2, 3, TRUE) char_logical <- c("a", "b", "c", TRUE) -combiné_logique <- c(num_logical, char_logical ) +combined_logical <- c(num_logical, char_logical) ``` ::::::::::::::: solution @@ -463,29 +463,29 @@ logique → numérique → caractère ← logique :::::::::::::::::::::::::::::::::::::::::::::::: ```{r, echo=FALSE, eval=FALSE, purl=TRUE} -## Nous avons vu que les vecteurs atomiques peuvent être de type caractère, numérique, entier et -## logique. Mais que se passe-t-il si nous essayons de mélanger ces types dans un seul vecteur -## ? +## We've seen that atomic vectors can be of type character, numeric, integer, and +## logical. 
But what happens if we try to mix these types in a single +## vector? -## Que va-t-il se passer dans chacun de ces exemples ? (indice : utilisez `class()` pour -## vérifier le type de données de votre objet) +## What will happen in each of these examples? (hint: use `class()` to +## check the data type of your object) num_char <- c(1, 2, 3, "a") num_logical <- c(1, 2, 3, TRUE) char_logical <- c("a", "b", "c", TRUE) -délicat <- c(1, 2 , 3, "4") +tricky <- c(1, 2, 3, "4") -## Pourquoi pensez-vous que cela arrive ? +## Why do you think it happens? -## Vous avez probablement remarqué que des objets de types différents sont -## convertis en un seul type partagé au sein d'un vecteur. Dans R, nous appelons -## convertir des objets d'une classe en une autre classe -## _coercion_. Ces conversions se produisent selon une hiérarchie, -## selon laquelle certains types sont préférentiellement contraints vers d'autres types. Pouvez-vous -## dessiner un diagramme qui représente la hiérarchie de la façon dont ces types de données -## sont forcés ? +## You've probably noticed that objects of different types get +## converted into a single, shared type within a vector. In R, we call +## converting objects from one class into another class +## _coercion_. These conversions happen according to a hierarchy, +## whereby some types get preferentially coerced into other types. Can +## you draw a diagram that represents the hierarchy of how these data +## types are coerced? ``` ## Vecteurs de sous-ensemble @@ -503,7 +503,7 @@ On peut également répéter les indices pour créer un objet avec plus d'élém que celui d'origine : ```{r, results="show", purl=TRUE} -more_molecules <- molécules[c(1, 2, 3, 2, 1, 4)] +more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)] more_molecules ``` @@ -659,7 +659,7 @@ hauteurs[complete.cases(heights)] 1. En utilisant ce vecteur de hauteurs en pouces, créez un nouveau vecteur en supprimant les NA. 
```{r} -hauteurs <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) +heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65) ``` 2. Utilisez la fonction `median()` pour calculer la médiane du vecteur `heights`. @@ -758,7 +758,7 @@ de longueur 1) et de n'importe quel type. Par exemple, si nous voulons répéter 1, 2 et 3, nous procéderions comme suit : ```{r, purl=TRUE} -représentant(c(1, 2, 3), 5) +rep(c(1, 2, 3), 5) ``` ::::::::::::::::::::::::::::::::::::::: défi @@ -774,7 +774,7 @@ possibilités - voir `?rep` ou `?sort` pour obtenir de l'aide. ## Solution ```{r, purl=TRUE} -rep(c(1, 2, 3), chacun = 5) +rep(c(1, 2, 3), each = 5) sort(rep(c(1, 2, 3), 5)) ``` @@ -908,9 +908,9 @@ Les trois arguments, `n`, `mean` et `sd`, définissent la taille de l'échantill et son écart type. Les valeurs par défaut de ce dernier sont 0 et 1. ```{r, purl=TRUE} -rnorme(5) -rnorme(5, 2, 2) -rnorme(5, 100, 5) +rnorm(5) +rnorm(5, 2, 2) +rnorm(5, 100, 5) ``` Maintenant que nous avons appris à écrire des scripts et les bases des structures de données From 676ed0af07990107dcdc25f21d01209eb4c4f40d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:24:50 +0900 Subject: [PATCH 274/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index 9ba73cd6f..c873f3fc6 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -119,8 +119,8 @@ Ahora que R tiene `weight_kg` en la memoria, podemos hacer aritmética con él. 
También podemos cambiar el valor de un objeto asignándole uno nuevo: ```{r, purl=TRUE} -weight_kg <- 57,5 -2,2 * weight_kg +weight_kg <- 57.5 +2.2 * weight_kg ``` Esto significa que asignar un valor a un objeto no cambia los valores de @@ -128,7 +128,7 @@ otros objetos. Por ejemplo, almacenemos el peso del animal en libras en un nuevo , `weight_lb`: ```{r, purl=TRUE} -weight_lb <- 2,2 * weight_kg +weight_lb <- 2.2 * weight_kg ``` y luego cambie `weight_kg` a 100. @@ -166,11 +166,11 @@ press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. ¿Cuáles son los valores después de cada declaración en lo siguiente? ```{r, purl=TRUE} -masa <- 47,5 # masa? -edad <- 122 # edad? -masa <- masa * 2.0 # masa? -edad <- edad - 20 # edad? -índice_masa <- masa/edad # índice_masa? +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? ``` :::::::::::::::::::::::::::::::::::::::::::::::::::::::: @@ -774,7 +774,7 @@ obtuviéramos cinco 1, cinco 2 y cinco 3 en ese orden? Hay dos posibilidades ## Solución ```{r, purl=TRUE} -rep(c(1, 2, 3), cada uno = 5) +rep(c(1, 2, 3), each = 5) sort(rep(c(1, 2, 3), 5)) ``` From 1a35bcc48e127d835cc77266459430e0da3889a8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:25:04 +0900 Subject: [PATCH 275/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index 03194ce7b..f3dd742bd 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -318,12 +318,12 @@ ce bloc de données ! 
## Solution ```{r, eval=TRUE, purl=TRUE} -arn %>% +rna %>% mutate(expression = log(expression)) %>% - select(gène, nom_chromosome, description_phénotype, échantillon, expression) %>% - filtre(nom_chromosome = = "X" | nom_chromo == "Y") %>% - filtre(!is.na(phenotype_description)) %>% - filtre(expression > 5) + select(gene, chromosome_name, phenotype_description, sample, expression) %>% + filter(chromosome_name == "X" | chromosome_name == "Y") %>% + filter(!is.na(phenotype_description)) %>% + filter(expression > 5) ``` ::::::::::::::::::::::::: From df8ca03cf092f3a0d5c2e1158ad68d61f5f5d685 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:25:13 +0900 Subject: [PATCH 276/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 26 ++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index 4d4b94120..a95a74924 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -101,7 +101,7 @@ ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>() \*\* spécifique en utilisant l'argument `data` ```{r, eval=FALSE} -ggplot (données = arn) +ggplot(data = rna) ``` - définir un **mapping** (en utilisant la fonction esthétique (`aes`)), en @@ -166,12 +166,12 @@ changez le nombre ou la largeur des bacs. 
## Solution ```{r, purl=TRUE} -# changer les bacs +# change bins ggplot(rna, aes(x = expression)) + geom_histogram(bins = 15) -# changer la largeur de bac -ggplot(rna, aes( x = expression)) + +# change binwidth +ggplot(rna, aes(x = expression)) + geom_histogram(binwidth = 2000) ``` @@ -377,7 +377,7 @@ library("hexbin") ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + geom_hex() + - geom_abline(intercept = 0 ) + geom_abline(intercept = 0) ``` @@ -423,8 +423,8 @@ mesures et de leur répartition : ```{r boxplot-with-points, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomate") + - geom_boxplot( alpha = 0) + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) ``` ::::::::::::::::::::::::::::::::::::::: défi @@ -444,7 +444,7 @@ Nous devrions inverser l'ordre de ces deux géométries : ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_boxplot(alpha = 0) + - geom_jitter(alpha = 0.2, color = "tomate") + geom_jitter(alpha = 0.2, color = "tomato") ``` ::::::::::::::::::::::::: @@ -460,9 +460,9 @@ les étiquettes orientées en diagonale : ```{r boxplot-xaxis-rotated, cache=FALSE, purl=TRUE} ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + - geom_jitter(alpha = 0.2, color = "tomate") + - geom_boxplot( alpha = 0) + - thème(axis.text.x = element_text(angle = 90, hjust = 0,5, vjust = 0,5)) + geom_jitter(alpha = 0.2, color = "tomato") + + geom_boxplot(alpha = 0) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::::::::::::::::: défi @@ -522,7 +522,7 @@ pas dans un boxplot. 
Une alternative au boxplot est le tracé en violon ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_violin(aes(fill = as.factor(time))) + - thème (axis.text.x = element_text (angle = 90, hjust = 0,5, vjust = 0,5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: @@ -543,7 +543,7 @@ ggplot(data = rna, ggplot(data = rna, mapping = aes(y = expression_log, x = sample)) + geom_violin(aes(fill = sex)) + - theme(axis.text .x = element_text(angle = 90, hjust = 0,5, vjust = 0,5)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` ::::::::::::::::::::::::: From 7e562df13f61668e5e4b5ab321ee993a87edf609 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 20:25:15 +0900 Subject: [PATCH 277/334] New translations 40-visualization.md (Spanish) --- locale/es/episodes/40-visualization.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/locale/es/episodes/40-visualization.Rmd b/locale/es/episodes/40-visualization.Rmd index 428b1b86f..7a46578d2 100644 --- a/locale/es/episodes/40-visualization.Rmd +++ b/locale/es/episodes/40-visualization.Rmd @@ -540,10 +540,10 @@ ggplot(data = rna, ## Solución ```{r, eval=TRUE, echo=TRUE, cache=FALSE, purl=TRUE} -ggplot(datos = rna, - mapeo = aes(y = expresión_log, x = muestra)) + - geom_violin(aes(relleno = sexo)) + - tema(eje.texto .x = elemento_texto(ángulo = 90, hjust = 0.5, vjust = 0.5)) +ggplot(data = rna, + mapping = aes(y = expression_log, x = sample)) + + geom_violin(aes(fill = sex)) + + theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` :::::::::::::::::::::::::::: From 6b1adb6f76cd3883d4d8a93f6268a2163a20af97 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 21:27:11 +0900 Subject: [PATCH 278/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 10 +++++----- 1 
file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index 734e781d0..bf6219ba6 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -166,11 +166,11 @@ press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. Quelles sont les valeurs après chaque instruction suivante ? ```{r, purl=TRUE} -masse <- 47,5 # masse ? -âge <- 122 # âge ? -masse <- masse * 2.0 # masse ? -âge <- âge - 20 # âge ? -mass_index <- masse/âge # mass_index ? +mass <- 47.5 # mass? +age <- 122 # age? +mass <- mass * 2.0 # mass? +age <- age - 20 # age? +mass_index <- mass/age # mass_index? ``` :::::::::::::::::::::::::::::::::::::::::::::::: From 1de52042448dd5f5d5fd6a4d5419fd35b9967fdf Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:27:26 +0900 Subject: [PATCH 279/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index c873f3fc6..1fc6182ab 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -886,9 +886,9 @@ set.seed(1) muestra(1:10) ``` -:::::::::::::::::::::::::::: +::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### Extraer muestras de una distribución normal @@ -917,8 +917,8 @@ Ahora que hemos aprendido cómo escribir scripts y los conceptos básicos de las de R, estamos listos para comenzar a trabajar con datos más grandes y aprender sobre marcos de datos. 
-:::::::::::::::::::::::::::::::::::::::: puntos clave +:::::::::::::::::::::::::::::::::::::::: keypoints - Cómo interactuar con R -:::::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 13251669f1726c61eec1b499d63411e2b1552139 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:42:13 +0900 Subject: [PATCH 280/334] New translations 23-starting-with-r.md (Spanish) --- locale/es/episodes/23-starting-with-r.Rmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/locale/es/episodes/23-starting-with-r.Rmd b/locale/es/episodes/23-starting-with-r.Rmd index 1fc6182ab..72b4ffa37 100644 --- a/locale/es/episodes/23-starting-with-r.Rmd +++ b/locale/es/episodes/23-starting-with-r.Rmd @@ -351,7 +351,7 @@ Hemos visto que los vectores atómicos pueden ser de tipo carácter, numérico ( doble), entero y lógico. Pero ¿qué pasa si intentamos mezclar estos tipos en un solo vector? -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -375,7 +375,7 @@ char_logic <- c("a", " b", "c", VERDADERO) complicado <- c(1, 2, 3, "4") ``` -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -400,7 +400,7 @@ tricky ¿Por qué crees que sucede? -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -425,7 +425,7 @@ char_lógico <- c("a", "b", "c", VERDADERO) combinado_lógico <- c(núm_lógico, char_lógico ) ``` -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -452,7 +452,7 @@ _coerción_. Estas conversiones ocurren según una jerarquía, dibujar un diagrama que represente la jerarquía de cómo se coaccionan estos tipos de datos ? -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -576,7 +576,7 @@ molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")] ¿Puedes entender por qué "cuatro" > "cinco" devuelve "VERDADERO"? 
-::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -665,7 +665,7 @@ alturas <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 2. Utilice la función `median()` para calcular la mediana del vector `alturas`. 3. Usa R para calcular cuántas personas en el grupo miden más de 67 pulgadas. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -722,7 +722,7 @@ Hay constructores similares para caracteres y lógicos, llamados ¿Cuáles son los valores predeterminados para los vectores lógicos y de caracteres? -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -769,7 +769,7 @@ representante(c(1, 2, 3), 5) obtuviéramos cinco 1, cinco 2 y cinco 3 en ese orden? Hay dos posibilidades ; consulte `?rep` o `?sort` para obtener ayuda. -::::::::::::::: solución +::::::::::::::: solution ## Solución @@ -857,7 +857,7 @@ Ahora establezca la semilla con, por ejemplo, `set.seed(123)` y repita el sorteo Repita colocando una semilla diferente. -::::::::::::::: solución +::::::::::::::: solution ## Solución From 47248eb70e092a79bffcaddf893f26b273f7b646 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:42:19 +0900 Subject: [PATCH 281/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd index 35709d1ee..a1851153f 100644 --- a/locale/fr/episodes/25-starting-with-data.Rmd +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -777,7 +777,7 @@ by default surround each field with quotes, and thus we will be able to read it back into R correctly, despite also using commas as column separators. 
-:::::::::::::::::::::::::::::::::::::::: points clés +:::::::::::::::::::::::::::::::::::::::: keypoints - Données tabulaires dans R From f8d95e8314841114073413fabc07d95b265ea5ce Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:42:28 +0900 Subject: [PATCH 282/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index f3dd742bd..52efc138e 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -1040,7 +1040,7 @@ Utilisons `write_csv()` pour sauvegarder la table rna\_wide que nous avons cré write_csv(rna_wide, file = "data_output/rna_wide.csv") ``` -:::::::::::::::::::::::::::::::::::::::: points clés +:::::::::::::::::::::::::::::::::::::::: keypoints - Données tabulaires dans R utilisant le méta-paquet Tidyverse From 9839fc65e5b471b2cad625e775c4c767616d69b8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:42:36 +0900 Subject: [PATCH 283/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index a95a74924..2cc9715ed 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -1100,7 +1100,7 @@ alambiquée. Une bonne référence pour le package `lattice` est @latticebook. 
-:::::::::::::::::::::::::::::::::::::::: points clés +:::::::::::::::::::::::::::::::::::::::: keypoints - Visualisation en R From 1b5fe04a3e4c1b37b27b95ba1d5238a3299b0b8b Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 17 Aug 2024 23:42:40 +0900 Subject: [PATCH 284/334] New translations 40-visualization.md (Japanese) --- locale/ja/episodes/40-visualization.Rmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/locale/ja/episodes/40-visualization.Rmd b/locale/ja/episodes/40-visualization.Rmd index b7523c920..682044f07 100644 --- a/locale/ja/episodes/40-visualization.Rmd +++ b/locale/ja/episodes/40-visualization.Rmd @@ -788,14 +788,14 @@ ggplot(data = mean_exp_by_time_sex, 以前に作成したヒストグラムを使った例です。 ```{r mean_exp-time-with-right-labels-xfont, cache=FALSE, purl=TRUE} -blue_theme <- theme(axis.text.x = element_text(color = "royalblue4", +blue_theme <- theme(axis.text.x = element_text(colour = "royalblue4", size = 12), - axis.text.y = element_text(color = "royalblue4", + axis.text.y = element_text(colour = "royalblue4", size = 12), text = element_text(size = 16), - panel.grid = element_line(color="lightsteelblue1") + panel.grid = element_line(colour="lightsteelblue1")) -ggplot(rna, aes(x = expression_log))+ +ggplot(rna, aes(x = expression_log)) + geom_histogram(bins = 20) + blue_theme ``` From 73765eb89bd61ec2f7ceaf4a6f1a49f7972f5f05 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:09:23 +0900 Subject: [PATCH 285/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index bf6219ba6..2a88c0cd6 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -888,7 +888,7 @@ set.seed(1) ::::::::::::::::::::::::: 
-:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### Extraire des échantillons à partir d'une distribution normale @@ -917,8 +917,8 @@ Maintenant que nous avons appris à écrire des scripts et les bases des structu de R, nous sommes prêts à commencer à travailler avec des données plus volumineuses et à en apprendre davantage sur les trames de données. -:::::::::::::::::::::::::::::::::::::::: points clés +:::::::::::::::::::::::::::::::::::::::: keypoints - Comment interagir avec R -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 3b02c4ce2acfdefbe8d65d1a00810aa09bd54543 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:09:31 +0900 Subject: [PATCH 286/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index 52efc138e..f3695c247 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -999,7 +999,7 @@ et placez la table dans votre Dépôt de données. À l'aide de la fonction `fu , joignez les tables `rna_mini` et `annot3`. Que s'est-il passé pour les gènes _Klk6_, _mt-Tf_, _mt-Rnr1_, _mt-Tv_, _mt-Rnr2_ et _mt-Tl1_ ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -1014,7 +1014,7 @@ du tableau ont été codées comme manquantes. 
::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Exporter des données @@ -1044,4 +1044,4 @@ write_csv(rna_wide, file = "data_output/rna_wide.csv") - Données tabulaires dans R utilisant le méta-paquet Tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 2e4e66d9b095613d4a0e024c6df7cbb92ace1c46 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:09:34 +0900 Subject: [PATCH 287/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index ec9464796..9cd50e5fc 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -1014,7 +1014,7 @@ _mt-Rnr2_ 和 _mt-Tl1_ 仅存在于 `annot3` 表中。 表中 ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## 导出数据 @@ -1044,4 +1044,4 @@ write_csv(rna_wide,文件 = “data_output/rna_wide.csv”) - 使用 tidyverse 元包在 R 中生成表格数据 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 5045f85dc4d39277672fd1540846da98b82be43d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:25:05 +0900 Subject: [PATCH 288/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 1d8e0ec4b..8b73f116a 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -883,9 +883,9 @@ set.seed(1) sample(1:10) ``` -:::::::::::::::::::::::: +::::::::::::::::::::::::: 
-:::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### 正規分布からサンプルを抽出する @@ -918,4 +918,4 @@ rnorm(5, 100, 5) - Rと対話する方法 -:::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From a2b8bfe7448c725a5c7a233fd992428a074bc0d1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:25:09 +0900 Subject: [PATCH 289/334] New translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index 77004d698..6f9424a0d 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -889,7 +889,7 @@ sample(1:10) ::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### 从正态分布中抽取样本 @@ -922,4 +922,4 @@ rnorm(5, 100, 5) - 如何与 R 交互 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 3cd8d6537c93313025ecd5935f033c61d586be8f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:44:08 +0900 Subject: [PATCH 290/334] New translations 23-starting-with-r.md (Japanese) --- locale/ja/episodes/23-starting-with-r.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/23-starting-with-r.Rmd b/locale/ja/episodes/23-starting-with-r.Rmd index 8b73f116a..758b39d60 100644 --- a/locale/ja/episodes/23-starting-with-r.Rmd +++ b/locale/ja/episodes/23-starting-with-r.Rmd @@ -1,5 +1,5 @@ --- -source: RMD +source: Rmd title: R の紹介 teaching: 60 exercises: 60 From 7aac25f7206156ad90bc1982a89f2b112df09c47 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 00:44:12 +0900 Subject: [PATCH 291/334] New 
translations 23-starting-with-r.md (Chinese Simplified) --- locale/zh/episodes/23-starting-with-r.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/23-starting-with-r.Rmd b/locale/zh/episodes/23-starting-with-r.Rmd index 6f9424a0d..33cdef738 100644 --- a/locale/zh/episodes/23-starting-with-r.Rmd +++ b/locale/zh/episodes/23-starting-with-r.Rmd @@ -1,5 +1,5 @@ --- -source: 放射科 +source: Rmd title: R 简介 teaching: 60 exercises: 60 From 1cf1a64d0604f5feb723284f0d49c52d075f7e7f Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 01:52:22 +0900 Subject: [PATCH 292/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index e81d6a514..49dca8e5e 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -333,7 +333,7 @@ _split-apply-combine_パラダイムを使ってアプローチすることが ```{r} rna %>% - group_by(遺伝子) + group_by(gene) ``` group_by()`関数はデータ処理を行わず、 @@ -390,8 +390,8 @@ rna %>% ```{r, purl=TRUE} rna %>% - group_by(遺伝子, 感染, 時間) %>% - summary(mean_expression = mean(expression), + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), median_expression = median(expression)) ``` @@ -425,14 +425,14 @@ rna %>% ```{r, purl=TRUE} rna %>% - count(感染) + count(infection) ``` count()`関数は、すでに見たことのある、変数でグループ化し、そのグループ内のオブザベーションの数をカウントして要約する、ということの省略記法です。 言い換えれば、`rna %>% count(infection)\`は次のものと等価である: ```{r, purl=TRUE} rna %>% - group_by(感染) %>% + group_by(infection) %>% summarise(n = n()) ``` From ed89ccca8f2200be83d4390a0e63f8a3cb196391 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 02:33:55 +0900 Subject: [PATCH 293/334] Fix Crowdin Markdown parser bug --- locale/ja/episodes/30-dplyr.Rmd | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git 
a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 49dca8e5e..cbdb351a4 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -586,17 +586,16 @@ pivot_wider\`は主に3つの引数を取る: 3. value_from\`: 新しいカラム を埋める値。 -\`\`{r, fig.cap="`rna`データのワイドピボット。", echo=FALSE, message=FALSE} +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") - -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` デフォルトでは、`pivot_wider()` 関数は欠損値に対して `NA` を追加することに注意してください。 From ee9449c422bc5536c791b7a100390664334113be Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 02:40:45 +0900 Subject: [PATCH 294/334] Update 30-dplyr.Rmd --- locale/ja/episodes/30-dplyr.Rmd | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index cbdb351a4..37c2a1052 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -336,11 +336,12 @@ rna %>% group_by(gene) ``` -group_by()`関数はデータ処理を行わず、 -データをサブセットにグループ化する。上の例では、 -`r nrow(rna)`オブザベーションの最初の`tibble`は、`r length(unique(rna$gene))`グループに`gene\` 変数に基づいて分割される。 +The `group_by()` function doesn't perform any data processing, it +groups the data into subsets: in the example above, our initial +`tibble` of `r nrow(rna)` observations is split into +`r length(unique(rna$gene))` groups based on the `gene` variable. 
-同様に、ティブルをサンプルごとにグループ分けすることもできる: +We could similarly decide to group the tibble by the samples: ```{r} rna %>% From 2e407fc076f554d836b2656e91f44731b6e811af Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 18 Aug 2024 02:48:16 +0900 Subject: [PATCH 295/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 168 ++++++++++++++++++++++---------- 1 file changed, 119 insertions(+), 49 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 37c2a1052..839936438 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -40,13 +40,19 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai いくつかのパッケージは、データを操作する際に私たちの作業を大いに助けてくれる。 Rのパッケージは基本的に、 +、より多くのことができるようにする追加関数のセットである。 いくつかのパッケージは、データを操作する際に私たちの作業を大いに助けてくれる。 +Rのパッケージは基本的に、 、より多くのことができるようにする追加関数のセットである。 これまで使ってきた `str()` や `data.frame()` などの関数は、Rに組み込まれています。パッケージをロードすることで、その他の 固有の関数にアクセスできるようになります。 初めてパッケージを使用する前に、 をマシンにインストールする必要がある。その後、 +R セッションでパッケージが必要になったら、毎回インポートする必要がある。 初めてパッケージを使用する前に、 +をマシンにインストールする必要がある。その後、 R セッションでパッケージが必要になったら、毎回インポートする必要がある。 -- dplyr\`\*\* パッケージは、データ操作タスクのための強力なツールを提供します。 +- dplyr\\`\*\* パッケージは、データ操作タスクのための強力なツールを提供します。 + データフレームを直接操作できるように構築されており、多くの操作タスクが + に最適化されている。 データフレームを直接操作できるように構築されており、多くの操作タスクが に最適化されている。 @@ -60,27 +66,30 @@ R セッションでパッケージが必要になったら、毎回インポー をご覧ください。 - tidyverse`**パッケージは "umbrella-package "であり、 - 、データ解析のためのいくつかの便利なパッケージがインストールされます。 - には、**tidyr`\*\*, **dplyr`**, **ggplot2`**, \*\*tibble\`\*\*などがあります。 + 、データ解析のためのいくつかの便利なパッケージがインストールされます。 + には、**tidyr`\*\*, **dplyr`**, **ggplot2`**, \*\*tibble\\`\*\*などがあります。 これらのパッケージは、データを操作したり対話したりするのに役立ちます。 サブセット化、変換、 ビジュアライズなど、データを使ってさまざまなことができる。 + サブセット化、変換、 + ビジュアライズなど、データを使ってさまざまなことができる。 セットアップを行ったのであれば、すでにtidyverseパッケージがインストールされているはずです。 ライブラリから読み込んでみて、それがあるかどうか確認してください: +ライブラリから読み込んでみて、それがあるかどうか確認してください: ```{r, message=FALSE, purl=TRUE} ## dplyr を含む tidyverse パッケージをロード 
library("tidyverse") ``` -tidyverse\`\*\* パッケージをインストールするには、以下のようにタイプしてください: +tidyverse\\`\*\* パッケージをインストールするには、以下のようにタイプしてください: ```{r, eval=FALSE, purl=TRUE} BiocManager::install("tidyverse") ``` -もし、\*\*tidyverse`**パッケージをインストールしなければならなかったなら、上記の`library()\`コマンドを使って、このRセッションでロードすることを忘れないでください! +もし、\*\*tidyverse`**パッケージをインストールしなければならなかったなら、上記の`library()\\`コマンドを使って、このRセッションでロードすることを忘れないでください! ## tidyverseでデータをロードする @@ -103,24 +112,30 @@ Tibblesは、以前 1. 各列のデータ型が列名の下に表示される。 <`dbl`\> は の小数点を持つ数値を保持するために定義されたデータ型である。 + データで作業するとき、我々はしばしば、各因子または因子の組み合わせについて + 見つかったオブザベーションの数を知りたい。 このタスクのために、\*\*dplyr`** は + `count()\\` を提供している。 例えば、感染したサンプルと感染していないサンプルそれぞれについて、 + 、データの行数をカウントしたい場合、次のようにする: 2. これは、データの最初の数行と、 1画面に収まるだけの列数だけを印刷する。 -これから、最も一般的な **dplyr\`** 関数のいくつかを学びます: +これから、最も一般的な **dplyr\\`** 関数のいくつかを学びます: -- select()\`: カラムのサブセット +- select()\\`: カラムのサブセット - `filter()`: 条件で行をサブセットする。 -- mutate()\`: 他のカラムの情報を使って新しいカラムを作成する。 -- group_by()`と`summarise()\`: グループ化されたデータの要約統計量を作成する。 -- arrange()\`:結果の並べ替え -- count()\`: 離散値を数える +- mutate()\\`: 他のカラムの情報を使って新しいカラムを作成する。 +- group_by()`と`summarise()\\`: グループ化されたデータの要約統計量を作成する。 +- arrange()\\`:結果の並べ替え +- count()\\`: 離散値を数える ## 列の選択と行のフィルタリング データフレームの列を選択するには `select()` を使う。 この関数の最初の引数 はデータフレーム (`rna`) で、続く -の引数は保持する列です。 +の引数は保持する列です。 Tibblesは、以前 +で紹介したデータ・フレーム・オブジェクトの動作の一部を微調整している。 データ構造はデータフレームによく似ている。 +我々の目的にとって、唯一の違いはそれだ: ```{r, purl=TRUE} select(rna, gene, sample, tissue, expression) @@ -146,6 +161,9 @@ filter(rna, sex == "Male" & infection == "NonInfected") ここで、このデータセットで解析されたマウス 遺伝子のヒトホモログに興味があるとしよう。 この情報は、 `hsapiens_homolog_associated_gene_name` という名前の `rna` tibbleの +最後のカラムにある。 ここで、このデータセットで解析されたマウス +遺伝子のヒトホモログに興味があるとしよう。 この情報は、 +`hsapiens_homolog_associated_gene_name` という名前の `rna` tibbleの 最後のカラムにある。 簡単に視覚化するために、 、2つの列`gene`と `hsapiens_homolog_associated_gene_name`だけを含む新しいテーブルを作成する。 @@ -175,6 +193,7 @@ filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ## パイプ 選択とフィルタを同時に行いたい場合は? 
これを行うには、 +、中間ステップ、ネストされた関数、パイプの3つの方法がある。 これを行うには、 、中間ステップ、ネストされた関数、パイプの3つの方法がある。 中間ステップでは、一時的なデータフレームを作成し、 @@ -188,6 +207,7 @@ rna3 これは読みやすいが、 、個別に名前を付けなければならない中間オブジェクトがたくさんあるため、ワークスペースが散らかる可能性がある。 複数の +、それを把握するのは難しいかもしれない。 複数の 、それを把握するのは難しいかもしれない。 、関数を入れ子にすることもできる: @@ -204,7 +224,7 @@ Rは式を内側から外側へと評価する(この場合、フィルタリ 、次の関数に直接送ることができる。これは、同じデータセットに対して多くの処理を行う必要がある場合に便利である 。 -R のパイプは `%>%` (**`magrittr`** +ミューテート R のパイプは `%>%` (**`magrittr`** パッケージで利用可能) または `|>` (ベース R で利用可能) のように見えます。 RStudioを使用する場合は、 PCをお持ちの場合は<kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>M</kbd>、 Macをお持ちの場合は<kbd>Cmd</kbd>+<kbd>Shift</kbd>+<kbd>Mで</kbd>パイプを @@ -226,12 +246,12 @@ rna %>% select(gene, sample, tissue, expression) ``` -パイプを "then "のように読むことが役に立つと思う人もいるだろう。 例えば、 +パイプを "then "のように読むことが役に立つと思う人もいるだろう。 パイプを "then "のように読むことが役に立つと思う人もいるだろう。 例えば、 上の例では、データフレーム `rna` を取得し、`sex=="Male"` の行を で `フィルター`し、`gene`, `sample`, `tissue`, `expression` の列を `選択` した。 -dplyr\`\*\*関数はそれ自体ではやや単純だが、 +dplyr\\`\*\*関数はそれ自体ではやや単純だが、 、パイプを使った線形ワークフローに組み合わせることで、 、データフレームのより複雑な操作を行うことができる。 @@ -274,7 +294,7 @@ rna %>% 例えば、単位変換をしたり、2つの 列の値の比率を求めたりするために、既存の -列の値に基づいて新しい列を作成したいことがよくあります。 これには `mutate()` を使う。 +列の値に基づいて新しい列を作成したいことがよくあります。 これには `mutate()` を使う。 これには `mutate()` を使う。 時間単位の新しい列を作成する: @@ -297,9 +317,13 @@ rna %>% ## チャレンジ -以下の -条件を満たす `rna` データから新しいデータフレームを作成する: `gene`、`chromosome_name`、 -`phenotype_description`、`sample`、`expression` 列のみを含む。 +後述するように、 +、特定の分析や視覚化を行うために、データフレームの形を変えたいことがある。 tidyr\`\*\*パッケージは、 + 、データの形を変えるというこの一般的な問題に対処し、 + データを整然と操作するためのツールを提供する。 +の値は対数変換する。 以下の +条件を満たす `rna`データから新しいデータフレームを作成する:`gene`、`chromosome_name`、 +`phenotype_description`、`sample`、`expression\` 列のみを含む。 の値は対数変換する。 このデータフレームは、 、性染色体に位置し、 phenotype_descriptionに関連し、log expressionが5より高い遺伝子のみを含んでいなければならない。 @@ -329,19 +353,19 @@ rna %>% 多くのデータ分析タスクは、 _split-apply-combine_パラダイムを使ってアプローチすることができる:データをグループに分割し、各グループにいくつかの 分析を適用し、その結果を組み合わせる。 \*\*dplyr`** -は `group_by()\` 関数を使って、これを非常に簡単にしている。 +は `group_by()\` 関数を使って、これを非常に簡単にしている。 **dplyr`\*\* 
+は `group_by()` 関数を使って、これを非常に簡単にしている。 ```{r} rna %>% group_by(gene) ``` -The `group_by()` function doesn't perform any data processing, it -groups the data into subsets: in the example above, our initial -`tibble` of `r nrow(rna)` observations is split into -`r length(unique(rna$gene))` groups based on the `gene` variable. +group_by()`関数はデータ処理を行わず、 +データをサブセットにグループ化する。上の例では、 +`r nrow(rna)`オブザベーションの最初の`tibble`は、`r length(unique(rna$gene))`グループに`gene\\` 変数に基づいて分割される。 -We could similarly decide to group the tibble by the samples: +同様に、ティブルをサンプルごとにグループ分けすることもできる: ```{r} rna %>% @@ -356,10 +380,11 @@ rna %>% ### summarise()\`関数 -group_by()`は`summarise()\` と一緒に使われることが多く、 +group_by()`は`summarise()\\` と一緒に使われることが多く、 は各グループを1行の要約に折りたたむ。 -group_by()\` は、 +group_by()` は、 **カテゴリー** 変数を含むカラム名を引数として取り、 +統計のサマリーを計算します。 group_by()\` は、 **カテゴリー** 変数を含むカラム名を引数として取り、 統計のサマリーを計算します。 そこで、遺伝子ごとの平均「発現」を計算する: @@ -387,6 +412,7 @@ rna %>% いったんデータがグループ化されると、同じ (必ずしも同じ変数でなくてもよい)時間に複数の変数を要約することもできる。 例えば、遺伝子別、条件別の「発現」の中央値を示す +列を追加することができる: 例えば、遺伝子別、条件別の「発現」の中央値を示す 列を追加することができる: ```{r, purl=TRUE} @@ -429,7 +455,7 @@ rna %>% count(infection) ``` -count()`関数は、すでに見たことのある、変数でグループ化し、そのグループ内のオブザベーションの数をカウントして要約する、ということの省略記法です。 言い換えれば、`rna %>% count(infection)\`は次のものと等価である: +count()`関数は、すでに見たことのある、変数でグループ化し、そのグループ内のオブザベーションの数をカウントして要約する、ということの省略記法です。 言い換えれば、`rna %>% count(infection)\`は次のものと等価である: 言い換えれば、`rna %>% count(infection)\`は次のものと等価である: ```{r, purl=TRUE} rna %>% @@ -437,6 +463,8 @@ rna %>% summarise(n = n()) ``` +先ほどの例では、`count()` を使って、_1つの_要因(つまり`感染`)について +、行数/観察数を数えている。 先ほどの例では、`count()` を使って、_1つの_要因(つまり`感染`)について 、行数/観察数を数えている。 もし、`感染`と`時間`のような_要因の組み合わせ_をカウントしたいのであれば、 @@ -486,7 +514,7 @@ rna %>% ## チャレンジ 1. 各サンプルで分析された遺伝子の数は? -2. group_by()`と `summarise()\`を使用して、各サンプルのシーケンス深度(全カウントの合計)を評価する。 シーケンス深度が最も深いサンプルはどれですか? +2. group_by()`と `summarise()\\`を使用して、各サンプルのシーケンス深度(全カウントの合計)を評価する。 シーケンス深度が最も深いサンプルはどれですか? シーケンス深度が最も深いサンプルはどれですか? 3. サンプルを1つ選び、バイオタイプ別に遺伝子数を評価する。 4. 
DNAメチル化異常」という表現型に関連する遺伝子を特定し、時間0、時間4、時間8における平均発現量(対数)を計算する。 @@ -522,8 +550,11 @@ rna %>% ## データの再構築 -rna`tibble の行には、`gene`と`sample\` という2つの変数の組み合わせに関連付けられた発現値(単位)が格納されている。 +rna`tibble の行には、`gene`と`sample\\` という2つの変数の組み合わせに関連付けられた発現値(単位)が格納されている。 +比較を容易にするために、結果を並べ替えると便利なことがある。 +arrange()\\`を使って表を並べ替えることができる。 +例えば、上の表を時間順に並べたいとする: または遺伝子(gene_biotype, ENTREZ_ID, product, ...)。 その他の列はすべて、 (生物、年齢、性別、...)のいずれかを記述する変数に対応している。 または遺伝子(gene_biotype, ENTREZ_ID, product, ...)。 遺伝子やサンプルによって変化しない変数は、すべての行で同じ値を持つ。 @@ -540,6 +571,8 @@ rna %>% `wide-format`がよりコンパクトにデータを表現する方法として好まれる。 これは通常、科学者が 、行が遺伝子、列がサンプルを表す行列として見るのに慣れている遺伝子発現値の場合である。 +これは通常、科学者が +、行が遺伝子、列がサンプルを表す行列として見るのに慣れている遺伝子発現値の場合である。 このフォーマットでは、 、サンプル内の遺伝子発現レベルとサンプル間の遺伝子発現レベル @@ -553,7 +586,7 @@ rna %>% ``` rna`の遺伝子発現値をワイドフォーマットに変換するには、 -、`sample\`カラムの値が +、`sample\\`カラムの値が 、カラム変数の名前になる新しいテーブルを作成する必要がある。 ここでの重要なポイントは、我々はまだ @@ -570,37 +603,39 @@ rna`の遺伝子発現値をワイドフォーマットに変換するには、 ### より広いフォーマットへのデータのピボット -rna`の最初の3列を選択し、`pivot_wider()\` +rna`の最初の3列を選択し、`pivot_wider()\\` を使ってデータをワイドフォーマットに変換してみよう。 ```{r, purl=TRUE} -rna_exp<- rna %>% +rna_exp <- rna %>% select(gene, sample, expression) rna_exp ``` -pivot_wider\`は主に3つの引数を取る: +pivot_wider\\`は主に3つの引数を取る: 1. 変換されるデータ; 2. the `names_from` : その値が新しいカラム の名前になるカラム; -3. value_from\`: 新しいカラム +3. 
value_from\\`: 新しいカラム を埋める値。 -```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -``` + +```` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -``` +```` デフォルトでは、`pivot_wider()` 関数は欠損値に対して `NA` を追加することに注意してください。 何らかの理由で、あるサンプルで +、いくつかの遺伝子の発現値が欠落していたとしよう。 何らかの理由で、あるサンプルで 、いくつかの遺伝子の発現値が欠落していたとしよう。 以下の架空の例では、遺伝子Cyp2d22の発現値はGSM2545338サンプルの 。 @@ -616,7 +651,7 @@ rna_with_missing_values デフォルトでは、`pivot_wider()`関数は、 の値が見つからない場合に `NA` を追加する。 これは、 -`pivot_wider()` 関数の `values_fill` 引数でパラメータ化できる。 +`pivot_wider()` 関数の `values_fill` 引数でパラメータ化できる。 summarise()\\`関数 ```{r, purl=TRUE} rna_with_missing_values %>% @@ -634,14 +669,16 @@ rna_with_missing_values %>% 逆の状況では、カラム名を使い、 、新しい変数のペアに変えている。 一方の変数はカラム名を の値で表し、もう一方の変数にはカラム名に関連付けられている以前の値 +が格納されている。 一方の変数はカラム名を +の値で表し、もう一方の変数にはカラム名に関連付けられている以前の値 が格納されている。 -pivot_longer()\`は主に4つの引数を取る: +pivot_longer()\\`は主に4つの引数を取る: 1. 変換されるデータ; -2. names_to\`: +2. names_to\\`: の現在のカラム名で作成したい新しいカラム名; -3. value_to\`: 作成したい新しいカラム名で、 +3. value_to\\`: 作成したい新しいカラム名で、 の現在の値を格納する; 4. 変数 `names_to` と `values_to` に格納する(または削除する)列の名前。 @@ -669,9 +706,13 @@ rna_long また、 、どのようなカラムを含めるかという指定も使えたはずだ。 これは、 のカラムが多数あり、 +のままにしておくよりも、何を集めるかを指定する方が簡単な場合に便利である。 また、 +、どのようなカラムを含めるかという指定も使えたはずだ。 これは、 +のカラムが多数あり、 のままにしておくよりも、何を集めるかを指定する方が簡単な場合に便利である。 ここで、`starts_with()`関数を使えば、 のサンプル名をすべてリストアップすることなく取得することができる! もう一つの可能性は `:` 演算子を使うことである! +もう一つの可能性は `:` 演算子を使うことである! 
```{r} rna_wide %>% @@ -710,6 +751,9 @@ wide_with_NA %>% ## 質問 +マウス遺伝子の中にはヒトにホモログがないものもある。 これらは、 +`filter()` と、 +何かが `NA` かどうかを判定する `is.na()` 関数を使って取得することができる。 rnaテーブルから始めて、`pivot_wider()`関数を使用して、 、各マウスの遺伝子発現レベルを示すワイドフォーマットのテーブルを作成する。 そして、`pivot_longer()`関数を使って、ロングフォーマットの表を復元する。 @@ -736,7 +780,7 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ## 質問 -rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\` を +rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\\` を 行、各染色体に位置する遺伝子の平均発現量を値として、 以下のようにデータフレームを広げる: @@ -791,7 +835,7 @@ rna_1 %>% ## 質問 -rna\`データセットを使って、 +rna\\`データセットを使って、 各行が遺伝子の平均発現量を表し、 各列が異なるタイムポイントを表す発現行列を作成する。 @@ -818,9 +862,10 @@ rna_time<- rna %>% rna_time ``` +これにより、数字で始まるカラム名を持つティブルが生成されることに注意。 これにより、数字で始まるカラム名を持つティブルが生成されることに注意。 タイムポイントに対応するカラムを選択したい場合、 -、カラム名を直接使うことはできない。 列4を選択するとどうなるか? +、カラム名を直接使うことはできない。 列4を選択するとどうなるか? 列4を選択するとどうなるか? ```{r} rna %>% @@ -831,7 +876,7 @@ rna %>% select(gene, 4) ``` -タイムポイント4を選択するには、"˶\`" というバックスティックを付けたカラム名を引用しなければならない。 +タイムポイント4を選択するには、"˶\\`" というバックスティックを付けたカラム名を引用しなければならない。 ```{r} rna %>% @@ -866,6 +911,7 @@ rna %>% 、タイムポイント8とタイムポイント0の間のfold-changes、およびタイムポイント8とタイムポイント4の間のfold-changes を含む新しい列を作成する。 この表を、計算されたフォールド・チェンジを集めたロングフォーマットの表に変換する。 +この表を、計算されたフォールド・チェンジを集めたロングフォーマットの表に変換する。 ::::::::::::::: solution @@ -903,18 +949,26 @@ rna_time %>% 実生活の多くの場面で、データは複数のテーブルにまたがっている。 通常このようなことが起こるのは、異なる情報源から異なるタイプの情報が 収集されるからである。 +通常このようなことが起こるのは、異なる情報源から異なるタイプの情報が +収集されるからである。 分析によっては、2つ以上のテーブル( )のデータを、すべてのテーブルに共通するカラム( )に基づいて1つのデータフレームにまとめることが望ましい場合がある。 -dplyr\` パッケージは、指定されたカラム内のマッチに基づいて、2つの +dplyr\\` パッケージは、指定されたカラム内のマッチに基づいて、2つの データフレームを結合するための結合関数のセットを提供する。 ここでは、 、結合について簡単に紹介する。 詳しくは、 テーブル ジョインの章を参照されたい。 データ変換チート シート +、テーブル結合に関する簡単な概要も提供している。 ここでは、 +、結合について簡単に紹介する。 詳しくは、 +テーブル +ジョインの章を参照されたい。 +データ変換チート +シート 、テーブル結合に関する簡単な概要も提供している。 、元の`rna`テーブルをサブセットして作成し、 @@ -928,6 +982,7 @@ rna_mini ``` 2番目のテーブル`annot1`には、遺伝子と +gene_descriptionの2つのカラムがある。 2番目のテーブル`annot1`には、遺伝子と 
gene_descriptionの2つのカラムがある。 [download annot1.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot1.csv) リンクをクリックして`data/`フォルダに移動するか、 @@ -946,6 +1001,10 @@ annot1 に一致する共通変数を自動的に見つける。 この場合、`gene`は共通の 。 このような変数をキーと呼ぶ。 キーは、 オブザベーションを異なるテーブル間でマッチさせるために使用される。 +関数は、最初のテーブルと2番目のテーブルの列 +に一致する共通変数を自動的に見つける。 この場合、`gene`は共通の +。 このような変数をキーと呼ぶ。 キーは、 +オブザベーションを異なるテーブル間でマッチさせるために使用される。 ```{r} full_join(rna_mini, annot1) @@ -955,6 +1014,7 @@ full_join(rna_mini, annot1) annot2`テーブルは、遺伝子名を含む 変数のラベルが異なる以外は、`annot1`と全く同じである。 この場合も、 [download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) +、自分で`data/\`に移動するか、以下のRコードを使う。 この場合も、 [download annot2.csv](https://carpentries-incubator.github.io/bioc-intro/data/annot2.csv) 、自分で`data/\`に移動するか、以下のRコードを使う。 ```{r, message=FALSE} @@ -965,6 +1025,7 @@ annot2 ``` どの変数名も一致しない場合、マッチングに使用する +変数を手動で設定することができる。 どの変数名も一致しない場合、マッチングに使用する 変数を手動で設定することができる。 これらの変数は、`rna_mini` と `annot2` テーブルを使用して以下に示すように、 `by` 引数を使用して設定することができる。 @@ -981,6 +1042,8 @@ full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) [こちら](https://carpentries-incubator.github.io/bioc-intro/data/annot3.csv) をクリックして `annot3` テーブルをダウンロードし、そのテーブルをあなたの data/ リポジトリに置いてください。 full_join()`関数を使用して、テーブル`rna_mini`と`annot3\` を結合する。 +、遺伝子_Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_、_mt-Tl1_はどうなったのか? full_join()` +関数を使用して、テーブル `rna_mini` と `annot3` を結合する。 、遺伝子_Klk6_、_mt-Tf_、_mt-Rnr1_、_mt-Tv_、_mt-Rnr2_、_mt-Tl1_はどうなったのか? 
::::::::::::::: solution @@ -994,6 +1057,7 @@ full_join(rna_mini, annot3) 遺伝子_Klk6_は`rna_mini`にのみ存在し、遺伝子_mt-Tf_、_mt-Rnr1_、_mt-Tv_、 _mt-Rnr2_、_mt-Tl1_は`annot3`テーブルにのみ存在する。 表の +変数のそれぞれの値は、欠損として符号化されている。 表の 変数のそれぞれの値は、欠損として符号化されている。 ::::::::::::::::::::::::: @@ -1002,7 +1066,7 @@ _mt-Rnr2_、_mt-Tl1_は`annot3`テーブルにのみ存在する。 表の ## データのエクスポート -dplyr\`を使って、 +dplyr\\`を使って、 から情報を抽出したり、生データを要約したりする方法を学んだので、これらの新しいデータセットをエクスポートして、 を共同研究者と共有したり、アーカイブしたりしたいと思うかもしれない。 @@ -1017,8 +1081,14 @@ write_csv()`を使う前に、生成されたデータセットを格納する 、削除したり変更したりしないように、そのままにしておく。 対照的に、このスクリプトは`data_output` ディレクトリの内容を生成するので、そこに含まれるファイルが削除されても、 再生成することができる。 +、生成されたデータセットを生データと同じディレクトリに書き込みたくない。 +別々にするのは良い習慣だ。 data`フォルダーには、 +、変更されていない生のデータだけを入れておく。 +、削除したり変更したりしないように、そのままにしておく。 対照的に、このスクリプトは`data_output\` +ディレクトリの内容を生成するので、そこに含まれるファイルが削除されても、 +再生成することができる。 -write_csv()\`を使用して、以前に作成したrna_wideテーブルを保存しよう。 +write_csv()\\`を使用して、以前に作成したrna_wideテーブルを保存しよう。 ```{r, purl=TRUE, eval=FALSE} write_csv(rna_wide, file = "data_output/rna_wide.csv") From 67ad366f2ec051b64afd76cd85b767689cd0fdaf Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 14:17:41 +0900 Subject: [PATCH 296/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 106 +++++++++++----------- 1 file changed, 53 insertions(+), 53 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index 2a88c0cd6..79b0bd6b3 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -134,10 +134,10 @@ weight_lb <- 2.2 * weight_kg puis remplacez « weight_kg » par 100. ```{r} -poids_kg <- 100 +weight_kg <- 100 ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -159,7 +159,7 @@ you only want to comment out one line, you can put the cursor at any location of that line (i.e. 
no need to select the whole line), then press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -210,7 +210,7 @@ de votre choix qui sera utilisée à la place de la valeur par défaut. Essayons une fonction qui peut prendre plusieurs arguments : `round()`. ```{r, results="show", purl=TRUE} -rond(3.14159) +round(3.14159) ``` Ici, nous avons appelé `round()` avec un seul argument, `3.14159`, et il a @@ -220,25 +220,25 @@ des informations sur la fonction `round`. Nous pouvons utiliser `args(round)` o pour cette fonction en utilisant `?round`. ```{r, results="show", purl=TRUE} -arguments (rond) +args(round) ``` ```{r, eval=FALSE, purl=TRUE} -?rond +?round ``` Nous voyons que si nous voulons un nombre différent de chiffres, nous pouvons taper `digits=2` ou autant que nous le voulons. ```{r, results="show", purl=TRUE} -rond(3.14159, chiffres = 2) +round(3.14159, digits = 2) ``` Si vous fournissez les arguments exactement dans le même ordre que celui dans lequel ils sont définis, vous n'avez pas besoin de les nommer : ```{r, results="show", purl=TRUE} -rond(3.14159, 2) +round(3.14159, 2) ``` Et si vous nommez les arguments, vous pouvez changer leur ordre : @@ -264,15 +264,15 @@ la fonction `c()`. Par exemple, nous pouvons créer un vecteur de poids d'animau à un nouvel objet `weight_g` : ```{r, purl=TRUE} -poids_g <- c(50, 60, 65, 82) -poids_g +weight_g <- c(50, 60, 65, 82) +weight_g ``` Un vecteur peut également contenir des caractères : ```{r, purl=TRUE} -molécules <- c("adna", "rna", "protein") -molécules +molecules <- c("dna", "rna", "protein") +molecules ``` Les guillemets autour de « adn », « arn », etc. sont ici essentiels. Sans les guillemets @@ -284,8 +284,8 @@ Il existe de nombreuses fonctions qui vous permettent d'inspecter le contenu d'u . 
`length()` vous indique combien d'éléments se trouvent dans un vecteur particulier : ```{r, purl=TRUE} -longueur (poids_g) -longueur (molécules) +length(weight_g) +length(molecules) ``` Une caractéristique importante d'un vecteur est que tous les éléments sont du @@ -293,8 +293,8 @@ même type de données. La fonction `class()` indique la classe (le type d'él ) d'un objet : ```{r, purl=TRUE} -classe (poids_g) -classe (molécules) +class(weight_g) +class(molecules) ``` La fonction `str()` fournit un aperçu de la structure d'un objet @@ -302,16 +302,16 @@ et de ses éléments. C'est une fonction utile lorsque vous travaillez avec des objets volumineux et complexes : ```{r, purl=TRUE} -str(poids_g) -str(molécules) +str(weight_g) +str(molecules) ``` Vous pouvez utiliser la fonction `c()` pour ajouter d'autres éléments à votre vecteur : ```{r} -poids_g <- c(poids_g, 90) # ajouter à la fin du vecteur -poids_g <- c(30, poids_g) # ajouter au début du vecteur -poids_g +weight_g <- c(weight_g, 90) # add to the end of the vector +weight_g <- c(30, weight_g) # add to the beginning of the vector +weight_g ``` Dans la première ligne, nous prenons le vecteur d'origine `weight_g`, ajoutons la valeur @@ -343,7 +343,7 @@ Les vecteurs sont l'une des nombreuses **structures de données** utilisées par importants sont les listes (`list`), les matrices (`matrix`), les trames de données (`data.frame`), les facteurs (`factor`) et les tableaux (`array` ). -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -351,7 +351,7 @@ Nous avons vu que les vecteurs atomiques peuvent être de type caractère, numé double), entier et logique. Mais que se passe-t-il si nous essayons de mélanger ces types dans un seul vecteur ? 
-::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -359,9 +359,9 @@ R les convertit implicitement pour qu'ils soient tous du même type ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -375,7 +375,7 @@ char_logical <- c("a", "b", "c", TRUE) tricky <- c(1, 2, 3, "4") ``` -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -392,15 +392,15 @@ tricky ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: Pourquoi pensez-vous que cela arrive ? -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -410,9 +410,9 @@ ne perd aucune information. ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -435,14 +435,14 @@ dans `num_logical` est converti en `1` avant d'être converti en `"1"` dans `combined_logical`. ```{r} -combiné_logique +combined_logical ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -452,7 +452,7 @@ selon laquelle certains types sont préférentiellement contraints vers d'autres dessiner un diagramme qui représente la hiérarchie de la façon dont ces types de données sont forcés ? 
-::::::::::::::: solution
+::::::::::::::: solution
 
 ## Solution
 
@@ -460,7 +460,7 @@ logique → numérique → caractère ← logique
 
 :::::::::::::::::::::::::
 
-::::::::::::::::::::::::::::::::::::::::::::::::
+::::::::::::::::::::::::::::::::::::::::::::::::::
 
 ```{r, echo=FALSE, eval=FALSE, purl=TRUE}
 ## We've seen that atomic vectors can be of type character, numeric, integer, and
@@ -494,9 +494,9 @@ Si l'on veut extraire une ou plusieurs valeurs d'un vecteur, il faut fournir un
 ou plusieurs indices entre crochets. Par exemple:
 
 ```{r, results="show", purl=TRUE}
-molécules <- c("ADN", "arn", "peptide", "protéine")
-molécules[2]
-molécules[c(3, 2)]
+molecules <- c("dna", "rna", "peptide", "protein")
+molecules[2]
+molecules[c(3, 2)]
 ```
 
 On peut également répéter les indices pour créer un objet avec plus d'éléments
@@ -516,10 +516,10 @@ Enfin, il est également possible d'obtenir tous les éléments d'un vecteur
 sauf certains éléments spécifiés en utilisant des indices négatifs :
 
 ```{r}
-molécules ## toutes les molécules
-molécules[-1] ## toutes sauf la première
-molécules[-c(1, 3)] ## toutes sauf les 1ère/3ème
-molécules[c(-1, -3)] ## toutes sauf les 1ère/3ème
+molecules ## all molecules
+molecules[-1] ## all but the first one
+molecules[-c(1, 3)] ## all but 1st/3rd ones
+molecules[c(-1, -3)] ## all but 1st/3rd ones
 ```
 
 ## Sous-ensemble conditionnel
@@ -528,8 +528,8 @@ Une autre méthode courante de sous-ensemble consiste à utiliser un vecteur log
 sélectionnera l'élément avec le même index, tandis que `FALSE` ne le fera pas :
 
 ```{r, purl=TRUE}
-poids_g <- c(21, 34, 39, 54, 55)
-poids_g[c(VRAI, FAUX, VRAI, VRAI, FAUX)]
+weight_g <- c(21, 34, 39, 54, 55)
+weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
 ```
 
 Généralement, ces vecteurs logiques ne sont pas tapés à la main, mais sont la sortie
@@ -548,8 +548,8 @@ Vous pouvez combiner plusieurs tests en utilisant `&` (les deux conditions sont
 AND) ou `|` (au moins une des conditions est vraie, OR) :
 
 ```{r, results="show", purl=TRUE}
-poids_g[poids_g < 30 | poids_g > 50]
-poids_g[poids_g >= 30 & poids_g == 21]
+weight_g[weight_g < 30 | weight_g > 50]
+weight_g[weight_g >= 30 & weight_g == 21]
 ```
 
 Ici, `<` signifie "inférieur à", `>` pour "supérieur à", `>=` pour
@@ -570,13 +570,13 @@ molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")
 molecules[molecules %in% c("rna", "dna", "metabolite", "peptide", "glycerol")]
 ```
 
-::::::::::::::::::::::::::::::::::::::: défi
+::::::::::::::::::::::::::::::::::::::: challenge
 
 ## Défi:
 
 Pouvez-vous comprendre pourquoi « quatre » > « cinq » renvoie « VRAI » ?
 
-::::::::::::::: solution
+::::::::::::::: solution
 
 ## Solution
 
@@ -652,7 +652,7 @@ na.omit(heights)
 hauteurs[complete.cases(heights)]
 ```
 
-::::::::::::::::::::::::::::::::::::::: défi
+::::::::::::::::::::::::::::::::::::::: challenge
 
 ## Défi:
 
@@ -716,7 +716,7 @@ numérique(0)
 Il existe des constructeurs similaires pour les caractères et les logiques, nommés respectivement
 `character()` et `logical()`.
 
-::::::::::::::::::::::::::::::::::::::: défi
+::::::::::::::::::::::::::::::::::::::: challenge
 
 ## Défi:
 
@@ -761,7 +761,7 @@ de longueur 1) et de n'importe quel type.
Par exemple, si nous voulons répéter rep(c(1, 2, 3), 5) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -838,7 +838,7 @@ sur `TRUE` : échantillon (1:5, 10, remplacer = VRAI) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: From b7193aaf0eba096df9ba1c99e4555df5176868c0 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 15:14:50 +0900 Subject: [PATCH 297/334] New translations 23-starting-with-r.md (French) --- locale/fr/episodes/23-starting-with-r.Rmd | 88 +++++++++++------------ 1 file changed, 44 insertions(+), 44 deletions(-) diff --git a/locale/fr/episodes/23-starting-with-r.Rmd b/locale/fr/episodes/23-starting-with-r.Rmd index 79b0bd6b3..e352236e0 100644 --- a/locale/fr/episodes/23-starting-with-r.Rmd +++ b/locale/fr/episodes/23-starting-with-r.Rmd @@ -581,7 +581,7 @@ Pouvez-vous comprendre pourquoi « quatre » > « cinq » renvoie « VRAI » ? ## Solution ```{r} -"quatre" > "cinq" +"four" > "five" ``` Lorsque vous utilisez `>` ou `<` sur des chaînes, R compare leur ordre alphabétique. @@ -590,7 +590,7 @@ Ici, `"quatre"` vient après `"cinq"`, et est donc _supérieur à_ ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Des noms @@ -600,9 +600,9 @@ récupérés. ```{r} x <- c(1, 5, 3, 5, 10) -noms(x) ## pas de noms -noms(x) <- c("A", "B", " C", "D", "E") -noms(x) ## maintenant nous avons des noms +names(x) ## no names +names(x) <- c("A", "B", "C", "D", "E") +names(x) ## now we have names ``` Lorsqu'un vecteur possède des noms, il est possible d'accéder aux éléments par leur nom @@ -638,18 +638,18 @@ avec les fonctions `is.na()`, `na.omit()` et `complete.cases()`. Voir ci-dessous pour des exemples. ```{r} -## Extrayez les éléments pour lesquels il ne manque pas de valeurs. +## Extract those elements which are not missing values. 
heights[!is.na(heights)] -## Renvoie l'objet avec les cas incomplets supprimés. -## L'objet retourné est un vecteur atomique de type `"numeric"` -## (ou `"double"`). +## Returns the object with incomplete cases removed. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). na.omit(heights) -## Extrayez les éléments qui sont des cas complets. -## L'objet retourné est un vecteur atomique de type `"numeric"` -## (ou `"double"`). -hauteurs[complete.cases(heights)] +## Extract those elements which are complete cases. +## The returned object is an atomic vector of type `"numeric"` +## (or `"double"`). +heights[complete.cases(heights)] ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -665,28 +665,28 @@ heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 2. Utilisez la fonction `median()` pour calculer la médiane du vecteur `heights`. 3. Utilisez R pour déterminer combien de personnes dans l’ensemble mesurent plus de 67 pouces. -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, purl=TRUE} heights_no_na <- heights[!is.na(heights)] -## ou +## or heights_no_na <- na.omit(heights) ``` ```{r, purl=TRUE} -médiane (hauteurs, na.rm = TRUE) +median(heights, na.rm = TRUE) ``` ```{r, purl=TRUE} -hauteurs_above_67 <- heights_no_na[heights_no_na > 67] -longueur(heights_above_67) +heights_above_67 <- heights_no_na[heights_no_na > 67] +length(heights_above_67) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Génération de vecteurs {#sec:genvec} @@ -702,15 +702,15 @@ générer un vecteur de valeurs numériques, on peut utiliser le constructeur `n . Les valeurs seront initialisées à 0. 
```{r, purl=TRUE} -numérique(3) -numérique(10) +numeric(3) +numeric(10) ``` Notez que si l'on demande un vecteur de numériques de longueur 0, on obtient exactement cela : ```{r, purl=TRUE} -numérique(0) +numeric(0) ``` Il existe des constructeurs similaires pour les caractères et les logiques, nommés respectivement @@ -722,18 +722,18 @@ Il existe des constructeurs similaires pour les caractères et les logiques, nom Quelles sont les valeurs par défaut pour les caractères et les vecteurs logiques ? -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r, purl=TRUE} -caractère(2) ## le caractère vide -logique(2) ## FALSE +character(2) ## the empty character +logical(2) ## FALSE ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### Répliquer des éléments @@ -742,7 +742,7 @@ fois. Si nous voulons initier un vecteur de numériques de longueur 5 avec la valeur -1, par exemple, nous pourrions faire ce qui suit : ```{r, purl=TRUE} -représentant(-1, 5) +rep(-1, 5) ``` De même, pour générer un vecteur rempli de valeurs manquantes, ce qui @@ -750,7 +750,7 @@ est souvent une bonne façon de commencer, sans poser d'hypothèses sur les donn à collecter : ```{r, purl=TRUE} -représentant(NA, 5) +rep(NA, 5) ``` `rep` peut prendre en entrée des vecteurs de n'importe quelle longueur (ci-dessus, nous avons utilisé des vecteurs @@ -769,7 +769,7 @@ Et si nous voulions répéter les valeurs 1, 2 et 3 cinq fois, mais que obtenait cinq 1, cinq 2 et cinq 3 dans cet ordre ? Il existe deux possibilités - voir `?rep` ou `?sort` pour obtenir de l'aide. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -780,7 +780,7 @@ sort(rep(c(1, 2, 3), 5)) ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### Génération de séquence @@ -789,7 +789,7 @@ nombres. 
Par exemple, pour générer une séquence d'entiers de 1 à 20 par pas de 2, on utiliserait : ```{r, purl=TRUE} -séq(de = 1, à = 20, par = 2) +seq(from = 1, to = 20, by = 2) ``` La valeur par défaut de `by` est 1 et, étant donné que la génération d'une séquence @@ -798,7 +798,7 @@ il existe un raccourci : ```{r, purl=TRUE} seq(1, 5, 1) -seq(1, 5) ## par défaut par +seq(1, 5) ## default by 1:5 ``` @@ -806,7 +806,7 @@ Pour générer une séquence de nombres de 1 à 20 de longueur finale de 3, on utiliserait : ```{r, purl=TRUE} -seq (de = 1, à = 20, longueur.out = 3) +seq(from = 1, to = 20, length.out = 3) ``` ### Échantillons aléatoires et permutations @@ -818,7 +818,7 @@ un autre vecteur. Par exemple, pour tirer au sort un ordre aléatoire de 10 étu en fonction de l'ordre alphabétique de son nom) puis : ```{r, purl=TRUE} -échantillon (1:10) +sample(1:10) ``` Sans autres arguments, `sample` renverra une permutation de tous les @@ -827,7 +827,7 @@ définirais cette valeur comme deuxième argument. Ci-dessous, j'échantillonne lettres aléatoires de l'alphabet contenu dans le vecteur `letters` prédéfini : ```{r, purl=TRUE} -échantillon(lettres, 5) +sample(letters, 5) ``` Si je voulais une sortie plus grande que le vecteur d'entrée, ou pouvoir @@ -835,7 +835,7 @@ dessiner certains éléments plusieurs fois, je devrais définir l'argument `rep sur `TRUE` : ```{r, purl=TRUE} -échantillon (1:5, 10, remplacer = VRAI) +sample(1:5, 10, replace = TRUE) ``` ::::::::::::::::::::::::::::::::::::::: challenge @@ -857,33 +857,33 @@ Définissez maintenant la graine avec, par exemple, `set.seed(123)` et répétez Répétez en définissant une graine différente. 
-::::::::::::::: solution
+::::::::::::::: solution
 
 ## Solution
 
 Différentes permutations
 
 ```{r, purl=TRUE}
-échantillon (1:10)
-échantillon (1:10)
+sample(1:10)
+sample(1:10)
 ```
 
 Mêmes permutations avec la graine 123
 
 ```{r, purl=TRUE}
 set.seed(123)
-échantillon(1:10)
+sample(1:10)
 set.seed(123)
-échantillon(1:10)
+sample(1:10)
 ```
 
 Une graine différente
 
 ```{r, purl=TRUE}
 set.seed(1)
-échantillon(1:10)
+sample(1:10)
 set.seed(1)
-échantillon(1:10)
+sample(1:10)
 ```
 
 :::::::::::::::::::::::::
@@ -900,7 +900,7 @@ _N(100, 5)_, sont présentées ci-dessous.
 ```{r, echo=FALSE, fig.width=12, fig.height=6, fig.cap="Two normal distributions: *N(0, 1)* on the left and *N(100, 5)* on the right."}
 par(mfrow = c(1, 2))
 plot(density(rnorm(1000)), main = "", sub = "N(0, 1)")
-plot(densité (rnorm(1000, 100, 5)), principal = "", sous = "N(100, 5)")
+plot(density(rnorm(1000, 100, 5)), main = "", sub = "N(100, 5)")
 ```
 
 Les trois arguments, `n`, `mean` et `sd`, définissent la taille de l'échantillon
From 416c94758ff5128dfb6aca251d71bea9a95ddf10 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Mon, 19 Aug 2024 16:15:41 +0900
Subject: [PATCH 298/334] New translations 10-data-organisation.md (French)

---
 locale/fr/episodes/10-data-organisation.Rmd | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd
index 0ba2fbaf4..89936f4f6 100644
--- a/locale/fr/episodes/10-data-organisation.Rmd
+++ b/locale/fr/episodes/10-data-organisation.Rmd
@@ -768,11 +768,11 @@ ci-dessus en mettant les valeurs entre guillemets (""). En appliquant cette règ
En appliquant cette règ pourraient ressembler à ceci : ``` -id_espèce, genre, espèce, taxons -"AB", "Amphispiza", "bilineata", "Oiseau" -"AH", "Ammospermophilus", "harrisi", "Rongeur, non recensé" -"AS", "Ammodramus", "savannarum", "Oiseau" -"BA", "Baiomys", "taylori", "Rongeur" +species_id,genus,species,taxa +"AB","Amphispiza","bilineata","Bird" +"AH","Ammospermophilus","harrisi","Rodent, not censused" +"AS","Ammodramus","savannarum","Bird" +"BA","Baiomys","taylori","Rodent" ``` Désormais, l'ouverture de ce fichier en tant que « csv » dans Excel n'entraînera pas une colonne @@ -829,4 +829,4 @@ without having to look at it and/or fix it. - Une bonne organisation des données est la base de tout projet de recherche. -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From 81265cc38e8b1f6f426bd3fe99badef511d0bd42 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 17:17:45 +0900 Subject: [PATCH 299/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 108 ++++++++++++++++---------------- 1 file changed, 54 insertions(+), 54 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index f3695c247..737ccf904 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -17,17 +17,17 @@ exercises: 75 comment remodeler un bloc de données d'un format à l'autre. - Montrez comment joindre des tables. 
-:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions - Analyse de données dans R à l'aide du méta-paquet Tidyverse -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) -download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/ data/rnaseq.csv", +download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") ``` @@ -73,8 +73,8 @@ Si vous avez effectué la configuration, vous devriez déjà avoir installé le Vérifiez si vous l'avez en essayant de le charger depuis la bibliothèque : ```{r, message=FALSE, purl=TRUE} -## chargez les packages Tidyverse, incl. dplyr -bibliothèque("tidyverse") +## load the tidyverse packages, incl. dplyr +library("tidyverse") ``` Si vous recevez un message d'erreur `il n'y a pas de package appelé 'tidyverse'` alors vous n'avez pas @@ -95,7 +95,7 @@ function (notice the `_` instead of the `.`), from the tidyverse package ```{r, message=FALSE, purl=TRUE} rna <- read_csv("data/rnaseq.csv") -## afficher les données +## view the data rna ``` @@ -128,14 +128,14 @@ de cette fonction est la trame de données (`rna`), et les arguments suivants sont les colonnes à conserver. ```{r, purl=TRUE} -sélectionner (ARN, gène, échantillon, tissu, expression) +select(rna, gene, sample, tissue, expression) ``` Pour sélectionner toutes les colonnes _sauf_ certaines, mettez un "-" devant la variable pour l'exclure. ```{r, purl=TRUE} -sélectionner (arn, -tissu, -organisme) +select(rna, -tissue, -organism) ``` Cela sélectionnera toutes les variables de `rna` sauf `tissu` @@ -144,7 +144,7 @@ et `organism`. 
Pour choisir des lignes en fonction d'un critère spécifique, utilisez `filter()` : ```{r, purl=TRUE} -filter(arn, sex == "Male") +filter(rna, sex == "Male") filter(rna, sex == "Male" & infection == "NonInfected") ``` @@ -156,8 +156,8 @@ allons créer un nouveau tableau contenant uniquement les 2 colonnes `gene` et `hsapiens_homolog_associated_gene_name`. ```{r} -gènes <- select(arn, gene, hsapiens_homolog_associated_gene_name) -gènes +genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) +genes ``` Certains gènes de souris n'ont pas d'homologues humains. Ceux-ci peuvent être récupérés en utilisant @@ -165,7 +165,7 @@ Certains gènes de souris n'ont pas d'homologues humains. Ceux-ci peuvent être quelque chose est un `NA`. ```{r, purl=TRUE} -filtre (gènes, is.na (hsapiens_homolog_associated_gene_name)) +filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` Si on veut conserver uniquement les gènes de souris qui ont un homologue humain, on peut @@ -174,7 +174,7 @@ chaque ligne où hsapiens\_homolog\_associated\_gene\_name _n'est pas_ un `NA`. ```{r, purl=TRUE} -filtre(gènes, !is.na(hsapiens_homolog_associated_gene_name)) +filter(genes, !is.na(hsapiens_homolog_associated_gene_name)) ``` ## Tuyaux @@ -187,7 +187,7 @@ comme entrée de la fonction suivante, comme ceci : ```{r, purl=TRUE} rna2 <- filter(rna, sex == "Male") -rna3 <- select(rna2, gène, échantillon, tissu, expression) +rna3 <- select(rna2, gene, sample, tissue, expression) rna3 ``` @@ -199,7 +199,7 @@ Vous pouvez également imbriquer des fonctions (c'est-à-dire une fonction dans comme ceci : ```{r, purl=TRUE} -rna3 <- select(filter(rna, sex == "Male"), gène, échantillon, tissu, expression) +rna3 <- select(filter(rna, sex == "Male"), gene, sample, tissue, expression) rna3 ``` @@ -227,9 +227,9 @@ inclure explicitement le bloc de données comme un argument pour les fonctions ` `select()`. 
```{r, purl=TRUE} -arn %>% +rna %>% filter(sex == "Male") %>% - select(gène, échantillon, tissu, expression) + select(gene, sample, tissue, expression) ``` Certains trouveront peut-être utile de lire le tube comme le mot « alors ». Par exemple, @@ -247,12 +247,12 @@ lui attribuer un nouveau nom : ```{r, purl=TRUE} rna3 <- rna %>% filter(sex == "Male") %>% - select(gène, échantillon, tissu, expression) + select(gene, sample, tissue, expression) rna3 ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -260,21 +260,21 @@ rna3 où le gène a une expression supérieure à 50 000, et ne conservez que les colonnes `gene`, `sample `, `time`, `expression` et `age`. -::::::::::::::: solution +::::::::::::::: solution ## Solution ```{r} -arn %>% - filtre(expression > 50000, - sexe == "Femme", - temps == 0 ) %>% - select(gène, échantillon , heure, expression, âge) +rna %>% + filter(expression > 50000, + sex == "Female", + time == 0 ) %>% + select(gene, sample, time, expression, age) ``` ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Subir une mutation @@ -286,7 +286,7 @@ Pour créer une nouvelle colonne de temps en heures : ```{r, purl=TRUE} rna %>% - muter(time_hours = time * 24) %>% + mutate(time_hours = time * 24) %>% select(time, time_hours) ``` @@ -294,12 +294,12 @@ Vous pouvez également créer une deuxième nouvelle colonne basée sur la premi ```{r, purl=TRUE} rna %>% - muter(time_hours = time * 24, + mutate(time_hours = time * 24, time_mn = time_hours * 60) %>% select(time, time_hours, time_mn) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -313,7 +313,7 @@ contenir uniquement des gènes situés sur les chromosomes sexuels, associés à **Astuce** : réfléchissez à la façon dont les commandes doivent être ordonnées pour produire ce bloc de données ! 
-::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -328,7 +328,7 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Analyse de données fractionnée-appliquée-combinée @@ -338,8 +338,8 @@ _split-apply-combine_ : divisez les données en groupes, appliquez une analyse rend cela très facile grâce à l'utilisation de la fonction `group_by()`. ```{r} -arn %>% - group_by(gène) +rna %>% + group_by(gene) ``` La fonction `group_by()` n'effectue aucun traitement de données, elle @@ -350,8 +350,8 @@ regroupe les données en sous-ensembles : dans l'exemple ci-dessus, notre On pourrait de même décider de regrouper les tibbles par échantillons : ```{r} -arn %>% - group_by(échantillon) +rna %>% + group_by(sample) ``` Ici, notre `tibble` initial d'observations `r nrow(rna)` est divisé en groupes @@ -370,9 +370,9 @@ réduit chaque groupe en un résumé sur une seule ligne de ce groupe. . Donc, pour calculer l'expression moyenne par gène : ```{r} -arn %>% +rna %>% group_by(gene) %>% - résumé(mean_expression = moyenne(expression)) + summarise(mean_expression = mean(expression)) ``` Nous pourrions également vouloir calculer les niveaux d’expression moyens de tous les gènes dans chaque échantillon : @@ -380,15 +380,15 @@ Nous pourrions également vouloir calculer les niveaux d’expression moyens de ```{r} rna %>% group_by(sample) %>% - résumé(mean_expression = moyenne(expression)) + summarise(mean_expression = mean(expression)) ``` Mais on peut aussi regrouper par plusieurs colonnes : ```{r} -arn %>% - group_by(gène, infection, temps) %>% - résumé(mean_expression = moyenne(expression)) +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression)) ``` Une fois les données regroupées, vous pouvez également résumer plusieurs variables en même temps @@ -396,19 +396,19 @@ Une fois les données regroupées, vous pouvez également résumer plusieurs var indiquant 
l'expression médiane par gène et par condition : ```{r, purl=TRUE} -arn %>% - group_by(gène, infection, temps) %>% - résumé(mean_expression = moyenne(expression), - médiane_expression = médiane(expression)) +rna %>% + group_by(gene, infection, time) %>% + summarise(mean_expression = mean(expression), + median_expression = median(expression)) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi Calculer le niveau d’expression moyen du gène « Dok3 » par points temporels. -::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -416,7 +416,7 @@ Calculer le niveau d’expression moyen du gène « Dok3 » par points temporels rna %>% filter(gene == "Dok3") %>% group_by(time) %>% - summarise(mean = moyenne(expression)) + summarise(mean = mean(expression)) ``` ::::::::::::::::::::::::: @@ -487,7 +487,7 @@ arn %>% arrange(desc(n)) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -715,7 +715,7 @@ wide_with_NA %>% Passer à des formats plus larges et plus longs peut être un moyen utile d'équilibrer un ensemble de données afin que chaque réplique ait la même composition. 
-::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Question @@ -741,7 +741,7 @@ pivot_longer(names_to = "mouse_id", valeurs_to = "counts", -gene) :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Question @@ -798,7 +798,7 @@ rna_1 %>% :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Question @@ -870,7 +870,7 @@ rna %>% :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Question @@ -989,7 +989,7 @@ full_join(rna_mini, annot2, by = c("gene" = "external_gene_name")) Comme on peut le voir ci-dessus, le nom de variable de la première table est conservé dans celle jointe. -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: From 564a92ae9df1da0735219d066981d95ac98ccc8c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 18:33:23 +0900 Subject: [PATCH 300/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 44 ++++++++++++++++----------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index 737ccf904..b1a0d6424 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -421,7 +421,7 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ### Compte @@ -431,16 +431,16 @@ pour chaque facteur ou combinaison de facteurs. 
Pour cette tâche, **`dplyr`** f chaque échantillon infecté et non infecté, nous ferions : ```{r, purl=TRUE} -arn %>% - nombre (infection) +rna %>% + count(infection) ``` La fonction `count()` est un raccourci pour quelque chose que nous avons déjà vu : regrouper par une variable et le résumer en comptant le nombre d'observations dans ce groupe. En d'autres termes, « rna %>% count(infection) » équivaut à : ```{r, purl=TRUE} -arn %>% +rna %>% group_by(infection) %>% - résumé(n = n()) + summarise(n = n()) ``` L'exemple précédent montre l'utilisation de `count()` pour compter le nombre de lignes/observations @@ -449,8 +449,8 @@ Si nous voulions compter une _combinaison de facteurs_, telle que `infection` et nous spécifierions le premier et le deuxième facteur comme arguments de `count()` : ```{r, purl=TRUE} -arn %>% - nombre (infection, temps) +rna %>% + count(infection, time) ``` ce qui équivaut à ceci : @@ -458,7 +458,7 @@ ce qui équivaut à ceci : ```{r, purl=TRUE} rna %>% group_by(infection, time) %>% - résumé(n = n()) + summarise(n = n()) ``` Il est parfois utile de trier le résultat pour faciliter les comparaisons. @@ -466,23 +466,23 @@ Nous pouvons utiliser `arrange()` pour trier le tableau. Par exemple, nous pourrions vouloir organiser le tableau ci-dessus par heure : ```{r, purl=TRUE} -arn %>% - compter (infection, temps) %>% - organiser (temps) +rna %>% + count(infection, time) %>% + arrange(time) ``` ou par comptages : ```{r, purl=TRUE} -arn %>% +rna %>% count(infection, time) %>% - arranger(n) + arrange(n) ``` Pour trier par ordre décroissant, nous devons ajouter la fonction `desc()` : ```{r, purl=TRUE} -arn %>% +rna %>% count(infection, time) %>% arrange(desc(n)) ``` @@ -496,7 +496,7 @@ arn %>% 3. Choisissez un échantillon et évaluez le nombre de gènes par biotype. 4. Identifiez les gènes associés à la description du phénotype « méthylation anormale de l’ADN » et calculez leur expression moyenne (en log) au temps 0, au temps 4 et au temps 8. 
-::::::::::::::: solution +::::::::::::::: solution ## Solution @@ -524,7 +524,7 @@ rna %>% ::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: ## Remodeler les données @@ -536,8 +536,8 @@ l'échantillon (organisme, âge, sexe, ...) soit le gène (gène\_biotype, ENTRE Les variables qui ne changent pas avec les gènes ou avec les échantillons auront la même valeur dans toutes les lignes. ```{r} -arn %>% - arranger(gène) +rna %>% + arrange(gene) ``` Cette structure est appelée « format long », car une colonne contient toutes les valeurs, @@ -554,9 +554,9 @@ entre les échantillons. ```{r, echo=FALSE} rna %>% - select(gène, échantillon, expression) %>% - pivot_wider(names_from = échantillon, - valeurs_from = expression) + select(gene, sample, expression) %>% + pivot_wider(names_from = sample, + values_from = expression) ``` Pour convertir les valeurs d'expression génique de `rna` en un format large, @@ -583,7 +583,7 @@ pour transformer les données en grand format. ```{r, purl=TRUE} rna_exp <- rna %>% - select(gène, échantillon, expression) + select(gene, sample, expression) rna_exp ``` From 15159cf7043b7a04161015b2c729ba2afed03f1d Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 22:07:11 +0900 Subject: [PATCH 301/334] New translations setup.md (French) --- locale/fr/learners/setup.md | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/locale/fr/learners/setup.md b/locale/fr/learners/setup.md index e21e15941..539c39dde 100644 --- a/locale/fr/learners/setup.md +++ b/locale/fr/learners/setup.md @@ -8,14 +8,10 @@ title: Setup ### R et RStudio -- R and RStudio are separate downloads and installations. R is the - underlying statistical computing environment, but using R alone is - no fun. RStudio is a graphical integrated development environment - (IDE) that makes using R much easier and more interactive. 
You need - to install R before you install RStudio. After installing both - programs, you will need to install some specific R packages within - RStudio. Follow the instructions below for your operating system, - and then follow the instructions to install packages. +- R et RStudio sont des programmes a télécharger separemment et demandent des installations distincts. R est l'environnement de calcul statistique sous-jacent, mais utiliser R seul peut être pénible. RStudio est un environnement de développement graphique intégré + (IDE) qui rend l'utilisation de R beaucoup plus simple et plus interactive. Vous avez besoin d' installer R avant d'installer RStudio. Après avoir installé les deux programmes, vous devrez installer des paquets R spécifiques depuis + RStudio. Suivez les instructions ci-dessous pour votre système d'exploitation, + puis suivez les instructions pour installer des paquets. ### You are running Windows @@ -23,19 +19,18 @@ title: Setup ::::::::::::::: solution -## If you already have R and RStudio installed +## Si vous avez déjà installé R et RStudio -- Open RStudio, and click on "Help" > "Check for updates". If a new version is - available, quit RStudio, and download the latest version for RStudio. +- Ouvrez RStudio et cliquez sur « Aide » > « Rechercher les mises à jour ». Si une nouvelle version est + disponible, quittez RStudio et téléchargez la dernière version de RStudio. -- To check which version of R you are using, start RStudio and the first thing - that appears in the console indicates the version of R you are - running. Alternatively, you can type `sessionInfo()`, which will also display - which version of R you are running. Go on - the [CRAN website](https://cran.r-project.org/bin/windows/base/) and check - whether a more recent version is available. If so, please download and install - it. 
You can [check here](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) for - more information on how to remove old versions from your system if you wish to do so. +- Pour vérifier quelle version de R vous utilisez, démarrez RStudio et la première chose + qui apparaît dans la console indique la version de R que vous + exécutez. Alternativement, vous pouvez taper `sessionInfo()`, qui affichera également + quelle version de R est installée. Allez sur + le [site Web du CRAN](https://cran.r-project.org/bin/windows/base/) et vérifiez + si une version plus récente est disponible. Si c'est le cas, veuillez le télécharger et l'installer. Vous pouvez [consulter ici](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) pour + plus d'informations sur la façon de supprimer les anciennes versions de votre système si vous souhaitez le faire. - Follow the steps in the instructions [for everyone](#for-everyone) at the bottom of this page. From 170c2b00eaee991ce8ec4c7764d3f5ab09ead647 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 23:09:18 +0900 Subject: [PATCH 302/334] New translations 10-data-organisation.md (French) --- locale/fr/episodes/10-data-organisation.Rmd | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/locale/fr/episodes/10-data-organisation.Rmd b/locale/fr/episodes/10-data-organisation.Rmd index 89936f4f6..6500c51f6 100644 --- a/locale/fr/episodes/10-data-organisation.Rmd +++ b/locale/fr/episodes/10-data-organisation.Rmd @@ -118,11 +118,10 @@ utiliser des tableurs pour bien plus que la saisie de données. Nous les utiliso pour créer des tableaux de données pour les publications, pour générer des statistiques récapitulatives et réaliser des chiffres. 
-Generating tables for publications in a spreadsheet is not -optimal - often, when formatting a data table for publication, we're -reporting key summary statistics in a way that is not really meant to -be read as data, and often involves special formatting -(merging cells, creating borders, making it pretty). Nous vous conseillons de +Générer des tableaux pour des publications dans une feuille de calcul n'est pas +optimal - souvent, lors du formatage d'un tableau de données, nous +rapportons les principales statistiques récapitulatives d'une manière qui n'est pas vraiment destinée à être lue comme des données, et implique souvent un formatage spécial +(fusion de cellules, création de bordures, préférences esthétiques pour les couleurs, etc.). Nous vous conseillons de effectuer ce genre d'opération au sein de votre logiciel d'édition de documents. Ces deux dernières applications, génératrices de statistiques et de chiffres, doivent @@ -414,7 +413,7 @@ Mais qu’en est-il des onglets du classeur ? Cela semble être un moyen simple , n'est-ce pas ? Eh bien, oui et non. Lorsque vous créez des onglets supplémentaires, vous ne parvenez pas à permettre à l'ordinateur de voir les connexions dans les données qui s'y trouvent (vous devez introduire des fonctions spécifiques à l'application de feuille de calcul ou -des scripts pour garantir cette connexion). Supposons, par exemple, que vous créiez un +des scripts pour garantir cette connexion). Supposons, par exemple, que vous aillez créé un onglet séparé pour chaque jour où vous prenez une mesure. Ce n'est pas une bonne pratique pour deux raisons : @@ -423,10 +422,7 @@ Ce n'est pas une bonne pratique pour deux raisons : si à chaque fois que vous prenez une mesure, vous commencez à enregistrer les données dans un nouvel onglet, et -2. 
even if you manage to prevent all inconsistencies from creeping in, - you will add an extra step for yourself before you analyse the data - because you will have to combine these data into a single - datatable. Vous devrez indiquer explicitement à l'ordinateur comment +2. même si vous parvenez à empêcher toute incohérence, vous vous ajoutez une étape supplémentaire avant même d'analyser les données car vous devrez combiner ces données en une seule table de données. Vous devrez indiquer explicitement à l'ordinateur comment combiner les onglets - et si les onglets ne sont pas formatés de manière cohérente, vous devrez peut-être même le faire manuellement. From 22920c41e7851243bebaee15c91720044e60e64e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Mon, 19 Aug 2024 23:09:20 +0900 Subject: [PATCH 303/334] New translations setup.md (French) --- locale/fr/learners/setup.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/locale/fr/learners/setup.md b/locale/fr/learners/setup.md index 539c39dde..ee72394a9 100644 --- a/locale/fr/learners/setup.md +++ b/locale/fr/learners/setup.md @@ -32,8 +32,7 @@ title: Setup si une version plus récente est disponible. Si c'est le cas, veuillez le télécharger et l'installer. Vous pouvez [consulter ici](https://cran.r-project.org/bin/windows/base/rw-FAQ.html#How-do-I-UNinstall-R_003f) pour plus d'informations sur la façon de supprimer les anciennes versions de votre système si vous souhaitez le faire. -- Follow the steps in the instructions [for everyone](#for-everyone) at the - bottom of this page. +- Suivez les étapes décrites dans les instructions [pour tout le monde](#pour-tout-le-monde) en bas de cette page. 
::::::::::::::::::::::::: From e4ba29791984171313af93025e6111a611866b95 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 24 Aug 2024 16:08:10 +0900 Subject: [PATCH 304/334] New translations 10-data-organisation.md (Chinese Simplified) --- locale/zh/episodes/10-data-organisation.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/locale/zh/episodes/10-data-organisation.Rmd b/locale/zh/episodes/10-data-organisation.Rmd index 3080147ee..8f8ef438a 100644 --- a/locale/zh/episodes/10-data-organisation.Rmd +++ b/locale/zh/episodes/10-data-organisation.Rmd @@ -1,8 +1,8 @@ --- -source: 放射科 +source: Rmd title: 使用电子表格组织数据 -teaching: 三十 -exercises: 三十 +teaching: 30 +exercises: 30 --- ```{r, include=FALSE} From 894625da6bd5690d15437ff80541962bb8fa4fbc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 24 Aug 2024 17:54:02 +0900 Subject: [PATCH 305/334] New translations 20-r-rstudio.md (French) --- locale/fr/episodes/20-r-rstudio.Rmd | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/locale/fr/episodes/20-r-rstudio.Rmd b/locale/fr/episodes/20-r-rstudio.Rmd index 8c4aeb061..0afe1cc01 100644 --- a/locale/fr/episodes/20-r-rstudio.Rmd +++ b/locale/fr/episodes/20-r-rstudio.Rmd @@ -15,13 +15,13 @@ exercises: 0 - Utilisez l'interface d'aide intégrée de RStudio pour rechercher plus d'informations sur les fonctions R. - Montrez comment fournir suffisamment d’informations pour le dépannage avec la communauté des utilisateurs R. -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions - Que sont R et RStudio ? -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: > Cet épisode est basé sur la leçon _Analyse des données et > Visualisation dans R pour les écologistes_ de Data Carpentries. 
@@ -260,7 +260,7 @@ créez un dossier nommé « données » dans votre répertoire de travail nouvel (par exemple, « ~/bioc -intro/données`). (Vous pouvez également taper `dir.create("data")`sur votre console R.) Répétez ces opérations pour créer un dossier`data_output/`et un`fig_output\`. -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: Nous allons conserver le script à la racine de notre répertoire de travail car nous n'allons utiliser qu'un seul fichier et cela rendra les choses @@ -453,7 +453,7 @@ pouvez taper : Si vous avez juste besoin de vous rappeler les noms des arguments, vous pouvez utiliser : ```{r, eval=FALSE, purl=TRUE} -arguments(lm) +args(lm) ``` ### Je veux utiliser une fonction qui fait X, il doit y avoir une fonction pour ça mais je ne sais pas laquelle... @@ -527,8 +527,8 @@ pouvez utiliser la fonction `dput()`. Il produira du code R qui peut être utili pour recréer exactement le même objet que celui en mémoire : ```{r, results="show", purl=TRUE} -## iris est un exemple de bloc de données fourni avec R et head() est une -## fonction qui renvoie la première partie du bloc de données +## iris is an example data frame that comes with R and head() is a +## function that returns the first part of the data frame dput(head(iris)) ``` @@ -625,7 +625,7 @@ d'abord le charger pour pouvoir l'utiliser . Cela se fait avec la fonction `library()`. Ci-dessous, nous chargeons `ggplot2`. 
```{r loadp, eval=FALSE, purl=TRUE} -bibliothèque("ggplot2") +library("ggplot2") ``` ### Installation des packages @@ -664,4 +664,4 @@ Par défaut, `BiocManager::install()` vérifiera également tous vos packages in - Commencez à utiliser R et RStudio -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From f0ad11f7719bfcbd6971373d7f3af27a40ec22a9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 24 Aug 2024 17:54:10 +0900 Subject: [PATCH 306/334] New translations 20-r-rstudio.md (Chinese Simplified) --- locale/zh/episodes/20-r-rstudio.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/zh/episodes/20-r-rstudio.Rmd b/locale/zh/episodes/20-r-rstudio.Rmd index e6c576f75..0d73d32fa 100644 --- a/locale/zh/episodes/20-r-rstudio.Rmd +++ b/locale/zh/episodes/20-r-rstudio.Rmd @@ -1,7 +1,7 @@ --- -source: 放射科 +source: Rmd title: R 和 RStudio -teaching: 三十 +teaching: 30 exercises: 0 --- @@ -15,13 +15,13 @@ exercises: 0 - 使用内置的 RStudio 帮助界面搜索有关 R 函数的更多信息。 - 演示如何向 R 用户社区提供足够的信息以进行故障排除。 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions - 什么是 R 和 RStudio? 
-::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: > 本集基于 Data Carpentries 的_面向生态学家的 R 语言数据分析和 > 可视化_课程。 @@ -260,7 +260,7 @@ CSV 文件时,我们将使用 `data_output/`,以及 `fig_output/` 文件夹 处输入 `dir.create("data")`。) 重复这些操作来创建“data_output/”和 “fig_output”文件夹。 -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: 我们将把脚本保存在工作目录 的根目录中,因为我们只使用一个文件,这将使事情 @@ -664,4 +664,4 @@ BiocManager::install("DESeq2") - 开始使用 R 和 RStudio -::::::::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::::::::::::::::::::::::::::: From e952f21cc21b1d405b06593fe3b315cb23985383 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sun, 8 Sep 2024 01:33:40 +0900 Subject: [PATCH 307/334] Update 30-dplyr.Rmd --- locale/ja/episodes/30-dplyr.Rmd | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 839936438..3496f2da6 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -620,17 +620,16 @@ pivot_wider\\`は主に3つの引数を取る: 3. value_from\\`: 新しいカラム を埋める値。 -\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") - -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` デフォルトでは、`pivot_wider()` 関数は欠損値に対して `NA` を追加することに注意してください。 @@ -683,9 +682,8 @@ pivot_longer()\\`は主に4つの引数を取る: 4. 
変数 `names_to` と `values_to` に格納する(または削除する)列の名前。 -\`\`{r, fig.cap="`rna`データのロングピボット。", echo=FALSE, message=FALSE} +```{r, fig.cap="`rna`データのロングピボット。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") - ``` rna_wide`から`rna_long`を再作成するには、 @@ -695,7 +693,7 @@ rna_wide`から`rna_long`を再作成するには、 ここで、新しい変数名がどのように引用されるかに注目してください。 -{r} +```{r} rna_long<- rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", @@ -780,7 +778,7 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ## 質問 -rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\\` を +`rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\\` を 行、各染色体に位置する遺伝子の平均発現量を値として、 以下のようにデータフレームを広げる: From 3c21109dd6052258cb4f53d884e4deb4d2843288 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:19 +0900 Subject: [PATCH 308/334] New translations 25-starting-with-data.md (French) --- locale/fr/episodes/25-starting-with-data.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/fr/episodes/25-starting-with-data.Rmd b/locale/fr/episodes/25-starting-with-data.Rmd index a1851153f..2053f8d91 100644 --- a/locale/fr/episodes/25-starting-with-data.Rmd +++ b/locale/fr/episodes/25-starting-with-data.Rmd @@ -188,7 +188,7 @@ contenu/structure des données. Essayons-les ! Remarque : la plupart de ces fonctions sont "génériques", elles peuvent être utilisées sur d'autres types d'objets en plus de `data.frame`. -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -259,7 +259,7 @@ rna$gene # Result is a vector Dans RStudio, vous pouvez utiliser la fonction de saisie semi-automatique pour obtenir les noms complets et corrects des colonnes. 
-::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -440,7 +440,7 @@ niveaux(sexe) <- c("Homme", "Femme") :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -468,7 +468,7 @@ animal_data <- data.frame( :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -543,7 +543,7 @@ m [^ncol]: Soit le nombre de lignes, soit le nombre de colonnes sont suffisants, l'autre pouvant être déduit de la longueur des valeurs. Essayez ce qui se passe si les valeurs et le nombre de lignes/colonnes ne s'additionnent pas. -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: @@ -577,7 +577,7 @@ Il est souvent utile de créer de grandes matrices de données aléatoires comme tirées d'une distribution normale de moyenne 0 et d'écart type 1, ce qui peut être fait avec la fonction `rnorm()`. -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi: From bcf2bae39d42ea2520fd79c8edcb45feb8e4aaa2 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:26 +0900 Subject: [PATCH 309/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index b1a0d6424..7389714a2 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -595,17 +595,17 @@ rna_exp 3. les `values_from` : la colonne dont les valeurs rempliront les nouvelles colonnes . 
-\`\`\`{r, fig.cap="Grand pivot des données `rna`.", echo=FALSE, message=FALSE} +```{r, fig.cap="Grand pivot des données `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` Notez que par défaut, la fonction `pivot_wider()` ajoutera `NA` pour les valeurs manquantes. @@ -655,10 +655,10 @@ associées aux noms de colonnes. 4. les noms des colonnes à utiliser pour renseigner les variables `names_to` et `values_to` (ou à supprimer). -\`\`\`{r, fig.cap="Pivot long des données `rna`.", echo=FALSE, message=FALSE} +```{r, fig.cap="Pivot long des données `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") -```` +``` To recreate `rna_long` from `rna_wide` we would create a key called `sample` and value called `expression` and use all columns @@ -673,7 +673,7 @@ rna_long <- rna_wide %>% values_to = "expression", -gene) rna_long -```` +``` Nous aurions également pu utiliser une spécification indiquant les colonnes à inclure. Cela peut être utile si vous disposez d'un grand nombre de colonnes d'identification From 1f4bd448d561dc3b5a492e2fe3915ff3807cc253 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:29 +0900 Subject: [PATCH 310/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index b497761d9..44dd19727 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -595,17 +595,17 @@ rna_exp 3. `values_from`: la columna cuyos valores llenarán las nuevas columnas . 
-\`\`\`{r, fig.cap="Pivote amplio de los datos `rna`.", echo=FALSE, message=FALSE} +```{r, fig.cap="Pivote amplio de los datos `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` Tenga en cuenta que, de forma predeterminada, la función `pivot_wider()` agregará `NA` para los valores faltantes. @@ -655,10 +655,10 @@ asociados con los nombres de las columnas. 4. los nombres de las columnas que se utilizarán para completar las variables `names_to` y `values_to` (o para eliminar). -\`\`\`{r, fig.cap="Pivote largo de los datos `rna`.", echo=FALSE, message=FALSE} +```{r, fig.cap="Pivote largo de los datos `rna`.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") -```` +``` To recreate `rna_long` from `rna_wide` we would create a key called `sample` and value called `expression` and use all columns @@ -673,7 +673,7 @@ rna_long <- rna_wide %>% values_to = "expression", -gene) rna_long -```` +``` También podríamos haber usado una especificación sobre qué columnas incluir . 
Esto puede ser útil si tiene una gran cantidad de columnas de identificación From 6a7c2677e58296d1a7b06fed49340a5910099c12 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:31 +0900 Subject: [PATCH 311/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 3496f2da6..7125c9e64 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -622,6 +622,7 @@ pivot_wider\\`は主に3つの引数を取る: ```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") + ``` ```{r, purl=TRUE} @@ -684,6 +685,7 @@ pivot_longer()\\`は主に4つの引数を取る: ```{r, fig.cap="`rna`データのロングピボット。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") + ``` rna_wide`から`rna_long`を再作成するには、 @@ -693,7 +695,7 @@ rna_wide`から`rna_long`を再作成するには、 ここで、新しい変数名がどのように引用されるかに注目してください。 -```{r} +{r} rna_long<- rna_wide %>% pivot_longer(names_to = "sample", values_to = "expression", @@ -778,7 +780,7 @@ pivot_longer(names_to = "mouse_id", values_to = "counts", -gene) ## 質問 -`rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\\` を +rna`データフレームから X 染色体と Y 染色体に位置する遺伝子をサブセットし、`sex` を列、`chromosome_name\\` を 行、各染色体に位置する遺伝子の平均発現量を値として、 以下のようにデータフレームを広げる: From a70fd96f5628536355442880f870cd3a2f8e7365 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:33 +0900 Subject: [PATCH 312/334] New translations 30-dplyr.md (Portuguese) --- locale/pt/episodes/30-dplyr.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/pt/episodes/30-dplyr.Rmd b/locale/pt/episodes/30-dplyr.Rmd index b50395a63..6b6f2e585 100644 --- a/locale/pt/episodes/30-dplyr.Rmd +++ b/locale/pt/episodes/30-dplyr.Rmd @@ -595,17 +595,17 @@ rna_exp 3. 
the `values_from`: the column whose values will fill the new columns. -\`\`\`{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} +```{r, fig.cap="Wide pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` Note that by default, the `pivot_wider()` function will add `NA` for missing values. @@ -655,10 +655,10 @@ associated with the column names. 4. the names of the columns to be used to populate the `names_to` and `values_to` variables (or to drop). -\`\`\`{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} +```{r, fig.cap="Long pivot of the `rna` data.", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") -```` +``` To recreate `rna_long` from `rna_wide` we would create a key called `sample` and value called `expression` and use all columns @@ -673,7 +673,7 @@ rna_long <- rna_wide %>% values_to = "expression", -gene) rna_long -```` +``` We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying From 688dd5245f69bd746378906e2c38ad3ea6791207 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:36 +0900 Subject: [PATCH 313/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 9cd50e5fc..64c763077 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -595,17 +595,17 @@ rna_exp 3. 
`values_from`:其值将填充新的 列的列。 -\`\`\`{r, fig.cap="`rna` 数据的宽枢轴。", echo=FALSE, message=FALSE} +```{r, fig.cap="`rna` 数据的宽枢轴。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_wider.png") -```` +``` ```{r, purl=TRUE} rna_wide <- rna_exp %>% pivot_wider(names_from = sample, values_from = expression) rna_wide -```` +``` 请注意,默认情况下,“pivot_wider()”函数将为缺失值添加“NA”。 @@ -655,10 +655,10 @@ rna_with_missing_values %>% 4. 用于填充“names_to”和 “values_to”变量(或删除)的列的名称。 -\`\`\`{r, fig.cap="`rna` 数据的长枢轴。", echo=FALSE, message=FALSE} +```{r, fig.cap="`rna` 数据的长枢轴。", echo=FALSE, message=FALSE} knitr::include_graphics("fig/pivot_longer.png") -```` +``` 要从 `rna_wide` 重新创建 `rna_long`,我们需要创建一个名为 `sample` 的键 和一个名为 `expression` 的值,并使用除 `gene` 之外的所有列 @@ -673,7 +673,7 @@ rna_long <- rna_wide %>% values_to = "expression", -gene) rna_long -```` +``` 我们还可以使用规范来指定 包含哪些列。 如果您有大量可识别的 From 448a7d70b39359d47c5d0b09e70734e43184e3e1 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:38 +0900 Subject: [PATCH 314/334] New translations 40-visualization.md (French) --- locale/fr/episodes/40-visualization.Rmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/locale/fr/episodes/40-visualization.Rmd b/locale/fr/episodes/40-visualization.Rmd index 2cc9715ed..58459d8c0 100644 --- a/locale/fr/episodes/40-visualization.Rmd +++ b/locale/fr/episodes/40-visualization.Rmd @@ -146,7 +146,7 @@ rna_plot <- ggplot(data = rna, rna_plot + geom_histogram() ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -198,7 +198,7 @@ ggplot(rna, aes(x = expression_log)) + geom_histogram() À partir de maintenant, nous travaillerons sur les valeurs d’expression transformées en log. 
-::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -343,7 +343,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0, bibliothèque("hexbin") ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -385,7 +385,7 @@ ggplot(data = rna_fc, mapping = aes(x = time_4_vs_0, y = time_8_vs_0)) + :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -427,7 +427,7 @@ ggplot(data = rna, geom_boxplot(alpha = 0) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -465,7 +465,7 @@ ggplot(data = rna, theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5)) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -502,7 +502,7 @@ ggplot(data = rna, :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -529,7 +529,7 @@ ggplot(data = rna, :::::::::::::::::::::::::::::::::::::::::::::::: -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -660,7 +660,7 @@ ggplot(data = moyenne_exp_by_time_sex, theme(panel.grid = element_blank()) ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi @@ -809,7 +809,7 @@ ggplot(rna, aes(x = expression_log)) + blue_theme ``` -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi From 19406482c871537dfcfb54dc7f44f50f4bcfb928 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Thu, 12 Sep 2024 00:17:47 +0900 Subject: [PATCH 315/334] New translations 60-next-steps.md (French) --- locale/fr/episodes/60-next-steps.Rmd | 
2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/60-next-steps.Rmd b/locale/fr/episodes/60-next-steps.Rmd index 99250f094..08d82b85c 100644 --- a/locale/fr/episodes/60-next-steps.Rmd +++ b/locale/fr/episodes/60-next-steps.Rmd @@ -288,7 +288,7 @@ function.--> <!-- ``` --> -::::::::::::::::::::::::::::::::::::::: défi +::::::::::::::::::::::::::::::::::::::: challenge ## Défi From 617783a5ddef69b2a4bc7db76b3b536b7ee8670c Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:17:42 +0900 Subject: [PATCH 316/334] New translations contributing.md (French) --- locale/fr/CONTRIBUTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/fr/CONTRIBUTING.md b/locale/fr/CONTRIBUTING.md index e5957a520..bbd1563ff 100644 --- a/locale/fr/CONTRIBUTING.md +++ b/locale/fr/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https://github.com/swcarpentry/shell-novice, - which can be viewed at https://swcarpentry.github.io/shell-novice. + please work in <https://github.com/swcarpentry/shell-novice>, + which can be viewed at <https://swcarpentry.github.io/shell-novice>. 2. If you wish to change the example lesson, - please work in https://github.com/carpentries/lesson-example, + please work in <https://github.com/carpentries/lesson-example>, which documents the format of our lessons - and can be viewed at https://carpentries.github.io/lesson-example. + and can be viewed at <https://carpentries.github.io/lesson-example>. 3. If you wish to change the template used for workshop websites, - please work in https://github.com/carpentries/workshop-template. + please work in <https://github.com/carpentries/workshop-template>. 
The home page of that repository explains how to set up workshop websites, - while the extra pages in https://carpentries.github.io/workshop-template + while the extra pages in <https://carpentries.github.io/workshop-template> provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https://github.com/carpentries/styles. + please work in <https://github.com/carpentries/styles>. ## What to Contribute From 65ee0a68f912f3adcfb1c909ed817720c48727d5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:17:43 +0900 Subject: [PATCH 317/334] New translations contributing.md (Spanish) --- locale/es/CONTRIBUTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/es/CONTRIBUTING.md b/locale/es/CONTRIBUTING.md index 9fe9e17d8..805c3e2b4 100644 --- a/locale/es/CONTRIBUTING.md +++ b/locale/es/CONTRIBUTING.md @@ -46,23 +46,23 @@ y conocer a algunos de los miembros de nuestra comunidad. ## Dónde contribuir 1. Si desea cambiar esta lección, - , trabaje en https://github.com/swcarpentry/shell-novice, - , que se puede ver en https://swcarpentry.github.io/shell-novice. + , trabaje en <https://github.com/swcarpentry/shell-novice>, + , que se puede ver en <https://swcarpentry.github.io/shell-novice>. 2. Si desea cambiar la lección de ejemplo, - , trabaje en https://github.com/carpentries/lesson-example, + , trabaje en <https://github.com/carpentries/lesson-example>, , que documenta el formato de nuestras lecciones - y se puede ver en https://carpentries.github.io/lesson-example. . + y se puede ver en <https://carpentries.github.io/lesson-example>. . 3. Si desea cambiar la plantilla utilizada para los sitios web de los talleres, - trabaje en https://github.com/carpentries/workshop-template. + trabaje en <https://github.com/carpentries/workshop-template>. 
La página de inicio de ese repositorio explica cómo configurar sitios web de talleres, - , mientras que las páginas adicionales en https://carpentries.github.io/workshop-template + , mientras que las páginas adicionales en <https://carpentries.github.io/workshop-template> brindan más antecedentes sobre nuestras opciones de diseño. 4. Si desea cambiar archivos de estilo CSS, herramientas, o texto estándar HTML para lecciones o talleres almacenados en `_includes` o `_layouts`, - , trabaje en https://github.com/carpentries/styles. + , trabaje en <https://github.com/carpentries/styles>. ## Qué contribuir From a953094fbc1f47687ea26ce0d08aadbc3ffd0e06 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:17:45 +0900 Subject: [PATCH 318/334] New translations contributing.md (Japanese) --- locale/ja/CONTRIBUTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/ja/CONTRIBUTING.md b/locale/ja/CONTRIBUTING.md index e5957a520..bbd1563ff 100644 --- a/locale/ja/CONTRIBUTING.md +++ b/locale/ja/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https://github.com/swcarpentry/shell-novice, - which can be viewed at https://swcarpentry.github.io/shell-novice. + please work in <https://github.com/swcarpentry/shell-novice>, + which can be viewed at <https://swcarpentry.github.io/shell-novice>. 2. If you wish to change the example lesson, - please work in https://github.com/carpentries/lesson-example, + please work in <https://github.com/carpentries/lesson-example>, which documents the format of our lessons - and can be viewed at https://carpentries.github.io/lesson-example. + and can be viewed at <https://carpentries.github.io/lesson-example>. 3. If you wish to change the template used for workshop websites, - please work in https://github.com/carpentries/workshop-template. 
+ please work in <https://github.com/carpentries/workshop-template>. The home page of that repository explains how to set up workshop websites, - while the extra pages in https://carpentries.github.io/workshop-template + while the extra pages in <https://carpentries.github.io/workshop-template> provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https://github.com/carpentries/styles. + please work in <https://github.com/carpentries/styles>. ## What to Contribute From 45cecdcad95515c2c4485717509033ea3f63dbdc Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:17:46 +0900 Subject: [PATCH 319/334] New translations contributing.md (Portuguese) --- locale/pt/CONTRIBUTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/pt/CONTRIBUTING.md b/locale/pt/CONTRIBUTING.md index e5957a520..bbd1563ff 100644 --- a/locale/pt/CONTRIBUTING.md +++ b/locale/pt/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https://github.com/swcarpentry/shell-novice, - which can be viewed at https://swcarpentry.github.io/shell-novice. + please work in <https://github.com/swcarpentry/shell-novice>, + which can be viewed at <https://swcarpentry.github.io/shell-novice>. 2. If you wish to change the example lesson, - please work in https://github.com/carpentries/lesson-example, + please work in <https://github.com/carpentries/lesson-example>, which documents the format of our lessons - and can be viewed at https://carpentries.github.io/lesson-example. + and can be viewed at <https://carpentries.github.io/lesson-example>. 3. If you wish to change the template used for workshop websites, - please work in https://github.com/carpentries/workshop-template. 
+ please work in <https://github.com/carpentries/workshop-template>. The home page of that repository explains how to set up workshop websites, - while the extra pages in https://carpentries.github.io/workshop-template + while the extra pages in <https://carpentries.github.io/workshop-template> provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https://github.com/carpentries/styles. + please work in <https://github.com/carpentries/styles>. ## What to Contribute From 81355e1129c31f08894069184fdecf15259638a5 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:17:47 +0900 Subject: [PATCH 320/334] New translations contributing.md (Chinese Simplified) --- locale/zh/CONTRIBUTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/locale/zh/CONTRIBUTING.md b/locale/zh/CONTRIBUTING.md index e5957a520..bbd1563ff 100644 --- a/locale/zh/CONTRIBUTING.md +++ b/locale/zh/CONTRIBUTING.md @@ -46,23 +46,23 @@ and to meet some of our community members. ## Where to Contribute 1. If you wish to change this lesson, - please work in https://github.com/swcarpentry/shell-novice, - which can be viewed at https://swcarpentry.github.io/shell-novice. + please work in <https://github.com/swcarpentry/shell-novice>, + which can be viewed at <https://swcarpentry.github.io/shell-novice>. 2. If you wish to change the example lesson, - please work in https://github.com/carpentries/lesson-example, + please work in <https://github.com/carpentries/lesson-example>, which documents the format of our lessons - and can be viewed at https://carpentries.github.io/lesson-example. + and can be viewed at <https://carpentries.github.io/lesson-example>. 3. If you wish to change the template used for workshop websites, - please work in https://github.com/carpentries/workshop-template. 
+ please work in <https://github.com/carpentries/workshop-template>. The home page of that repository explains how to set up workshop websites, - while the extra pages in https://carpentries.github.io/workshop-template + while the extra pages in <https://carpentries.github.io/workshop-template> provide more background on our design choices. 4. If you wish to change CSS style files, tools, or HTML boilerplate for lessons or workshops stored in `_includes` or `_layouts`, - please work in https://github.com/carpentries/styles. + please work in <https://github.com/carpentries/styles>. ## What to Contribute From 17f649754b38f937494d04359921dc871e2bbc27 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:27:11 +0900 Subject: [PATCH 321/334] New translations 60-next-steps.md (Portuguese) --- locale/pt/episodes/60-next-steps.Rmd | 60 ++++++++++++++-------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/locale/pt/episodes/60-next-steps.Rmd b/locale/pt/episodes/60-next-steps.Rmd index b261371af..5fbf37dd0 100644 --- a/locale/pt/episodes/60-next-steps.Rmd +++ b/locale/pt/episodes/60-next-steps.Rmd @@ -12,14 +12,14 @@ exercises: 45 - Introduce the Bioconductor project. - Introduce the notion of data containers. -- Give an overview of the `SummarizedExperiment`, extensively used in +- Give an overview of the `SummarizedExperiment2`, extensively used in omics analyses. :::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: questions -- What is a `SummarizedExperiment`? +- What is a `SummarizedExperiment2`? - What is Bioconductor? :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -44,15 +44,15 @@ proporcionar coerência, interoperabilidade e estabilidade ao projeto como um to Para ilustrar um contêiner de dados ômicos, apresentaremos a classe `SummarizedExperiment`. 
-## SummarizedExperiment +## SummarizedExperiment2 -The figure below represents the anatomy of the SummarizedExperiment class. +The figure below represents the anatomy of the SummarizedExperiment2 class. ```{r SE, echo=FALSE, out.width="80%"} knitr::include_graphics("https://uclouvain-cbio.github.io/WSBIM1322/figs/SE.svg") ``` -Objects of the class SummarizedExperiment contain : +Objects of the class SummarizedExperiment2 contain : - **One (or more) assay(s)** containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, @@ -76,9 +76,9 @@ dos metadados da amostra na mesma operação. Os compartimentos de metadados podem aumentar as co-variáveis adicionais (colunas) sem afetar as outras estruturas. -### Creating a SummarizedExperiment +### Creating a SummarizedExperiment2 -In order to create a `SummarizedExperiment`, we will create the +In order to create a `SummarizedExperiment2`, we will create the individual components, i.e the count matrix, the sample and gene metadata from csv files. Normalmente, é assim que os dados de RNA-Seq são fornecidos (depois dos dados brutos terem sido processados). 
@@ -150,7 +150,7 @@ gene_metadata[1:10, 1:4] dim(gene_metadata) ``` -We will create a `SummarizedExperiment` from these tables: +We will create a `SummarizedExperiment2` from these tables: - The count matrix that will be used as the **`assay`** @@ -161,11 +161,11 @@ We will create a `SummarizedExperiment` from these tables: metadata** slot To do this we can put the different parts together using the -`SummarizedExperiment` constructor: +`SummarizedExperiment2` constructor: ```{r, message=FALSE, warning=FALSE} -## BiocManager::install("SummarizedExperiment") -library("SummarizedExperiment") +## BiocManager::install("SummarizedExperiment2") +library("SummarizedExperiment2") ``` First, we make sure that the samples are in the same order in the @@ -178,7 +178,7 @@ stopifnot(colnames(count_matrix) == sample_metadata$sample) ``` ```{r} -se <- SummarizedExperiment(assays = list(counts = count_matrix), +se <- SummarizedExperiment2(assays = list(counts = count_matrix), colData = sample_metadata, rowData = gene_metadata) se @@ -233,9 +233,9 @@ head(rowData(se)) dim(rowData(se)) ``` -### Subsetting a SummarizedExperiment +### Subsetting a SummarizedExperiment2 -SummarizedExperiment can be subset just like with data frames, with +SummarizedExperiment2 can be subset just like with data frames, with numerics or with characters of logicals. Abaixo, criamos uma nova instância da classe SummarizedExperiment que contém apenas as 5 primeiras variáveis para as 3 primeiras amostras. @@ -320,7 +320,7 @@ rna |> :::::::::::::::::::::::::::::::::::::::::::::::::: -The long table and the `SummarizedExperiment` contain the same +The long table and the `SummarizedExperiment2` contain the same information, but are simply structured differently. 
Cada abordagem tem as suas próprias vantagens: a primeira adequa-se bem aos pacotes `tidyverse`, enquanto a segunda é a estrutura preferida para muitas etapas de processamento bioinformático e @@ -340,29 +340,29 @@ colData(se) This illustrates that the metadata slots can grow indefinitely without affecting the other structures! -### tidySummarizedExperiment +### tidySummarizedExperiment2 You may be wondering, can we use tidyverse commands to interact with -`SummarizedExperiment` objects? A resposta é sim, podemos fazê-lo com o pacote +`SummarizedExperiment2` objects? A resposta é sim, podemos fazê-lo com o pacote `tidySummarizedExperiment`. -Remember what our SummarizedExperiment object looks like: +Remember what our SummarizedExperiment2 object looks like: ```{r, message=FALSE} se ``` -Load `tidySummarizedExperiment` and then take a look at the se object +Load `tidySummarizedExperiment2` and then take a look at the se object again. ```{r, message=FALSE} -#BiocManager::install("tidySummarizedExperiment") -library("tidySummarizedExperiment") +#BiocManager::install("tidySummarizedExperiment2") +library("tidySummarizedExperiment2") se ``` -It's still a `SummarizedExperiment` object, so maintains the efficient +It's still a `SummarizedExperiment2` object, so maintains the efficient structure, but now we can view it as a tibble. Repare que na primeira linha do output diz isto: `SummarizedExperiment`\-`tibble` abstraction. Também podemos ver na segunda linha do output o @@ -371,19 +371,19 @@ número de transcrições e amostras. Se quisermos, podemos reverter para a visualização padrão do `SummarizedExperiment`. ```{r} -options("restore_SummarizedExperiment_show" = TRUE) +options("restore_SummarizedExperiment2_show" = TRUE) se ``` But here we use the tibble view. ```{r} -options("restore_SummarizedExperiment_show" = FALSE) +options("restore_SummarizedExperiment2_show" = FALSE) se ``` We can now use tidyverse commands to interact with the -`SummarizedExperiment` object. 
+`SummarizedExperiment2` object. Podemos utilizar `filter` para filtrar as linhas utilizando uma condição, por exemplo, para visualizar todas as linhas de uma amostra. @@ -412,7 +412,7 @@ se %>% summarise(total_counts=sum(counts)) ``` -We can treat the tidy SummarizedExperiment object as a normal tibble +We can treat the tidy SummarizedExperiment2 object as a normal tibble for plotting. Aqui traçamos a distribuição das contagens por amostra. @@ -425,13 +425,13 @@ se %>% theme_bw() ``` -For more information on tidySummarizedExperiment, see the package +For more information on tidySummarizedExperiment2, see the package website -[here](https://stemangiola.github.io/tidySummarizedExperiment/). +[here](https://stemangiola.github.io/tidySummarizedExperiment2/). **Take-home message** -- `SummarizedExperiment` represents an efficient way to store and +- `SummarizedExperiment2` represents an efficient way to store and handle omics data. - They are used in many Bioconductor packages. @@ -443,7 +443,7 @@ Se seguir a próxima formação centrada na análise de sequências de RNA, apre - Bioconductor is a project provide support and packages for the comprehension of high high-throughput biology data. -- A `SummarizedExperiment` is a type of object useful to store and +- A `SummarizedExperiment2` is a type of object useful to store and manage high-throughput omics data. 
:::::::::::::::::::::::::::::::::::::::::::::::::: From 630ab38c54faa24786b94834727bf3c93364c935 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:27:12 +0900 Subject: [PATCH 322/334] New translations 60-next-steps.md (Chinese Simplified) --- locale/zh/episodes/60-next-steps.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/60-next-steps.Rmd b/locale/zh/episodes/60-next-steps.Rmd index 0f2424881..e004a933c 100644 --- a/locale/zh/episodes/60-next-steps.Rmd +++ b/locale/zh/episodes/60-next-steps.Rmd @@ -458,7 +458,7 @@ se %>% - Bioconductor 是一个为 理解高通量生物学数据提供支持和包的项目。 -- A `SummarizedExperiment` is a type of object useful to store and +- A `SummarizedExperiment2` is a type of object useful to store and manage high-throughput omics data. :::::::::::::::::::::::::::::::::::::::::::::::::: From b5a6ddb3cbf673e3a178fe55c20d2ee3286c0aed Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:56:32 +0900 Subject: [PATCH 323/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index 7389714a2..5c492af29 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -89,7 +89,7 @@ Si vous avez dû installer le package **`tidyverse`**, n'oubliez pas de le charg ## Chargement de données avec Tidyverse Instead of `read.csv()`, we will read in our data using the `read_csv()` -function (notice the `_` instead of the `.`), from the tidyverse package +function (notice the `_` instead of the `.`), from the tidyverse2 package **`readr`**. 
```{r, message=FALSE, purl=TRUE} From 53c0562e8b2d04924d6a9bd3a436260e1c2db769 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:56:34 +0900 Subject: [PATCH 324/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index 44dd19727..8114c56a1 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -89,7 +89,7 @@ Si tuvo que instalar el paquete **`tidyverse`**, ¡no olvide cargarlo en esta se ## Cargando datos con tidyverse Instead of `read.csv()`, we will read in our data using the `read_csv()` -function (notice the `_` instead of the `.`), from the tidyverse package +function (notice the `_` instead of the `.`), from the tidyverse2 package **`readr`**. ```{r, message=FALSE, purl=TRUE} From eeb2a6a4ccb84f61cd9f29f0855ebb5a38e97a17 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 14:56:38 +0900 Subject: [PATCH 325/334] New translations 30-dplyr.md (Portuguese) --- locale/pt/episodes/30-dplyr.Rmd | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/locale/pt/episodes/30-dplyr.Rmd b/locale/pt/episodes/30-dplyr.Rmd index 6b6f2e585..6a4d8d472 100644 --- a/locale/pt/episodes/30-dplyr.Rmd +++ b/locale/pt/episodes/30-dplyr.Rmd @@ -21,7 +21,7 @@ exercises: 75 :::::::::::::::::::::::::::::::::::::::: questions -- Data analysis in R using the tidyverse meta-package +- Data analysis in R using the tidyverse2 meta-package :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -62,34 +62,34 @@ you may want to check out this handy data transformation with and this one about . 
-- The **`tidyverse`** package is an "umbrella-package" that installs +- The **`tidyverse2`** package is an "umbrella-package" that installs several useful packages for data analysis which work well together, such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. These packages help us to work and interact with the data. They allow us to do many things with your data, such as subsetting, transforming, visualising, etc. -If you did the set up, you should have already installed the tidyverse package. +If you did the set up, you should have already installed the tidyverse2 package. Check to see if you have it by trying to load in from the library: ```{r, message=FALSE, purl=TRUE} -## load the tidyverse packages, incl. dplyr -library("tidyverse") +## load the tidyverse2 packages, incl. dplyr +library("tidyverse2") ``` -If you got an error message `there is no package called ‘tidyverse’` then you have not -installed the package yet for this version of R. To install the **`tidyverse`** package type: +If you got an error message `there is no package called ‘tidyverse2’` then you have not +installed the package yet for this version of R. To install the **`tidyverse2`** package type: ```{r, eval=FALSE, purl=TRUE} -BiocManager::install("tidyverse") +BiocManager::install("tidyverse2") ``` -If you had to install the **`tidyverse`** package, do not forget to load it in this R session by using the `library()` command above! +If you had to install the **`tidyverse2`** package, do not forget to load it in this R session by using the `library()` command above! -## Loading data with tidyverse +## Loading data with tidyverse2 Instead of `read.csv()`, we will read in our data using the `read_csv()` -function (notice the `_` instead of the `.`), from the tidyverse package +function (notice the `_` instead of the `.`), from the tidyverse2 package **`readr`**. ```{r, message=FALSE, purl=TRUE} @@ -573,7 +573,7 @@ values of a new variable. 
We can do both these of transformations with two `tidyr` functions, `pivot_longer()` and `pivot_wider()` (see -[here](https://tidyr.tidyverse.org/dev/articles/pivot.html) for +[here](https://tidyr.tidyverse2.org/dev/articles/pivot.html) for details). ### Pivoting the data into a wider format @@ -1042,6 +1042,6 @@ write_csv(rna_wide, file = "data_output/rna_wide.csv") :::::::::::::::::::::::::::::::::::::::: keypoints -- Tabular data in R using the tidyverse meta-package +- Tabular data in R using the tidyverse2 meta-package :::::::::::::::::::::::::::::::::::::::::::::::::: From 8ca301387b0c9dbea5833b033f4731c7c31275d9 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 16:09:59 +0900 Subject: [PATCH 326/334] New translations 30-dplyr.md (French) --- locale/fr/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/fr/episodes/30-dplyr.Rmd b/locale/fr/episodes/30-dplyr.Rmd index 5c492af29..1f5efb036 100644 --- a/locale/fr/episodes/30-dplyr.Rmd +++ b/locale/fr/episodes/30-dplyr.Rmd @@ -25,7 +25,7 @@ exercises: 75 :::::::::::::::::::::::::::::::::::::::::::::::::: -```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +```{r loaddata_dplyr2, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") From 64072ea9f1fe9f97c8d47241a4d97dfc6432f18e Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 16:10:01 +0900 Subject: [PATCH 327/334] New translations 30-dplyr.md (Spanish) --- locale/es/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/es/episodes/30-dplyr.Rmd b/locale/es/episodes/30-dplyr.Rmd index 8114c56a1..179df7886 100644 --- a/locale/es/episodes/30-dplyr.Rmd +++ b/locale/es/episodes/30-dplyr.Rmd @@ -25,7 +25,7 @@ exercises: 75 
:::::::::::::::::::::::::::::::::::::::::::::::::: -```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +```{r loaddata_dplyr2, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") From 82f01388bc4ac5dfbac46f617dbb4dda5e6e80c8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 16:10:04 +0900 Subject: [PATCH 328/334] New translations 30-dplyr.md (Japanese) --- locale/ja/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/ja/episodes/30-dplyr.Rmd b/locale/ja/episodes/30-dplyr.Rmd index 7125c9e64..29d353bdb 100644 --- a/locale/ja/episodes/30-dplyr.Rmd +++ b/locale/ja/episodes/30-dplyr.Rmd @@ -24,7 +24,7 @@ exercises: 75 :::::::::::::::::::::::::::::::::::::::::::::::::: -```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +```{r loaddata_dplyr2, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") From 4c48037ff0ab9c97974260199092ad6a2a813f2a Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 16:10:06 +0900 Subject: [PATCH 329/334] New translations 30-dplyr.md (Portuguese) --- locale/pt/episodes/30-dplyr.Rmd | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/locale/pt/episodes/30-dplyr.Rmd b/locale/pt/episodes/30-dplyr.Rmd index 6a4d8d472..c6b16f2e6 100644 --- a/locale/pt/episodes/30-dplyr.Rmd +++ b/locale/pt/episodes/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: Manipulating and analysing data with dplyr +title: Manipulating and analysing data with dplyr2 teaching: 75 exercises: 75 --- @@ -10,7 +10,7 @@ exercises: 75 
::::::::::::::::::::::::::::::::::::::: objectives -- Describe the purpose of the **`dplyr`** and **`tidyr`** packages. +- Describe the purpose of the **`dplyr2`** and **`tidyr`** packages. - Describe several of their functions that are extremely useful to manipulate data. - Describe the concept of a wide and a long table format, and see @@ -25,7 +25,7 @@ exercises: 75 :::::::::::::::::::::::::::::::::::::::::::::::::: -```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +```{r loaddata_dplyr2, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") @@ -34,7 +34,7 @@ download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/mai > This episode is based on the Data Carpentries's _Data Analysis and > Visualisation in R for Ecologists_ lesson. -## Data manipulation using **`dplyr`** and **`tidyr`** +## Data manipulation using **`dplyr2`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. @@ -47,7 +47,7 @@ specific functions. Before you use a package for the first time you need to inst it on your machine, and then you should import it in every subsequent R session when you need it. -- The package **`dplyr`** provides powerful tools for data manipulation tasks. +- The package **`dplyr2`** provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimised. @@ -56,7 +56,7 @@ R session when you need it. this common problem of reshaping data and provides tools for manipulating data in a tidy way. 
-To learn more about **`dplyr`** and **`tidyr`** after the workshop, +To learn more about **`dplyr2`** and **`tidyr`** after the workshop, you may want to check out this handy data transformation with and this one about @@ -64,7 +64,7 @@ and this one about - The **`tidyverse2`** package is an "umbrella-package" that installs several useful packages for data analysis which work well together, - such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. + such as **`tidyr`**, **`dplyr2`**, **`ggplot2`**, **`tibble`**, etc. These packages help us to work and interact with the data. They allow us to do many things with your data, such as subsetting, transforming, visualising, etc. @@ -73,7 +73,7 @@ If you did the set up, you should have already installed the tidyverse2 package. Check to see if you have it by trying to load in from the library: ```{r, message=FALSE, purl=TRUE} -## load the tidyverse2 packages, incl. dplyr +## load the tidyverse2 packages, incl. dplyr2 library("tidyverse2") ``` @@ -112,7 +112,7 @@ the only differences are that: 2. It only prints the first few rows of data and only as many columns as fit on one screen. -We are now going to learn some of the most common **`dplyr`** functions: +We are now going to learn some of the most common **`dplyr2`** functions: - `select()`: subset columns - `filter()`: subset rows on conditions @@ -237,7 +237,7 @@ in the above example, we took the data frame `rna`, _then_ we `filter`ed for rows with `sex == "Male"`, _then_ we `select`ed columns `gene`, `sample`, `tissue`, and `expression`. -The **`dplyr`** functions by themselves are somewhat simple, but by +The **`dplyr2`** functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. 
@@ -334,7 +334,7 @@ rna %>% Many data analysis tasks can be approached using the _split-apply-combine_ paradigm: split the data into groups, apply some -analysis to each group, and then combine the results. **`dplyr`** +analysis to each group, and then combine the results. **`dplyr2`** makes this very easy through the use of the `group_by()` function. ```{r} @@ -426,7 +426,7 @@ rna %>% ### Counting When working with data, we often want to know the number of observations found -for each factor or combination of factors. For this task, **`dplyr`** provides +for each factor or combination of factors. For this task, **`dplyr2`** provides `count()`. For example, if we wanted to count the number of rows of data for each infected and non-infected samples, we would do: @@ -920,7 +920,7 @@ It may be desirable for some analyses to combine data from two or more tables into a single data frame based on a column that would be common to all the tables. -The `dplyr` package provides a set of join functions for combining two +The `dplyr2` package provides a set of join functions for combining two data frames based on matches within specified columns. Here, we provide a short introduction to joins. For further reading, please refer to the chapter about table @@ -954,7 +954,7 @@ annot1 ``` We now want to join these two tables into a single one containing all -variables using the `full_join()` function from the `dplyr` package. The +variables using the `full_join()` function from the `dplyr2` package. The function will automatically find the common variable to match columns from the first and second table. In this case, `gene` is the common variable. Such variables are called keys. Keys are used to match @@ -1018,7 +1018,7 @@ variables of the table have been encoded as missing. 
## Exporting data -Now that you have learned how to use `dplyr` to extract information from +Now that you have learned how to use `dplyr2` to extract information from or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival. From 320100c1dfa5c158b85eb743d3da6dba235b24d8 Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 21 Sep 2024 16:10:08 +0900 Subject: [PATCH 330/334] New translations 30-dplyr.md (Chinese Simplified) --- locale/zh/episodes/30-dplyr.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/locale/zh/episodes/30-dplyr.Rmd b/locale/zh/episodes/30-dplyr.Rmd index 64c763077..16c24af52 100644 --- a/locale/zh/episodes/30-dplyr.Rmd +++ b/locale/zh/episodes/30-dplyr.Rmd @@ -25,7 +25,7 @@ exercises: 75 ::::::::::::::::::::::::::::::::::::::::::::::::::::: -```{r loaddata_dplyr, echo=FALSE, purl=FALSE, message=FALSE} +```{r loaddata_dplyr2, echo=FALSE, purl=FALSE, message=FALSE} if (!file.exists("data/rnaseq.csv")) download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv", destfile = "data/rnaseq.csv") From 32acb40d5abbad0cf1108668db30b9f320e733bb Mon Sep 17 00:00:00 2001 From: Kozo Nishida <kozo.nishida@gmail.com> Date: Sat, 23 Nov 2024 09:03:18 +0900 Subject: [PATCH 331/334] New translations 10-data-organisation.md (Japanese) --- locale/ja/episodes/10-data-organisation.Rmd | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd index c41dcb618..cf0d2e520 100644 --- a/locale/ja/episodes/10-data-organisation.Rmd +++ b/locale/ja/episodes/10-data-organisation.Rmd @@ -79,8 +79,7 @@ Excel 多くの表計算プログラムが利用可能です。 ほとんどの参加者は主なスプレッドシート プログラムとして を使用するため、このレッスンで -Excel の例を使用します。 -で使用できる表計算プログラムは LibreOffice です。 コマンドはプログラム間 +Excel の例を使用します。 で使用できる表計算プログラムは LibreOffice です。 コマンドはプログラム間 少し異なる場合がありますが、一般的な考え方は同じです。 
 スプレッドシート プログラムには、研究者としてできる
@@ -562,7 +561,7 @@ B+、A- などの ABO グループとアカゲザル グループを 1 つのセ
 | 最高\_温度\_C | 最大温度 | 最高温度 (°C) |
 | 降水量\_mm | 降水量 | プレcmm |
 | 平均\_年\_成長 | 平均年成長 | 平均成長率/年 |
-| sex | セックス | 男/女 |
+| 性別 | セックス | 男/女 |
 | weight | 重さ | w。 |
 | セル\_タイプ | セルタイプ | 細胞の種類 |
 | 観察\_01 | 最初の\_観察 | 1回目の観測 |

From a9dda7fd9168821f833e9736764f8b9dd36b88de Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Sat, 23 Nov 2024 09:03:20 +0900
Subject: [PATCH 332/334] New translations instructor-notes.md (Japanese)

---
 locale/ja/instructors/instructor-notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/locale/ja/instructors/instructor-notes.md b/locale/ja/instructors/instructor-notes.md
index a5ec5a2dc..e50d4d20f 100644
--- a/locale/ja/instructors/instructor-notes.md
+++ b/locale/ja/instructors/instructor-notes.md
@@ -1,5 +1,5 @@
 ---
-title: Instructor Notes
+title: 講師メモ
 ---

 FIXME

From 9c41dd746ae8b910ad051536d59e7503436fde84 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Sat, 23 Nov 2024 10:09:29 +0900
Subject: [PATCH 333/334] New translations 10-data-organisation.md (Japanese)

---
 locale/ja/episodes/10-data-organisation.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/locale/ja/episodes/10-data-organisation.Rmd b/locale/ja/episodes/10-data-organisation.Rmd
index cf0d2e520..0c25d3200 100644
--- a/locale/ja/episodes/10-data-organisation.Rmd
+++ b/locale/ja/episodes/10-data-organisation.Rmd
@@ -561,8 +561,8 @@ B+、A- などの ABO グループとアカゲザル グループを 1 つのセ
 | 最高\_温度\_C | 最大温度 | 最高温度 (°C) |
 | 降水量\_mm | 降水量 | プレcmm |
 | 平均\_年\_成長 | 平均年成長 | 平均成長率/年 |
-| 性別 | セックス | 男/女 |
-| weight | 重さ | w。 |
+| | セックス | 男/女 |
+| | 重さ | w。 |
 | セル\_タイプ | セルタイプ | 細胞の種類 |
 | 観察\_01 | 最初の\_観察 | 1回目の観測 |

From 680b773edfbc8a66f2d549c8e3f1e87ba331f5f0 Mon Sep 17 00:00:00 2001
From: Kozo Nishida <kozo.nishida@gmail.com>
Date: Sat, 23 Nov 2024 10:09:34 +0900
Subject: [PATCH 334/334] New translations setup.md (Japanese)

---
 locale/ja/learners/setup.md | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/locale/ja/learners/setup.md b/locale/ja/learners/setup.md
index f962956ed..d8805424b 100644
--- a/locale/ja/learners/setup.md
+++ b/locale/ja/learners/setup.md
@@ -1,21 +1,19 @@
 ---
-title: Setup
+title: セットアップ
 ---

-- Please make sure you have a spreadsheet editor at hand, such as
-  LibreOffice, Microsoft Excel or Google Sheets.
+- LibreOffice、Microsoft Excel、Google Sheetsなど、表計算エディターが手元にあることを確認してください。

-- Install R, RStudio and packages (see below).
+- R、RStudio、パッケージをインストールしてください(下記参照)。

-### R and RStudio
+### RとRStudio

 - RとRStudioは別々にダウンロード、インストールする。 R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment
-  (IDE) that makes using R much easier and more interactive. You need
-  to install R before you install RStudio. After installing both
-  programs, you will need to install some specific R packages within
-  RStudio. Follow the instructions below for your operating system,
+  (IDE) that makes using R much easier and more interactive. RStudioをインストールする前に、Rをインストールしてください。 両方の
+  プログラムをインストールした後、
+  RStudio 内にいくつかの特定の R パッケージをインストールする必要があります。 Follow the instructions below for your operating system,
   and then follow the instructions to install packages.

 ### You are running Windows