Commit 7a726bc

committed
work
1 parent c7852e0 commit 7a726bc

1 file changed

+13
-0
lines changed

inst/tutorials/24-web-scraping/tutorial.Rmd

Lines changed: 13 additions & 0 deletions
@@ -499,6 +499,11 @@ Examine the [webpage](https://rvest.tidyverse.org/articles/starwars.html) for th
 read_html()
 ```
 
+Before we go further with web scraping, we should discuss whether it is legal and ethical. The situation is complicated, and the law depends a lot on where you live. As a general rule of thumb, if the data is public, non-personal, and factual, you're most likely OK. These three factors matter because they correspond to the site’s terms and conditions, personally identifiable information, and copyright, respectively.
+If any of these factors does not hold, or you're scraping specifically to make money, it's a good idea to talk to a lawyer. In every case, be respectful of the resources of the server hosting the page(s): if you're scraping many pages, wait a bit between requests.
+
+
+
 ###
 
 The structure of the underlying HTML looks like this:
@@ -663,6 +668,10 @@ now we're going to learn how to streamline this process. Copy https://rvest.tidy
 "https://rvest.tidyverse.org/articles/starwars.html"
 ```
 
+
+
+Rather than piping the web page through the same steps every time we scrape it, we store the result of the pipe in an object that we can reuse when needed.
+
 ###
 
 ### Exercise 7
@@ -687,6 +696,10 @@ Pipe this to `read_html()`.
 
 ```
 
+
+Notice that the steps look similar for now, but this approach will end up needing fewer steps.
+
+
 ###
 
 ### Exercise 8
