Commit 3e1b064

Merge pull request #260 from sharav12/chapter-9
work
2 parents cf8a59a + 6ebf0a3 commit 3e1b064

1 file changed

+14
-1
lines changed

inst/tutorials/24-web-scraping/tutorial.Rmd

Lines changed: 14 additions & 1 deletion
@@ -234,7 +234,7 @@ html_2 |>
`html_elements()` pulls out all the elements that match the selector, which is provided to the `css` argument. "Elements" consist of a start tag (e.g. \<p\>), optional attributes (e.g. id='first'), and an end tag (like \</p\>). The "contents" of an element are everything in between the start and end tags.

Since there are two elements with the "p" tag, where the "p" is for "paragraph," the result from `html_elements()` is those two elements. The element with the "h1" tag is not included.
-
+
### Exercise 8

Pipe `html_2` to `html_elements(".important")`. Note that the `css` argument is ".important", with a leading dot, even though the class attribute's value is "important" without a dot.
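
The difference between a tag selector and a class selector can be sketched on a small inline page. This is a minimal illustration, not the tutorial's own `html_2` object: the `page` object and its HTML are assumptions made up for the example, built with `minimal_html()` so that no network access is needed.

```r
library(rvest)

# Hypothetical page, built inline for illustration (not the tutorial's html_2).
page <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

# Tag selector: matches every <p> element and skips the <h1>.
paragraphs <- page |> html_elements("p")

# Class selector: the leading dot means 'match on class',
# so this finds elements whose class attribute is 'important'.
important <- page |> html_elements(".important")

length(paragraphs)
html_text2(important)
```

The dot belongs to the CSS selector syntax, not to the HTML, which is why the attribute itself is written without it.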
@@ -499,6 +499,11 @@ Examine the [webpage](https://rvest.tidyverse.org/articles/starwars.html) for th
read_html()
```

+Before we go further with web scraping, we should consider whether it is legal and ethical to do so. The situation is complicated, and the law depends heavily on where you live. As a general rule of thumb, if the data is public, non-personal, and factual, you're most likely OK. These three factors matter because they connect, respectively, to the site's terms and conditions, personally identifiable information, and copyright.
+If the data fails any of these tests, or you're scraping specifically to make money, it's a good idea to talk to a lawyer. In every case, be respectful of the resources of the server hosting the page(s): if you're scraping many pages, wait a bit between requests.
+
+
+
###

The structure of the underlying HTML looks like this:
@@ -663,6 +668,10 @@ now we're going to learn how to streamline this process. Copy https://rvest.tidy
"https://rvest.tidyverse.org/articles/starwars.html"
```

+
+
+Rather than piping the web page through the same steps every time we scrape it, we create an object to hold the result of the pipe, which we can then reuse whenever needed.
+
###

### Exercise 7
@@ -687,6 +696,10 @@ Pipe this to `read_html()`.
```

+
+Notice that the steps so far look similar to the earlier ones, but this approach will end up requiring fewer steps.
+
+
###

### Exercise 8
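
The advice added in this commit about waiting between requests could be sketched roughly as follows. The helper name `read_pages_politely()`, its `delay` and `fetch` arguments, and the one-second default are all assumptions made for illustration; they are not part of rvest or of the tutorial.

```r
# Hypothetical helper (not part of rvest): read several pages in turn,
# pausing between requests so the server is not hit in rapid succession.
read_pages_politely <- function(urls, delay = 1, fetch = rvest::read_html) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Wait before every request except the first one.
    if (i > 1) Sys.sleep(delay)
    pages[[i]] <- fetch(urls[[i]])
  }
  pages
}
```

A call such as `read_pages_politely("https://rvest.tidyverse.org/articles/starwars.html", delay = 2)` would return a list with one parsed page per URL. The `fetch` argument exists only so the loop can be tried with a stand-in function instead of a real request; a suitable delay depends on the site being scraped.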
