Merge pull request #260 from sharav12/chapter-9

davidkane9 · web-flow · commit 3e1b064e4c17 · 2025-02-23T16:02:28.000-05:00
work
diff --git a/inst/tutorials/24-web-scraping/tutorial.Rmd b/inst/tutorials/24-web-scraping/tutorial.Rmd
@@ -234,7 +234,7 @@ html_2 |>
 `html_elements()` pulls out all the elements that match the selector, which is provided to the `css` argument. "Elements" consist of a start tag (e.g. \<p\>), optional attributes (e.g. id='first'), and an end tag (e.g. \</p\>). The "contents" of an element are everything in between the start and end tags.
 
 Since there are two elements with the "p" tag, where "p" stands for "paragraph," `html_elements()` returns those two elements. The element with the "h1" tag is not included.
-
+ 
 ### Exercise 8
 
 Pipe `html_2` to `html_elements(".important")`. Note that the `css` argument is ".important" --- with a leading dot --- even though the value of the class attribute is "important" without a dot.
@@ -499,6 +499,11 @@ Examine the [webpage](https://rvest.tidyverse.org/articles/starwars.html) for th
   read_html()
 ```
 
+Before we go further with web scraping, we should consider whether it is legal and ethical to do so. The situation is complicated, and the law depends a lot on where you live. As a general rule of thumb, if the data is public, non-personal, and factual, you're most likely OK. These three factors matter because they connect to the site's terms and conditions, personally identifiable information, and copyright, respectively.
+If any of these conditions does not hold, or if you're scraping specifically to make money, it's a good idea to talk to a lawyer. In every case, be respectful of the resources of the server hosting the page(s): if you're scraping many pages, wait a bit between requests.
+
+
+ 
 ### 
 
 The structure of the underlying HTML looks like this:
@@ -663,6 +668,10 @@ now we're going to learn how to streamline this process. Copy https://rvest.tidy
 "https://rvest.tidyverse.org/articles/starwars.html" 
 ```
 
+
+
+Rather than reading and piping the web page every time we scrape it, we store the result of the pipe in an object, which we can reuse whenever needed.
+
 ### 
 
 ### Exercise 7
@@ -687,6 +696,10 @@ Pipe this to `read_html()`.
          
 ```
 
+
+Notice that the steps so far are similar to what we did before, but from here on fewer steps will be needed.
+
+
 ### 
 
 ### Exercise 8