More control over handling non-200 responses when scraping

I feel like this is a rather vague feature request, but hopefully the examples below will help to illustrate my point. I think `polite` is a great project and I'd like to see it used more widely.

With `httr` you can ask for the response code from a `GET` request to a URL, and then choose what action to take if, for example, the code is `!= 200`. `polite::scrape` uses `httr`, I believe, but handles the response internally, choosing, for example, to return `NULL` on a 404. I'm wondering if it could be made _less_ opinionated.
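To make the `httr` pattern concrete, here is a minimal sketch of the kind of control I mean. The helper name `fetch_with_status` and its `on_error` argument are hypothetical (they are not part of `httr` or `polite`); only `httr::GET`, `httr::status_code` and `httr::content` are real `httr` functions.

``` r
# Hypothetical helper (not part of httr or polite): the caller decides
# what happens when the response code is not 200.
fetch_with_status <- function(url, on_error = function(resp) NULL) {
  resp <- httr::GET(url)

  # hand the raw response to the caller's handler on any non-200 code
  if (httr::status_code(resp) != 200) {
    return(on_error(resp))
  }

  # otherwise return the parsed body
  httr::content(resp, as = "parsed")
}
```

A caller could then pass, say, `on_error = function(resp) httr::status_code(resp)` to get the code back instead of `NULL`, or a function that returns a placeholder tibble.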

Here's a scraping script I wrote the other day, using `purrr::map_dfr` to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the `NULL` value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using `purrr::possibly` (ex 3 below) or maybe by just using `map` with a `reduce(bind_rows)` ... but it might be good if `polite` gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning `NULL`.

I hope that makes sense. Here are my examples:

``` r
library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404

session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton fbarton@alwaysdata.net",
  delay = 3
)
```

function 1

``` r
scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"
```

function 2 - includes failsafe for 404s/NULL returns

``` r
scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)

  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")

  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()

    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

    dplyr::tibble(headings = headings, dates = dates)
  }
}

# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~
```

function 3 - uses `purrr::possibly` with function 1 to handle errors

``` r
failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly(          # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~
```
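For completeness, here is a rough sketch of the `map` plus `bind_rows` alternative I mentioned above. It assumes the scraping function returns `NULL` early for a failed page (function 1 doesn't, it errors instead), so I've used a hypothetical stand-in, `fake_scrape`, that needs no network access.

``` r
library(dplyr)
library(purrr)

# Hypothetical stand-in for a scraper that returns NULL on a failed request
fake_scrape <- function(id) {
  if (id == 2) {
    return(NULL) # simulates the page that returns a 404
  }
  dplyr::tibble(
    headings = paste("heading", id),
    dates = sprintf("%02d-01-2015", id)
  )
}

result <- purrr::map(1:3, fake_scrape) %>%
  purrr::compact() %>% # drop the NULL elements
  dplyr::bind_rows()   # combine what is left into one tibble
```

Unlike the failsafe versions, this leaves no `NA` row for the missing page, so there is no record in the output that a URL failed.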

<sup>Created on 2020-09-30 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

