Skip to content

More control over handling non-200 responses when scraping #32

@francisbarton

Description

@francisbarton

I feel like this is a rather vague feature request, but hopefully the example below will help to illustrate my point. I think polite is great project and I'd like to see it used more widely.

With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is ! == 200. polite::scrape uses httr I believe, but handles the response internally, choosing to return NULL from a 404 for example. I'm wondering if it could be made less opinionated.

Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the NULL value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below) or maybe by just using map with a reduce(bind_rows) ... but it might be good if polite gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning NULL.

I hope that makes sense. Here's my examples:

library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404

session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton fbarton@alwaysdata.net",
  delay = 3
)

function 1

scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"

function 2 - includes failsafe for 404s/NULL returns

scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)

  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")

  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()

    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

    dplyr::tibble(headings = headings, dates = dates)
  }
}

# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

function 3 - uses purrr::possibly with function 1 to handle errors

failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly(          # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

Created on 2020-09-30 by the reprex package (v0.3.0)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions