
RSelenium: Unable to extract hrefs from page after button click


I'm trying to automate web scraping using RSelenium in R. I've successfully located and clicked a button on a webpage using RSelenium, but I'm having trouble extracting href attributes from the page after the button click.

I actually have a list of 4000 species, but here is an example:

Species <- c("Abies balsamea", "Alchemilla glomerulans", "Antennaria dioica", 
"Atriplex glabriuscula", "Brachythecium salebrosum")

Here's the code I'm using:

library(RSelenium)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)

remDr$open()

remDr$navigate("https://ser-sid.org/")

webElem <- remDr$findElement(using = "class", "flex")

# Find the input field and button within webElem
input_element <- webElem$findChildElement(using = "css selector", value = "input[type='text']")
button_element <- webElem$findChildElement(using = "css selector", value = "button")

# Enter species name into the input field

input_element$sendKeysToElement(list("Abies balsamea"))

# Click the button to submit the form
button_element$clickElement()


Sys.sleep(5)

# Find all <a> elements with species information
species_links <- remDr$findElements(using = "css selector", value = "a[href^='/species/']")

# Extract the href attributes from the species links
hrefs <- sapply(species_links, function(link) {
  link$getElementAttribute("href")[[1]]  # getElementAttribute() returns a list
})

# Filter out missing values (in case some links don't have an href attribute)
hrefs <- hrefs[!is.na(hrefs)]

# Print the extracted hrefs
print(hrefs)

The code runs without errors, but species_links is empty, indicating that the elements with species information are not being located.

I've tried waiting for the page to load after clicking the button, but it seems like the page content isn't fully loading or isn't as expected.
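
A fixed `Sys.sleep(5)` can fire before the client-side rendering has finished. RSelenium has no built-in explicit wait, but one can be sketched as a polling loop (the timeout and interval values here are arbitrary):

```r
# Poll for matching elements instead of relying on a fixed Sys.sleep().
wait_for_elements <- function(driver, selector, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- driver$findElements(using = "css selector", value = selector)
    if (length(elems) > 0) return(elems)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", selector)
    Sys.sleep(interval)
  }
}

# With the session above:
# species_links <- wait_for_elements(remDr, "a[href^='/species/']")
```

If the links still come back empty after a generous timeout, check whether the results are rendered inside an iframe (which requires `switchToFrame()`) or a shadow DOM, which CSS selectors do not pierce by default.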

When I search for Abies balsamea manually on the webpage, I get this:

[screenshot: search results for Abies balsamea]

And from there I would like to at least get this link:

https://ser-sid.org/species/ef741ce8-6911-4286-b79e-3ff0804520fb

I can see the link when I inspect the page, as shown in the following image:

[screenshot: the <a> element in the browser inspector]

How can I troubleshoot this issue and ensure that I can extract hrefs from the page after clicking the button?

Ideally, I would like to loop through a species list such as Species and get a data.frame with the links for each species.

Edit based on Brett Donald's answer

Clearly his approach is the better idea, but I have not found documentation for the API.

Here is what I tried

library(httr)

# Define the API endpoint URL
url <- "https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary"

# Define the query parameters
params <- list(
  select = "*",
  or = "(has_germination.eq.true,has_oil.eq.true,has_protein.eq.true,has_dispersal.eq.true,has_seed_weights.eq.true,has_storage_behaviour.eq.true,has_morphology.eq.true)",
  genus = "ilike.Abies%",
  epithet = "ilike.balsamea%",
  order = "genus.asc.nullslast,epithet.asc.nullslast"
)

# Define the request headers with the correct API key
headers <- add_headers(
  `Content-Type` = "application/json",
  Authorization = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImZ5eGhlZ3V5a3Zld3BkZXlzdm9oIiwicm9sZSI6ImFub24iLCJpYXQiOjE2NDc0MTY1MzQsImV4cCI6MTk2Mjk5MjUzNH0.XhJKVijhMUidqeTbH62zQ6r8cS6j22TYAKfbbRHMTZ8"
)

# Make the GET request
response <- GET(url, query = params, headers = headers)

# Check if the request was successful
if (http_type(response) == "application/json") {
  # Parse the JSON response
  data <- content(response, "parsed")
  print(data)
} else {
  print("Error: Failed to retrieve data")
}

But I get

$message
[1] "No API key found in request"

$hint
[1] "No `apikey` request header or url param was found."
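
The error message means the key never reached Supabase: it looks for an `apikey` request header (or URL parameter), and the same token is conventionally also sent as `Authorization: Bearer <key>`. A sketch of the corrected request, with the live call left commented out (the placeholder stands for the long eyJ... token above):

```r
library(httr)

# The anon key copied from the network tab (the long eyJ... token shown above)
key <- "<anon key>"

# Supabase expects the key in an `apikey` request header; the same token is
# conventionally repeated as a Bearer token in Authorization.
auth <- add_headers(apikey = key, Authorization = paste("Bearer", key))

params <- list(
  select = "*",
  genus = "ilike.Abies%",
  epithet = "ilike.balsamea%"
)

# add_headers() is passed to GET() as an unnamed config argument:
# response <- GET("https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary",
#                 query = params, auth)
# data <- content(response, "parsed")
```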


Solution

  • I can’t see anything wrong with your code, although I am not a user of RSelenium.

    I must admit I’d be tempted to obtain the data not by scraping the website with a robotic browser, but by cloning the API calls which the website uses to retrieve data when you search.

    When you do a search on ser-sid.org, you can discover the API endpoint URL and the API key being used from the Network tab of your browser's developer tools.

    API endpoint URL (including all parameters)

    https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary?select=*&or=%28has_germination.eq.true%2Chas_oil.eq.true%2Chas_protein.eq.true%2Chas_dispersal.eq.true%2Chas_seed_weights.eq.true%2Chas_storage_behaviour.eq.true%2Chas_morphology.eq.true%29&genus=ilike.Abies%25&epithet=ilike.balsamea%25&order=genus.asc.nullslast%2Cepithet.asc.nullslast
    

    API key (a request header)

    eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImZ5eGhlZ3V5a3Zld3BkZXlzdm9oIiwicm9sZSI6ImFub24iLCJpYXQiOjE2NDc0MTY1MzQsImV4cCI6MTk2Mjk5MjUzNH0.XhJKVijhMUidqeTbH62zQ6r8cS6j22TYAKfbbRHMTZ8
    

    I copied these into a new Get request in Postman and was able to get back the following JSON response:

    [
      {
        "genus": "Abies",
        "epithet": "balsamea",
        "id": "ef741ce8-6911-4286-b79e-3ff0804520fb",
        "infraspecies_rank": null,
        "infraspecies_epithet": null,
        "has_germination": false,
        "has_oil": true,
        "has_protein": false,
        "has_dispersal": true,
        "has_seed_weights": true,
        "has_storage_behaviour": true,
        "has_morphology": false
      },
      {
        "genus": "Abies",
        "epithet": "balsamea",
        "id": "024cde5f-7cc5-48b7-89fd-be95638c8f2a",
        "infraspecies_rank": "var.",
        "infraspecies_epithet": "balsamea",
        "has_germination": true,
        "has_oil": false,
        "has_protein": false,
        "has_dispersal": false,
        "has_seed_weights": true,
        "has_storage_behaviour": true,
        "has_morphology": false
      }
    ]
    

    So you could write a simple robot to make these requests in pretty much any language you like. I would do it in Node.js, for example. Isn’t that much easier than trying to scrape the website using a robotic browser?

    PS. As the data in this database is apparently distributed under a Creative Commons licence, you may be able to contact the Society for Ecological Restoration and get a copy of the data you need directly, rather than trying to ingest it one species at a time.
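
For the loop over the full species list in R, the same request can be wrapped in a helper and the results bound into a data.frame. A sketch, assuming the endpoint and anon key above, and that each returned id corresponds to a page at https://ser-sid.org/species/<id> (which matches the link in the question):

```r
library(httr)

key <- "<anon key>"  # the long eyJ... token shown above
base <- "https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary"

# Build the PostgREST filter for a "Genus epithet" string
build_query <- function(sp) {
  parts <- strsplit(sp, " ")[[1]]
  list(
    select = "*",
    genus = paste0("ilike.", parts[1], "%"),
    epithet = paste0("ilike.", parts[2], "%")
  )
}

# One row per match, with the link reconstructed from the returned id
fetch_links <- function(sp) {
  resp <- GET(base, query = build_query(sp),
              add_headers(apikey = key, Authorization = paste("Bearer", key)))
  hits <- content(resp, "parsed")
  if (length(hits) == 0 || !is.null(hits$message)) return(NULL)  # no match / API error
  data.frame(
    species = sp,
    link = paste0("https://ser-sid.org/species/",
                  vapply(hits, function(h) h$id, character(1)))
  )
}

# Species <- c("Abies balsamea", "Alchemilla glomerulans", "Antennaria dioica")
# links <- do.call(rbind, lapply(Species, fetch_links))
```

With 4000 species it would be polite to throttle the loop, e.g. a short `Sys.sleep()` between requests.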