Batch downloading with Selenium
By Bas Machielsen
May 29, 2020
Introduction
In this blog post, I briefly explain how to batch download files with RSelenium. This can be very useful if you want to download a set of PDFs or other files, but you don’t want to click ‘download’ 1,000 times and there is no other option available.
Step 1: Setting up a Docker Container
In this case, we have to deviate from the standard case of setting up a Docker container. We have to make sure that there is a mapping between the Docker folder where the downloads will end up, and the Download folder on our ‘real’ machine.
$ docker run -d -p 4445:4444 -p 5901:5900 -v /home/bas/Downloads:/home/seluser/Downloads selenium/standalone-firefox
As usual, the -p flags publish ports of the container on our own machine, in the form host:container. Here, port 4445 on our machine forwards to port 4444 in the container, where the Selenium server listens, and port 5901 forwards to port 5900, which Selenium images conventionally use for VNC. The -v flag is what creates the folder mapping, in the form host-directory:container-directory. We first give our own directory, /home/bas/Downloads, and then the directory where the downloads end up inside the container, /home/seluser/Downloads.
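As a small sketch of how the pieces of that command fit together, the R snippet below assembles the same docker run call from its parts. The helper name build_docker_cmd and its arguments are hypothetical, made up for this illustration. It only builds a string and leaves out the VNC port for brevity.

```r
# Hypothetical helper: assemble the `docker run` command from its parts.
build_docker_cmd <- function(host_dir,
                             container_dir = "/home/seluser/Downloads",
                             host_port = 4445L,
                             image = "selenium/standalone-firefox") {
  # -p host:container publishes the Selenium port (4444 inside the container)
  port_flag <- sprintf("-p %d:4444", host_port)
  # -v host:container maps the download folder onto our own machine
  volume_flag <- sprintf("-v %s:%s", host_dir, container_dir)
  paste("docker run -d", port_flag, volume_flag, image)
}

cmd <- build_docker_cmd("/home/bas/Downloads")
print(cmd)
```

To start the container from within R, the resulting string could be passed to system(cmd). Paths that contain spaces would need quoting first, for example with shQuote().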
Step 2: Specifying Firefox Preferences
We also have to specify our (virtual) browser’s preferences. In particular, we have to specify the download folder, and we have to specify that the browser shouldn’t open download windows before downloading something (because Selenium can’t handle that).
ePrefs <- RSelenium::makeFirefoxProfile(
  list(
    # Download folder inside the container
    "browser.download.dir" = "/home/seluser/Downloads",
    # 2 = use the custom folder given in browser.download.dir
    "browser.download.folderList" = 2L,
    "browser.download.manager.showWhenStarting" = FALSE,
    # MIME types that are saved to disk without asking (comma-separated, no stray spaces)
    "browser.helperApps.neverAsk.saveToDisk" = "multipart/x-zip,application/zip,application/x-zip-compressed,application/x-compressed,application/msword,application/csv,text/csv,image/png,image/jpeg,application/pdf,text/html,text/plain,application/excel,application/vnd.ms-excel,application/x-excel,application/x-msexcel,application/octet-stream"
  )
)
Note that you should leave the download directory as ‘/home/seluser/Downloads’, because that is the standard directory the Selenium image creates, and also because you’ve specified a mapping from that directory to your own downloads folder when you set up the Docker container.
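The long list of MIME types is easy to get wrong by hand. One possible way to keep it tidy is to store the types in a character vector and collapse them into a single comma-separated string. The names mime_types, never_ask and prefs are my own choices for this sketch, and the vector is shortened for illustration.

```r
# Illustrative subset of MIME types; extend as needed.
mime_types <- c(
  "application/zip",
  "application/pdf",
  "text/csv",
  "image/png",
  "image/jpeg",
  "application/octet-stream"
)

# Trim stray whitespace and join into one comma-separated string
never_ask <- paste(trimws(mime_types), collapse = ",")

prefs <- list(
  "browser.download.dir" = "/home/seluser/Downloads",
  "browser.download.folderList" = 2L,
  "browser.download.manager.showWhenStarting" = FALSE,
  "browser.helperApps.neverAsk.saveToDisk" = never_ask
)

# With RSelenium installed, the list can then be turned into a profile:
# ePrefs <- RSelenium::makeFirefoxProfile(prefs)
print(never_ask)
```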
Step 3: Downloading a file
Next, we can connect to the server we’ve just created, instructing the browser client to use the preferences (settings) we just stored in ePrefs:
remDr <- RSelenium::remoteDriver(
  browserName = "firefox",
  port = 4445L,
  extraCapabilities = ePrefs
)
remDr$open()
Let’s now navigate to an example website (this website), and download a .csv file which I’ve hidden in there:
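remDr$navigate("https://bas-m.netlify.app")
# download.file() fetches the file directly from R into the working directory, without using the browser
download.file("https://bas-m.netlify.app/iranianmps.csv", destfile = "iranianmps.csv")
# Alternatively, click the link in the Selenium browser, so that the file lands in the mapped Downloads folder:
#click <- remDr$findElement("css", "#step-3-cleaning-the-data > p:nth-child(2) > a:nth-child(1)")
#click$clickElement()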
You can now check whether the file has appeared: in your Downloads folder if you clicked the link in the browser, or in the location given by destfile if you used download.file()!
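Checking by hand works for one file, but in a script it can be handy to wait until the file has really arrived. Below is a minimal sketch of such a check. The function wait_for_file and its arguments are hypothetical names of my own, and the non-empty size check is only a cautious guard against an empty placeholder file, not something the setup above guarantees is needed.

```r
# Hypothetical helper: poll until `path` exists and is non-empty, or give up.
wait_for_file <- function(path, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (file.exists(path) && file.size(path) > 0) {
      return(TRUE)
    }
    Sys.sleep(interval)
  }
  FALSE
}

# Small self-contained demonstration with a temporary file
demo_file <- tempfile(fileext = ".csv")
writeLines("name,party", demo_file)
found <- wait_for_file(demo_file, timeout = 2)
print(found)
```

In the setting of this post, the call would look something like wait_for_file("/home/bas/Downloads/iranianmps.csv").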
Step 4: Example: Download batch files
Let’s now proceed to a more interesting application: batch downloading pictures of archival data from CBS Historisch. We will use the “Jaarcijfers voor Nederland 1943 (500 p.)” and we will start from page 1, and scrape until page 100! We will execute a for loop over several (tens, hundreds of) pages, and download a picture on every page!
#Navigate to page 1
remDr$navigate("https://www.historisch.cbs.nl/detail.php?nav_id=5-1&id=102092112")
#accept the cookies
clickhere <- remDr$findElement(using = "css", "a.cb-enable")
clickhere$clickElement()
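This is the preliminary work. Now, we can start a for loop over the pages. The code below only covers the first 5 pages as a demonstration; to download more pages, widen the range 1:5.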
for (i in 1:5) {
  # Switch to the correct frame
  webElem <- remDr$findElements("css", "iframe")
  remDr$switchToFrame(webElem[[1]])

  # Find the two subsequent download buttons and download the file
  download <- remDr$findElement(using = "css", "a#downloadDirect")
  download$clickElement()
  download2 <- remDr$findElement("css", "a#downloadResLink")
  download2$clickElement()

  # Now navigate to the next page:
  # First, switch back to the original frame
  remDr$switchToFrame(NULL)

  # Then, build the XPath of the button for page i + 1
  path <- "//a[contains(@class, 'custom-navigation-page') and text()='y']"
  path <- stringr::str_replace(path, "y", as.character(i + 1))

  # And click through to the next page
  click <- remDr$findElement("xpath", path)
  click$clickElement()

  # And then we can start again - make sure to add a Sys.sleep so the page can load:
  Sys.sleep(5)
}
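A fixed pause of 5 seconds is simple, but it can be too short on a slow connection and needlessly long on a fast one. One alternative, sketched here under my own assumptions, is to retry an action a few times with a short pause in between. The function retry is a hypothetical helper, not part of RSelenium. It relies on the action raising an error when it fails, which is what findElement does when an element is missing.

```r
# Hypothetical helper: call `action` until it succeeds or attempts run out.
retry <- function(action, attempts = 10, pause = 1) {
  for (k in seq_len(attempts)) {
    # Turn an error into NULL so that we can try again
    result <- tryCatch(action(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    Sys.sleep(pause)
  }
  stop("action did not succeed after ", attempts, " attempts")
}

# Self-contained demonstration: an action that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("not ready yet")
  "found"
}
outcome <- retry(flaky, attempts = 5, pause = 0)
print(outcome)
```

Inside the loop, this could be used as download <- retry(function() remDr$findElement("css", "a#downloadDirect")), so that the script continues as soon as the button is available.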
Don’t forget to close your session afterwards:
remDr$close()
In the future
The next thing you might do with all these pictures is automatically OCR’ing them! The tesseract package allows OCRing in R, but the quality is still very low. Perhaps it would be reasonable to do so once the algorithm is good enough to distinguish between tables and all other text. Thanks for reading!