Automated Reading of Tables from PDF Documents

By Bas Machielsen

May 31, 2020

Introduction

In this post, I will briefly explain how to easily read in a table from the Indonesian data sources that I have been planning to use (see here). First, I download the file, which contains data on the gross regional domestic product (GRDP) of the municipalities in every province:

if (!file.exists("hello.pdf")) {
download.file("https://www.bps.go.id/publication/download.html?nrbvfeve=OTgxMmExYzRlYTI1Mjk4MDA0ODM5NTk2&xzmn=aHR0cHM6Ly93d3cuYnBzLmdvLmlkL3B1YmxpY2F0aW9uLzIwMTkvMTAvMDQvOTgxMmExYzRlYTI1Mjk4MDA0ODM5NTk2L3Byb2R1ay1kb21lc3Rpay1yZWdpb25hbC1icnV0by1rYWJ1cGF0ZW4ta290YS1kaS1pbmRvbmVzaWEtMjAxNC0yMDE4Lmh0bWw%3D&twoadfnoarfeauf=MjAyMi0wOC0wMyAwMDo1NDoyMg%3D%3D", destfile = "hello.pdf")}

Reading in the table

There are different ways to read the table. First, I use the extract_tables function from the tabulizer package, with two arguments:

  • The file path (in this case: “hello.pdf”)
  • The page number (in this case, we start at page 24, so let us take that page as an example)
library(tabulizer)

#remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")

file <- "hello.pdf"

tabulizer::extract_tables(file, pages = 24)
## [[1]]
##       [,1]                       [,2]     [,3]      [,4]     [,5]     [,6]    
##  [1,] "KABUPATEN/KOTA"           ""       ""        ""       ""       ""      
##  [2,] ""                         "2014"   "2015"    "2016"   "2017*"  "2018**"
##  [3,] "Regency/Municipality"     ""       ""        ""       ""       ""      
##  [4,] "(1)"                      "(2)"    "(3)"     "(4)"    "(5)"    "(6)"   
##  [5,] "01. Kab. Simeulue"        "1.517"  "1.640"   "1.773"  "1.897"  "2.048" 
##  [6,] "02. Kab. Aceh Singkil"    "1.686"  "1.816"   "1.981"  "2.127"  "2.259" 
##  [7,] "03. Kab. Aceh Selatan"    "3.930"  "4.227"   "4.554"  "4.855"  "5.210" 
##  [8,] "04. Kab. Aceh Tenggara"   "3.314"  "3.567"   "3.883"  "4.236"  "4.558" 
##  [9,] "05. Kab. Aceh Timur"      "8.695"  "8.196"   "8.510"  "9.105"  "9.700" 
## [10,] "06. Kab. Aceh Tengah"     "5.463"  "5.875"   "6.307"  "6.722"  "7.184" 
## [11,] "07. Kab. Aceh Barat"      "5.469"  "5.828"   "6.184"  "6.943"  "7.651" 
## [12,] "08. Kab. Aceh Besar"      "9.650"  "10.327"  "10.969" "11.637" "12.501"
## [13,] "09. Kab. Pidie"           "7.255"  "7.862"   "8.490"  "9.187"  "9.904" 
## [14,] "10. Kab. Bireun"          "9.392"  "10.069"  "10.726" "11.411" "12.139"
## [15,] "11. Kab. Aceh Utara"      "19.941" "16.340"  "16.696" "17.542" "18.953"
## [16,] "12. Kab. Aceh Barat Daya" "2.780"  "2.968"   "3.175"  "3.395"  "3.635" 
## [17,] "13. Kab. Gayo Lues"       "1.934"  "2.076"   "2.234"  "2.428"  "2.597" 
## [18,] "14. Kab. Aceh Tamiang"    "5.648"  "5.763"   "6.063"  "6.518"  "7.008" 
## [19,] "15. Kab. Nagan Raya"      "5.457"  "5.775"   "6.159"  "6.571"  "6.931" 
## [20,] "16. Kab. Aceh Jaya"       "1.839"  "1.981id" "2.116"  "2.275"  "2.434" 
## [21,] "17. Kab. Bener Meriah"    "3.312"  "3.55.0"  "3.802"  "3.998"  "4.203" 
## [22,] "18. Kab. Pidie Jaya"      "2.389"  "2o.598"  "2.770"  "3.013"  "3.242" 
## [23,] "71. Kota Banda Aceh"      "13.502" "g14.494" "15.801" "16.685" "17.571"
## 
## [[2]]
##      [,1]                                                 [,2]         
## [1,] "71. Kota Banda Aceh"                                ""           
## [2,] "72. Kota Sabang"                                    ""           
## [3,] "73. Kota Langsa 74. Kota Lhokseumawe"               ""           
## [4,] "75. Kota Subulussalam"                              ""           
## [5,] "Jml Kab./Kota Total of Reg./Mun. Provinsi/Province" "ww  128.246"
##      [,3]           [,4] [,5]              [,6]              [,7]             
## [1,] "13.502"       ""   "14.494"          "15.801"          "16.685"         
## [2,] "992 ."        ""   "1.070"           "1.158"           "1.273"          
## [3,] "3.562ps9.321" ""   "3.875 7.636"     "4.217 7.729"     "4.538 8.070"    
## [4,] ". 1.b201"     ""   "1.295"           "1.400"           "1.536"          
## [5,] "127.897"      ""   "128.831 129.093" "136.698 136.844" "145.962 145.807"
##      [,8]            
## [1,] "17.571"        
## [2,] "1.399"         
## [3,] "4.8908.454"    
## [4,] "1.642"         
## [5,] "156.114155.912"

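Not bad at all! But as you can see, it read two tables instead of one, and in the second table it also missed some spaces. Some cells also contain stray characters (for example “1.981id” or “3.562ps9.321”), which look like fragments of the bps.go.id watermark printed on the page. Finally, the row summarizing the total is very confusing for the algorithm. A super useful feature of this package is the ability to select the part of the page you want to extract: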

extract_areas(file, 24)

You can also do this in two, three, or more parts, and then combine the results into the data.frame you want. Most of the time, though, this will still not give you exactly the table you want. One option is then to read the table as text, and to build the table yourself.
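Coming back to the idea of combining several selections: stacking the parts can be as simple as an rbind. This is only a minimal sketch. The matrices part_top and part_bottom and the column names are made up for illustration; they stand in for the character matrices that two separate selections would presumably return, with values copied from the output above.

```r
# Hypothetical stand-ins for two selected areas of the same table.
# In practice these would come from the list returned by extract_areas().
part_top <- matrix(
  c("01. Kab. Simeulue",     "1.517", "1.640",
    "02. Kab. Aceh Singkil", "1.686", "1.816"),
  ncol = 3, byrow = TRUE
)
part_bottom <- matrix(
  c("72. Kota Sabang", "992",   "1.070",
    "73. Kota Langsa", "3.562", "3.875"),
  ncol = 3, byrow = TRUE
)

# Stack the parts; this only works if every part has the same number of columns.
combined <- as.data.frame(do.call(rbind, list(part_top, part_bottom)),
                          stringsAsFactors = FALSE)

# Illustrative column names (the years are taken from the table header)
names(combined) <- c("municipality", "y2014", "y2015")
combined
```

If the parts do not have the same number of columns, the selection itself needs to be redone, which is one reason why reading the page as text is often the easier route.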

Reading the table as text

That is what we will do next. The same tabulizer package also allows us to read in text, in the following way:

extract_text(file, 24)
## [1] "KABUPATEN/KOTA\nRegency/Municipality\n(1) (2) (3) (4) (5) (6)\n01. Kab. Simeulue 1.517            1.640            1.773            1.897            2.048            \n02. Kab. Aceh Singkil 1.686            1.816            1.981            2.127            2.259            \n03. Kab. Aceh Selatan 3.930            4.227            4.554            4.855            5.210            \n04. Kab. Aceh Tenggara 3.314            3.567            3.883            4.236            4.558            \n05. Kab. Aceh Timur 8.695            8.196            8.510            9.105            9.700            \n06. Kab. Aceh Tengah 5.463            5.875            6.307            6.722            7.184            \n07. Kab. Aceh Barat 5.469            5.828            6.184            6.943            7.651            \n08. Kab. Aceh Besar 9.650            10.327          10.969          11.637          12.501          \n09. Kab. Pidie 7.255            7.862            8.490            9.187            9.904            \n10. Kab. Bireun 9.392            10.069          10.726          11.411          12.139          \n11. Kab. Aceh Utara 19.941          16.340          16.696          17.542          18.953          \n12. Kab. Aceh Barat Daya 2.780            2.968            3.175            3.395            3.635            \n13. Kab. Gayo Lues 1.934            2.076            2.234            2.428            2.597            \n14. Kab. Aceh Tamiang 5.648            5.763            6.063            6.518            7.008            \n15. Kab. Nagan Raya 5.457            5.775            6.159            6.571            6.931            \n16. Kab. Aceh Jaya 1.839            1.981            2.116            2.275            2.434            \n17. Kab. Bener Meriah 3.312            3.550            3.802            3.998            4.203            \n18. Kab. Pidie Jaya 2.389            2.598            2.770            3.013            3.242            \n71. 
Kota Banda Aceh 13.502          14.494          15.801          16.685          17.571          \n72. Kota Sabang 992               1.070            1.158            1.273            1.399            \n73. Kota Langsa 3.562            3.875            4.217            4.538            4.890            \n74. Kota Lhokseumawe 9.321            7.636            7.729            8.070            8.454            \n75. Kota Subulussalam 1.201            1.295            1.400            1.536            1.642            \nJml Kab./Kota Total of Reg./Mun. 128.246        128.831        136.698        145.962        156.114        \nProvinsi/Province 127.897        129.093        136.844        145.807        155.912        \nCatatan/Note: \n#) Merupakan pecahan dari kabupaten yang berada diatasnya/ As a part of Regency/Municipality above\n* Angka sementara/Preliminary figures\n** Angka sangat sementara/Very preliminary figures\nTabel/Table  1.\nPDRB Provinsi Aceh Atas Dasar Harga Berlaku \nMenurut Kabupaten/Kota (miliar rupiah), 2014-2018\nGRDP of Aceh Province at Current Market Prices \nby Regency/Municipality (billion rupiahs), 2014-2018\n2018**201620152014 2017*\n3\nht\ntp\ns:\n//w\nww\n.b\nps\n.g\no.\nid\n"

This seems kinda messy, but you will shortly see that this is in fact just what we need. To make this clear, let’s do some cleaning:

library(magrittr) # provides the %>% pipe used below

extract_text(file, 24) %>%
  stringr::str_split("\n") %>%
  lapply(stringr::str_squish) %>%
  magrittr::extract2(1) %>%
  magrittr::extract(4:26) 
##  [1] "01. Kab. Simeulue 1.517 1.640 1.773 1.897 2.048"       
##  [2] "02. Kab. Aceh Singkil 1.686 1.816 1.981 2.127 2.259"   
##  [3] "03. Kab. Aceh Selatan 3.930 4.227 4.554 4.855 5.210"   
##  [4] "04. Kab. Aceh Tenggara 3.314 3.567 3.883 4.236 4.558"  
##  [5] "05. Kab. Aceh Timur 8.695 8.196 8.510 9.105 9.700"     
##  [6] "06. Kab. Aceh Tengah 5.463 5.875 6.307 6.722 7.184"    
##  [7] "07. Kab. Aceh Barat 5.469 5.828 6.184 6.943 7.651"     
##  [8] "08. Kab. Aceh Besar 9.650 10.327 10.969 11.637 12.501" 
##  [9] "09. Kab. Pidie 7.255 7.862 8.490 9.187 9.904"          
## [10] "10. Kab. Bireun 9.392 10.069 10.726 11.411 12.139"     
## [11] "11. Kab. Aceh Utara 19.941 16.340 16.696 17.542 18.953"
## [12] "12. Kab. Aceh Barat Daya 2.780 2.968 3.175 3.395 3.635"
## [13] "13. Kab. Gayo Lues 1.934 2.076 2.234 2.428 2.597"      
## [14] "14. Kab. Aceh Tamiang 5.648 5.763 6.063 6.518 7.008"   
## [15] "15. Kab. Nagan Raya 5.457 5.775 6.159 6.571 6.931"     
## [16] "16. Kab. Aceh Jaya 1.839 1.981 2.116 2.275 2.434"      
## [17] "17. Kab. Bener Meriah 3.312 3.550 3.802 3.998 4.203"   
## [18] "18. Kab. Pidie Jaya 2.389 2.598 2.770 3.013 3.242"     
## [19] "71. Kota Banda Aceh 13.502 14.494 15.801 16.685 17.571"
## [20] "72. Kota Sabang 992 1.070 1.158 1.273 1.399"           
## [21] "73. Kota Langsa 3.562 3.875 4.217 4.538 4.890"         
## [22] "74. Kota Lhokseumawe 9.321 7.636 7.729 8.070 8.454"    
## [23] "75. Kota Subulussalam 1.201 1.295 1.400 1.536 1.642"

Only the last command, magrittr::extract, takes a parameter that has to be adapted to the page, because that is where the actual content of the table is separated from things like the header and footer. In this case, I wanted only the municipalities’ rows, and not the totals or the headers, so I took elements 4 to 26, but I could have included more if I wanted to.

Let us now write this output to a vector called data, and see whether we can split up the strings and put the data in a data.frame.

data <- extract_text(file, 24) %>%
  stringr::str_split("\n") %>%
  lapply(stringr::str_squish) %>%
  magrittr::extract2(1) %>%
  magrittr::extract(4:26) 

Splitting up the data in columns

Naturally, we want a data.frame in which the first column contains the identifier and name of the municipality, and each of the numbers that follow goes into a separate variable. Unfortunately, the stringr::str_split function, which is the usual way to go about this, has some limitations, forcing us either to use complicated regular expressions to split the string into the chunks we want, or to go with a second-best alternative. Fortunately, I found this blog:

strsplit2 <- function(x,
                     split,
                     type = "remove",
                     perl = FALSE,
                     ...) {
  if (type == "remove") {
    # use base::strsplit
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE,
                          ...)
  } else if (type == "after") {
    # split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE,
                          ...)
  } else {
    # wrong type input
    stop("type must be remove, after or before!")
  }
  return(out)
}

which was exactly what I was looking for: I could use a simple regular expression, and clean the data in very few steps:

data %>%
  strsplit2("\\s[0-9]+", type = "before") %>%
  purrr::reduce(rbind) %>%
  as.data.frame(row.names = F)
##                          V1      V2      V3      V4      V5      V6
## 1         01. Kab. Simeulue   1.517   1.640   1.773   1.897   2.048
## 2     02. Kab. Aceh Singkil   1.686   1.816   1.981   2.127   2.259
## 3     03. Kab. Aceh Selatan   3.930   4.227   4.554   4.855   5.210
## 4    04. Kab. Aceh Tenggara   3.314   3.567   3.883   4.236   4.558
## 5       05. Kab. Aceh Timur   8.695   8.196   8.510   9.105   9.700
## 6      06. Kab. Aceh Tengah   5.463   5.875   6.307   6.722   7.184
## 7       07. Kab. Aceh Barat   5.469   5.828   6.184   6.943   7.651
## 8       08. Kab. Aceh Besar   9.650  10.327  10.969  11.637  12.501
## 9            09. Kab. Pidie   7.255   7.862   8.490   9.187   9.904
## 10          10. Kab. Bireun   9.392  10.069  10.726  11.411  12.139
## 11      11. Kab. Aceh Utara  19.941  16.340  16.696  17.542  18.953
## 12 12. Kab. Aceh Barat Daya   2.780   2.968   3.175   3.395   3.635
## 13       13. Kab. Gayo Lues   1.934   2.076   2.234   2.428   2.597
## 14    14. Kab. Aceh Tamiang   5.648   5.763   6.063   6.518   7.008
## 15      15. Kab. Nagan Raya   5.457   5.775   6.159   6.571   6.931
## 16       16. Kab. Aceh Jaya   1.839   1.981   2.116   2.275   2.434
## 17    17. Kab. Bener Meriah   3.312   3.550   3.802   3.998   4.203
## 18      18. Kab. Pidie Jaya   2.389   2.598   2.770   3.013   3.242
## 19      71. Kota Banda Aceh  13.502  14.494  15.801  16.685  17.571
## 20          72. Kota Sabang     992   1.070   1.158   1.273   1.399
## 21          73. Kota Langsa   3.562   3.875   4.217   4.538   4.890
## 22     74. Kota Lhokseumawe   9.321   7.636   7.729   8.070   8.454
## 23    75. Kota Subulussalam   1.201   1.295   1.400   1.536   1.642
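To see why this split works, it helps to look at a single line. The sketch below uses only base R and spells out the pattern that strsplit2 builds when type = "before" is used with the delimiter above; the input string is copied from the cleaned output.

```r
# One cleaned line, copied from the output above
x <- "08. Kab. Aceh Besar 9.650 10.327 10.969 11.637 12.501"

# Pattern built by strsplit2(x, "\\s[0-9]+", type = "before"):
#   (?<=.)         there is a character before the split point, and
#   (?=\\s[0-9]+)  whitespace followed by digits comes right after it.
# Both parts are zero-width, so no character is removed by the split.
pieces <- strsplit(x, "(?<=.)(?=\\s[0-9]+)", perl = TRUE)[[1]]
pieces

# The space stays attached to the front of each number; trim it if needed
trimws(pieces)
```

The name is kept in one piece because a space only triggers a split when a digit follows it, which is the case for the numbers but not for the words in the municipality name.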

The advantage of this method is that it can be used on virtually all tables from the aforementioned .pdf file, requiring very little editing and data wrangling, which saves a lot of time. To demonstrate, let’s look at pages 25, 26 and 27, the pages following page 24, which I started with.

  • Page 25:
data <- extract_text(file, 25) %>%
  stringr::str_split("\n") %>%
  lapply(stringr::str_squish) %>%
  magrittr::extract2(1) %>%
  magrittr::extract(4:36) #Remember to change this parameter


data
##  [1] "01. Kab. Nias 2.443 2.677 2.966 3.234 3.509"                     
##  [2] "02. Kab. Mandailing Natal 8.758 9.586 10.660 11.713 12.618"      
##  [3] "03. Kab. Tapanuli Selatan 9.310 10.058 10.965 11.983 12.902"     
##  [4] "04. Kab. Tapanuli Tengah 6.516 7.140 7.850 8.545 9.230"          
##  [5] "05. Kab. Tapanuli Utara 5.429 5.856 6.300 6.766 7.297"           
##  [6] "06. Kab. Toba Samosir 5.173 5.623 6.124 6.642 7.167"             
##  [7] "07. Kab. Labuhan Batu 22.176 24.083 26.505 29.032 31.303"        
##  [8] "08. Kab. Asahan 24.329 26.465 29.207 32.020 34.667"              
##  [9] "09. Kab. Simalungun 25.338 27.147 30.123 32.832 35.445"          
## [10] "10. Kab. Dairi 6.268 6.823 7.484 8.049 8.736"                    
## [11] "11. Kab. Karo 13.817 15.150 16.728 18.066 19.359"                
## [12] "12. Kab. Deli Serdang 69.674 76.735 85.152 93.194 101.120"       
## [13] "13. Kab. Langkat 27.875 30.742 34.105 37.119 39.819"             
## [14] "14. Kab. Nias Selatan 4.298 4.729 5.193 5.696 6.262"             
## [15] "15. Kab. Humbang Hasundutan 4.050 4.413 4.777 5.130 5.524"       
## [16] "16. Kab. Pakpak Bharat 754 826 917 994 1.083"                    
## [17] "17. Kab. Samosir 2.838 3.144 3.443 3.752 4.085"                  
## [18] "18. Kab. Serdang Bedagai 18.457 20.152 22.114 24.095 25.995"     
## [19] "19. Kab. Batu Bara 23.461 25.395 27.555 29.770 31.972"           
## [20] "20. Kab. Padang Lawas Utara 7.448 8.222 9.074 9.904 10.765"      
## [21] "21. Kab. Padang Lawas 7.288 7.853 8.808 9.705 10.591"            
## [22] "22. Kab. Labuhan Batu Selatan 17.601 19.052 21.004 23.196 25.124"
## [23] "23. Kab. Labuhan Batu Utara 16.262 17.620 19.374 21.162 22.750"  
## [24] "24. Kab. Nias Utara 2.319 2.525 2.775 3.008 3.252"               
## [25] "25. Kab. Nias Barat 1.184 1.289 1.414 1.548 1.672"               
## [26] "71. Kota Sibolga 3.429 3.836 4.263 4.645 5.064"                  
## [27] "72. Kota Tanjung Balai 5.439 6.052 6.723 7.425 8.176"            
## [28] "73. Kota Pematang Siantar 9.555 10.566 11.579 12.444 13.177"     
## [29] "74. Kota Tebing Tinggi 3.912 4.288 4.725 5.123 5.513"            
## [30] "75. Kota Medan 148.247 164.722 184.809 203.016 222.482"          
## [31] "76. Kota Binjai 7.649 8.382 9.112 9.905 10.765"                  
## [32] "77. Kota Padang Sidempuan 4.001 4.425 4.903 5.372 5.859"         
## [33] "78. Kota Gunungsitoli 3.212 3.595 4.034 4.503 5.010"
data %>%
  strsplit2("\\s[0-9]+", type = "before") %>%
  purrr::reduce(rbind) %>%
  as.data.frame(row.names = F)
##                               V1       V2       V3       V4       V5       V6
## 1                  01. Kab. Nias    2.443    2.677    2.966    3.234    3.509
## 2      02. Kab. Mandailing Natal    8.758    9.586   10.660   11.713   12.618
## 3      03. Kab. Tapanuli Selatan    9.310   10.058   10.965   11.983   12.902
## 4       04. Kab. Tapanuli Tengah    6.516    7.140    7.850    8.545    9.230
## 5        05. Kab. Tapanuli Utara    5.429    5.856    6.300    6.766    7.297
## 6          06. Kab. Toba Samosir    5.173    5.623    6.124    6.642    7.167
## 7          07. Kab. Labuhan Batu   22.176   24.083   26.505   29.032   31.303
## 8                08. Kab. Asahan   24.329   26.465   29.207   32.020   34.667
## 9            09. Kab. Simalungun   25.338   27.147   30.123   32.832   35.445
## 10                10. Kab. Dairi    6.268    6.823    7.484    8.049    8.736
## 11                 11. Kab. Karo   13.817   15.150   16.728   18.066   19.359
## 12         12. Kab. Deli Serdang   69.674   76.735   85.152   93.194  101.120
## 13              13. Kab. Langkat   27.875   30.742   34.105   37.119   39.819
## 14         14. Kab. Nias Selatan    4.298    4.729    5.193    5.696    6.262
## 15   15. Kab. Humbang Hasundutan    4.050    4.413    4.777    5.130    5.524
## 16        16. Kab. Pakpak Bharat      754      826      917      994    1.083
## 17              17. Kab. Samosir    2.838    3.144    3.443    3.752    4.085
## 18      18. Kab. Serdang Bedagai   18.457   20.152   22.114   24.095   25.995
## 19            19. Kab. Batu Bara   23.461   25.395   27.555   29.770   31.972
## 20   20. Kab. Padang Lawas Utara    7.448    8.222    9.074    9.904   10.765
## 21         21. Kab. Padang Lawas    7.288    7.853    8.808    9.705   10.591
## 22 22. Kab. Labuhan Batu Selatan   17.601   19.052   21.004   23.196   25.124
## 23   23. Kab. Labuhan Batu Utara   16.262   17.620   19.374   21.162   22.750
## 24           24. Kab. Nias Utara    2.319    2.525    2.775    3.008    3.252
## 25           25. Kab. Nias Barat    1.184    1.289    1.414    1.548    1.672
## 26              71. Kota Sibolga    3.429    3.836    4.263    4.645    5.064
## 27        72. Kota Tanjung Balai    5.439    6.052    6.723    7.425    8.176
## 28     73. Kota Pematang Siantar    9.555   10.566   11.579   12.444   13.177
## 29        74. Kota Tebing Tinggi    3.912    4.288    4.725    5.123    5.513
## 30                75. Kota Medan  148.247  164.722  184.809  203.016  222.482
## 31               76. Kota Binjai    7.649    8.382    9.112    9.905   10.765
## 32     77. Kota Padang Sidempuan    4.001    4.425    4.903    5.372    5.859
## 33         78. Kota Gunungsitoli    3.212    3.595    4.034    4.503    5.010
  • Page 26:
data <- extract_text(file, 26) %>%
  stringr::str_split("\n") %>%
  lapply(stringr::str_squish) %>%
  magrittr::extract2(1) %>%
  magrittr::extract(4:23) #Remember to change this parameter


data %>%
  strsplit2("\\s[0-9]+", type = "before") %>%
  purrr::reduce(rbind) %>%
  as.data.frame(row.names = F)
##                                  V1       V2       V3       V4       V5
## 1       01. Kab. Kepulauan Mentawai    3.027    3.396    3.726    4.089
## 2          02. Kab. Pesisir Selatan    9.114   10.197   11.271   12.522
## 3                    03. Kab. Solok    9.408   10.165   11.053   11.980
## 4                04. Kab. Sijunjung    6.471    6.955    7.439    7.978
## 5              05. Kab. Tanah Datar    9.178    9.901   10.735   11.620
## 6          06. Kab. Padang Pariaman   14.153   15.846   17.533   19.182
## 7                     07. Kab. Agam   13.918   15.248   16.693   18.220
## 8          08. Kab. Lima Puluh Kota   10.564   11.583   12.678   13.783
## 9                  09. Kab. Pasaman    5.951    6.505    7.336    8.008
## 10           10. Kab. Solok Selatan    3.891    4.236    4.598    4.987
## 11             11. Kab. Dharmasraya    7.155    7.725    8.438    9.282
## 12           12. Kab. Pasaman Barat   10.703   11.713   12.794   14.068
## 13                  71. Kota Padang   41.266   45.093   49.386   53.869
## 14                   72. Kota Solok    2.729    2.965    3.241    3.555
## 15             73. Kota Sawah Lunto    2.514    2.715    2.938    3.214
## 16          74. Kota Padang Panjang    2.348    2.533    2.774    3.028
## 17             75. Kota Bukittinggi    5.636    6.170    6.783    7.453
## 18              76. Kota Payakumbuh    4.181    4.655    5.203    5.757
## 19                77. Kota Pariaman    3.406    3.699    4.037    4.387
## 20 Jml Kab./Kota Total of Reg./Mun.  165.612  181.302  198.656  216.982
##          V6
## 1     4.397
## 2    13.643
## 3    12.801
## 4     8.516
## 5    12.393
## 6    20.639
## 7    19.506
## 8    14.739
## 9     8.530
## 10    5.303
## 11    9.917
## 12   14.997
## 13   58.272
## 14    3.835
## 15    3.461
## 16    3.269
## 17    8.069
## 18    6.342
## 19    4.765
## 20  233.394
  • And finally, page 27. Here I use a more general way of setting the parameter: which() tells me which element of the character vector contains “Jml Kab.”, so I know where to stop:
data <- extract_text(file, 27) %>%
  stringr::str_split("\n") %>%
  lapply(stringr::str_squish) %>%
  magrittr::extract2(1) 

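# fixed() matches "Jml Kab." literally, so the dot is not treated as a regex wildcard
uptothis <- which(stringr::str_detect(data, stringr::fixed("Jml Kab.")))[1] - 1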

data <- data %>%
  magrittr::extract(4:uptothis)


data %>%
  strsplit2("\\s[0-9]+", type = "before") %>%
  purrr::reduce(rbind) %>%
  as.data.frame(row.names = F)
##                            V1       V2       V3       V4       V5       V6
## 1   01. Kab. Kuantan Singingi   24.022   25.195   27.522   29.518   30.652
## 2     02. Kab. Indragiri Hulu   33.762   34.584   37.033   38.740   40.392
## 3    03. Kab. Indragiri Hilir   47.822   51.800   57.292   60.891   60.223
## 4          04. Kab. Pelalawan   35.401   38.176   41.165   43.871   46.154
## 5               05. Kab. Siak   85.736   77.236   78.942   79.619   84.674
## 6             06. Kab. Kampar   68.817   66.285   69.676   71.587   77.197
## 7         07. Kab. Rokan Hulu   25.355   26.907   29.146   31.006   32.311
## 8          08. Kab. Bengkalis  165.899  135.505  132.201  132.994  149.407
## 9        09. Kab. Rokan Hilir   74.546   70.693   73.268   74.030   78.707
## 10 10. Kab. Kepulauan Meranti   15.127   15.152   16.044   16.731   18.186
## 11        71. Kota Pekan Baru   73.841   83.664   92.129  101.112  108.840
## 12             73. Kota Dumai   23.628   25.454   27.962   30.299   32.994

Conclusion

In this post, I attempted to explain, and show, how to easily import tables from .pdf documents in R. I left out a couple of other aspects, most notably an endogenous start of the data extraction (a little trickier than an endogenous end). I also did not separate the identifiers from the names, convert the variables to numeric, or assign variable names. If you want to know more about such basic cleaning operations, feel free to message me, or better, go to the tidyverse website.
