First Roadblock: Scraping PDFs from ProQuest Historical Newspapers

Part of my dissertation research involves getting a giant corpus of newspaper articles from ProQuest Historical Newspapers. I need the editorial pages and A section from the Washington Post from 1946-1989 and from the Los Angeles Times from 1963-1989. I also need all of Paul Conrad’s editorial cartoons from the LA Times for the same years, since he was notorious for giving away originals and his archives don’t hold his complete body of work (unlike Herblock, who never gave away an original in his life and left them all to the Library of Congress when he died, which then dutifully scanned them onto a DVD).

I can’t do anything without getting this material. So I have to figure out how to get it. Of course, the first thing I did was contact PHN to see if they could just give me the files I need, but, just my luck, their contracts with the two papers prohibit them from handing over a large data file for the years I need. So I have to get the material another way.

I’m working on contacting the newspapers directly to see if they have a digitized archive and are willing to give the files to me directly, but if that falls through I need to figure out how to get them myself, and I’d like to do it while avoiding an insane amount of clicking. So I dove right in. This is what I’ve done and how far I’ve gotten so far.

First, I had to figure out if there was any rhyme or reason to the URL structure for ProQuest. There isn’t (as far as I can tell right now). But I did discover that each individual article has its own Document ID. Unfortunately, the articles also don’t appear to be numbered in order, so that all the articles from the Times would start at one number and end at another. Because that would be too easy.

I did discover that I could export the Document IDs for all the results from a search. Thankfully, all of Conrad’s cartoons for the year 1969 are correctly titled “Editorial Cartoon 1” (this is not true of other years).

So I selected the search results for the editorial cartoons from June through December and exported them as a custom text-only file, selecting only the “Publication Date” field. The downloaded file shows each selected result looking like this:

____________________________________________________________
Editorial Cartoon 1 -- No Title

http://search.proquest.com/docview/156239263?accountid=14541

Publication date: Jun 1, 1969
____________________________________________________________

The Document ID is the 9-digit number between /docview/ and ?accountid.

But I needed to get this cleaned up. First, I needed to alter the URL to one that resembles what my URL looks like when I log in through the university subscription. I also wanted to change the publication date to a purely numerical format so I could use it to rename the file that I download from ProQuest. More importantly, I needed to get these two lists of information separated from each other and isolated from all the extraneous text.

I started to look into using regular expressions and Python. I very quickly realized this was not going to work for me. I don’t know Python. So instead of trying to learn a whole new language I decided to write all my scripts in PHP. I know PHP. I understand PHP. When I do a Google search because I’m stuck, I understand the answers in PHP.

Using regular expressions and PHP, I wrote a script that creates two arrays containing the cleaned-up URLs and the reconfigured dates and then prints them into two separate text files. All my code and the text files I’ve used can be found on GitHub. This worked great. While the code currently exports the two arrays into two different text files, eventually this script will be expanded to combine the two arrays into a complete wget command that will run on the command line and get the PDF for each URL with its individual Document ID.
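The files on GitHub are the real thing; the snippet below is just a condensed sketch of the same idea. The export filename (results.txt), the regex patterns, and the proxy hostname and database path are stand-ins for my particular export and my university’s proxy, so adjust them for your own setup.

<?php
# Condensed sketch of the cleanup step (not the exact file on GitHub).
# It reads the ProQuest export, rebuilds each docview URL the way it looks
# when I go through the university proxy, and turns "Jun 1, 1969" into
# "6-1-1969.html" for use as an output filename.

$export = file_get_contents('results.txt');

# month abbreviations -> numbers for the date conversion
$months = array('Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4,
                'May' => 5, 'Jun' => 6, 'Jul' => 7, 'Aug' => 8,
                'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12);

# grab every 9-digit Document ID (and account number) from the export
preg_match_all('#/docview/(\d{9})\?accountid=(\d+)#', $export, $ids);

$urls = array();
foreach ($ids[1] as $i => $docId) {
    # rebuild the URL the way it looks through the university proxy
    $urls[] = 'http://search.proquest.com.mutex.gmu.edu/hnpwashingtonpost/docview/'
            . $docId . '?accountid=' . $ids[2][$i];
}

# grab every "Publication date: Jun 1, 1969" line and make it numerical
preg_match_all('/Publication date:\s+(\w{3}) (\d{1,2}), (\d{4})/', $export, $pd);

$dates = array();
foreach ($pd[1] as $i => $mon) {
    # the .html extension becomes the output filename in the wget command
    $dates[] = $months[$mon] . '-' . $pd[2][$i] . '-' . $pd[3][$i] . '.html';
}

# write the two lists into separate text files for the next step
file_put_contents('urls.txt', implode("\n", $urls));
file_put_contents('dates.txt', implode("\n", $dates));

?>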

I started to experiment with wget. The first thing I had to get around was the login page for ProQuest. So I downloaded the “cookie.txt export” extension for Chrome, which gives me a cookies.txt file to use in my wget command. First problem solved. My second problem is that when I try to combine the two arrays and create one long string for my wget command, every iteration but the last one has a line break before my URL gets added. While experimenting, I used smaller files containing only four URLs and four dates. The code looks like this:

<?php

# imports url list as an array
$urls = file('urls.txt');

# imports date list as an array
$dates = file('dates.txt');

# establishes base of wget command
$wget = 'wget --load-cookies cookies.txt -A.pdf -O ';

# collects the assembled commands so they can be printed and inspected
$commands = array();

foreach (array_combine($urls, $dates) as $url => $date) {
    # builds one wget command per article: output filename, then URL
    $command = $wget . $date . " " . $url;
    $commands[] = $command;
    echo $command . "\n";
}

?>

When I printed out the results they looked like this:

wget --load-cookies cookies.txt -A.pdf -O 6-1-1969.html

http://search.proquest.com.mutex.gmu.edu/hnpwashingtonpost/docview/156239263?accountid=14541

wget --load-cookies cookies.txt -A.pdf -O 6-2-1969.html

http://search.proquest.com.mutex.gmu.edu/hnpwashingtonpost/docview/156303225?accountid=14541

wget --load-cookies cookies.txt -A.pdf -O 6-3-1969.html

http://search.proquest.com.mutex.gmu.edu/hnpwashingtonpost/docview/156138168?accountid=14541

wget --load-cookies cookies.txt -A.pdf -O 6-4-1969.html http://search.proquest.com.mutex.gmu.edu/hnpwashingtonpost/docview/156284179?accountid=14541

I want all the commands to look like the fourth one (all on one line), but I can’t figure out what’s causing the line breaks.

Update 6/16/15: I figured out why I was getting line breaks. It’s because the file I was importing the dates from has a line break after each date except the last one. The new final combined code is on my GitHub under the filename: Combo.php
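For anyone hitting the same thing, the quick fix is to strip the newlines as the files are read in. This is just a minimal sketch of that fix (Combo.php on GitHub is the actual script): PHP’s file() function takes a FILE_IGNORE_NEW_LINES flag, and trim() catches anything it misses.

<?php
# Minimal sketch of the newline fix (Combo.php on GitHub is the real script).
# FILE_IGNORE_NEW_LINES drops the trailing newline from each line as it is
# read, and trim() catches any stray carriage returns or spaces.
$urls  = file('urls.txt',  FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$dates = file('dates.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$wget = 'wget --load-cookies cookies.txt -A.pdf -O ';

foreach (array_combine($urls, $dates) as $url => $date) {
    # each command now comes out on a single line
    echo $wget . trim($date) . ' ' . trim($url) . "\n";
}
?>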

But this is just a small hurdle compared to my major roadblock, the one I need to solve before I can move on and actually start gathering my material: I can’t actually get the freaking PDF to download!!!

I’ve tried everything I can think of with wget. The workflow above was put together while still trying to get past this major problem. The URLs I’m using are the only ones I was actually able to get anything to download with. But the HTML page it downloads contains the PDF viewer, so I would have to open each file and choose to download and save the image directly, which defeats the whole purpose of trying to scrape: major clicking avoidance. Not to mention that if I don’t open these files and save the PDF right away, my cookie session times out and the HTML file I’ve downloaded is useless.

[Screenshot: Editorial_Cartoon_1_--_No_Title_-_ProQuest]

Anyone have any advice? Options? How can I get just the PDF and not the HTML page with the PDF viewer? I’ve tried using the recursion levels with wget, but that didn’t work. Is there an alternative to wget? I’ve looked into Beautiful Soup to try to find the direct URL to the PDF (following the “Open with your PDF Viewer” link), but the source code is such a convoluted mess I can’t make heads or tails of it. And if I click on that link and then try to use that direct URL with wget, I get an “Error 403 Forbidden” message on my command line. I’m seriously stuck and don’t know which way to go from here.
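For what it’s worth, the direction I’ve been poking at in PHP is to pull the “Open with your PDF Viewer” link out of the downloaded viewer page with a regular expression and hand it back to wget with the same cookies and a referer header, in case the 403 is the server checking for a logged-in session. The href pattern below is a guess about the page’s markup, and so far this still runs into the 403, so treat it as a sketch rather than a solution.

<?php
# Rough sketch only: the href pattern is a guess at how the viewer page links
# to the PDF, and so far this approach still ends in a 403 for me.

# the viewer page that an earlier wget command saved
$html = file_get_contents('6-1-1969.html');

# look for the first link that points at a .pdf file (assumption about markup)
if (preg_match('/href="([^"]+\.pdf[^"]*)"/i', $html, $m)) {
    $pdfUrl = html_entity_decode($m[1]);

    # reuse the same cookies and send a referer, in case the server rejects
    # requests that do not look like they came from the viewer page
    $command = 'wget --load-cookies cookies.txt '
             . '--referer="http://search.proquest.com.mutex.gmu.edu/" '
             . '-O 6-1-1969.pdf ' . escapeshellarg($pdfUrl);
    exec($command);
} else {
    echo "No PDF link found in the viewer page.\n";
}
?>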

3 Comments

Zac says:

Hey Sasha,

I don’t suppose you had any luck with this – funny that someone else should be trying to scrape from the utter mess that is ProQuest. Obviously the designers knew what they were doing and did their utmost to obscure and confuse any potential scrapers – sadly there really is no alternative for getting newspaper articles.

I actually am stuck at the same place (nearly anyway) as you – I’ve been exploring options such as using firefox with a captured session but nothing seems to work. 403 forbidden every time.

Still looking for a solution. Should I find one, I’ll let you know!


Avigail Oren says:

I was just googling to find out if there was a way to create a corpus from ProQuest Historical Newspapers and I found your post. I’m also a doctoral candidate in history (at Carnegie Mellon) and I am experimenting with corpus analysis for a possible article. None of the existing corpora of American English that I’ve seen really work for the analysis I want to do (because of sampling/representation issues or because they are too recent), and so I hoped to build my own corpus. Obviously, ProQuest owns the digital copies of the newspaper articles that I want to use… so here we are.

I was just wondering if you’d found a way around this issue, or–as the above commenter notes–this remains an intractable problem.

Thanks!
Avigail


Annie says:

Hi, did you ever figure this out? I’m trying to do something similar and it’s very difficult! Thanks and good luck to you!

