One of the big selling points of my dissertation is scale. Other historians and academics have looked at nuclear culture during the Cold War, but they have focused on discrete time periods, particular media, or certain manifestations of nuclear fear and anxiety. While my project still focuses on only one form of media (in this case, newspapers and their editorial cartoons), I am looking at all discourses of nuclear fear and anxiety across the full duration of the Cold War.
I plan to accomplish this examination through heavy reliance on digital methodologies, namely topic modeling and text mining. But before I can even begin to run these analyses, I have to gather my data, and this is something that doesn't get discussed, examined, or elucidated much in the DH world: where do data sets come from? In particular, data sets of what you actually need, not just what happens to be available. In my case, what I need is approximately 15-45 years' worth of various newspapers. Succinctly, I need The Washington Post from 1945-1989; The Denver Post from 1950-1964; The Los Angeles Times from 1964-1989; and The Des Moines Register from 1953-1983 (you can read about my dissertation and why these particular papers/years here).
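To give a sense of the kind of text mining I have in mind, here is a toy sketch in Python. Everything in it is invented for illustration (the sample sentence and the little list of fear-related terms are mine, not drawn from any actual corpus); the real analysis will run topic modeling over the full digitized runs of the papers.

```python
from collections import Counter
import re

# Hypothetical fear-related vocabulary to track -- illustrative only
TERMS = {"bomb", "fallout", "shelter", "radiation", "annihilation"}

def term_counts(text):
    """Count occurrences of tracked terms in a block of newspaper text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in TERMS)

# Invented sample sentence standing in for OCR'd article text
sample = "Fears of fallout grew as bomb tests continued; fallout shelters sold out."
print(term_counts(sample))
```

Even this crude counting hints at the payoff of scale: run it over every issue of every paper and you can chart how the vocabulary of nuclear anxiety rises and falls across four decades.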
Two of these papers, The Los Angeles Times and The Washington Post, are digitized on Proquest. The other two are only available on microfilm. So I have two sets of materials to gather, but there is no real "how-to" guide, or blog post, about the best way to gather and organize them. There are many posts about what to do with my texts once I have them, but very little guidance on best practices for putting together a data set of what you actually want, rather than just working with what's readily or easily available.
Not only do I have to figure this out largely on my own, but I have to come up with two very different strategies and ensure that, in the end, I have a consistent data set across the newspapers and formats. I'm just beginning to figure out how this process will work, but instead of waiting until I have it all figured out, I am going to blog my process as it goes. I'm doing this not only so that, when I do write my dissertation, I have an accurate record of what steps I took and what did and didn't work, but also so that anyone thinking about attempting a project of this nature can see what mistakes I've made and save themselves a bit of time.
The first step I took was to contact Proquest to see if there was any way I could circumvent their online infrastructure and get my material straight from them. The technical support person I spoke to explained that Proquest currently does not allow users to data mine its database, nor does the interface facilitate batch downloads of the kind I need. He also mentioned that there were other databases that allow direct access to their materials for exactly the kinds of methodologies I want to use; I responded that I understood, but those databases don't have what I need, and Proquest does. That was precisely why I was contacting them directly: I would rather work with them to gain access to their materials than try to circumvent their interface with a script. After placing me on hold for a few minutes, he came back and said that he would pass my contact information and request along to the Project Managers of the Historical Newspapers division, and they would contact me to see what they could do, or at least discuss the situation with me.
If this approach fails, I will move on to Plan B: experimenting with the best way to web scrape what I need from the database. My only other alternative is to start on January 1 of any given year and download the PDF page view of each page of each issue of each paper I need. I would like to avoid doing this, as it defeats the purpose of having these newspapers digitized: it is essentially the same process I already have to follow for the newspapers on microfilm, going through each reel for the years I need and grabbing a PDF image of each page. My only real concerns here are ensuring that the quality of the PDFs is high enough for ABBYY FineReader to OCR them as accurately as possible, and organizing my PDFs in the best possible manner.
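Whatever the source, database download or microfilm scan, the two halves of the data set need to line up, so I will need one consistent naming scheme for every page image. One possible scheme, sketched in Python (the layout is my own invention, not any archival standard):

```python
from datetime import date

def page_path(paper, issue_date, page):
    """Build a consistent relative path for one page image,
    e.g. denver_post/1950/1950-01-01_p001.pdf, regardless of
    whether it came from Proquest or a microfilm scan."""
    slug = paper.lower().replace(" ", "_")
    return f"{slug}/{issue_date.year}/{issue_date.isoformat()}_p{page:03d}.pdf"

print(page_path("Denver Post", date(1950, 1, 1), 1))
# denver_post/1950/1950-01-01_p001.pdf
```

Encoding the paper, date, and page number directly in the filename means that later batch steps (OCR with FineReader, then mining) can recover the provenance of every page without a separate lookup table.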
This is just a preliminary post, laying the groundwork of my struggles and problems as I move forward. I will continue to blog my progress and the trials and tribulations I experience as I attempt to (a) gather my stuffs (b) convert my stuffs into a usable format and (c) do something with my stuffs.
I received an email from one of the Project Managers of Historical Newspapers yesterday, after publishing this post. While I'm not sure my request was properly explained to him (I think he believes I want to mine the database directly, rather than circumventing the online interface to get the material onto my own computer and then mine it), I think the response would be generally the same. He informed me that Proquest does not allow data mining due to copyright/distribution rights issues and that the platform is not "set up to absorb massive amounts of hits and data transfers." He did mention that they were "in process of starting an add-on service for Historical Newspapers that would allow researchers…to mine the data in a controlled 'sandbox' area, but we are a year away from this service."
I am not terribly surprised by this response; it is what I was expecting. What does interest me is this "controlled sandbox" they are building for researchers "like myself." While I'm sure Proquest's intentions are honorable, it will be interesting to see what kind of data mining they will allow and what tools they will give researchers to use in this "sandbox." In my experience, any time you limit the tools available, and therefore the techniques that can be applied to the data, you stifle the research process by automatically limiting the kinds of questions that can be asked based on how the data can be manipulated.
Still, I look forward to seeing how this “sandbox” eventually turns out even though it will be implemented too late for me to use. In the meantime, I’m off to learn Python so I can write a script to get what I need from Proquest without having to download every day of every issue of every paper.
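For anyone curious what that script will have to do, the brute-force version amounts to enumerating every page of every issue across a date range. A rough sketch of the idea (the URL template here is entirely made up, and real papers vary in page count; the actual requests would depend on however Proquest's page viewer works, which is exactly what I'll be figuring out):

```python
from datetime import date, timedelta

# Hypothetical URL template -- the real Proquest request would look different
URL_TEMPLATE = "https://example.com/{paper}/{d}/page{page}.pdf"

def issue_dates(start, end):
    """Yield every date from start to end inclusive, one per daily issue."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def page_urls(paper, start, end, pages_per_issue):
    """Generate a download URL for each page of each issue in the range."""
    for d in issue_dates(start, end):
        for page in range(1, pages_per_issue + 1):
            yield URL_TEMPLATE.format(paper=paper, d=d.isoformat(), page=page)

# e.g. one week of one paper, assuming a fixed 40 pages per issue
urls = list(page_urls("denver_post", date(1950, 1, 1), date(1950, 1, 7), 40))
print(len(urls))  # 7 issues x 40 pages = 280 URLs
```

Scaled up to 45 years of The Washington Post alone, that loop generates hundreds of thousands of requests, which is precisely why I'd rather negotiate direct access than hammer their servers one page view at a time.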