Background
For the last few years I’ve been subscribing to Nature Publishing Group’s Nature magazine. It is a peer-reviewed scientific weekly that publishes some of the best and most groundbreaking papers in the hard sciences (and when I say “science”, I don’t mean “English Literature”): biology, chemistry, materials science, physics, and occasionally psychology and mathematics.
It is a great magazine, and probably the only real competition it has in the field is Science, published by AAAS (American Association for the Advancement of Science). AAAS’s Science is also offered as a digital download (and in fact they offer the subscription at a significant discount), but only in Zinio format. Zinio, in turn, is infamous for putting DRM in the downloaded files, giving the original content providers the ability to expire a document after a certain time. So in reality I don’t own the content, like I do when I buy a magazine, but rent it, and they can pull the plug at any moment.
As an aside, I am seeing that at least some flavors of PDF can be configured not to open after a certain date. I wonder, however, how widespread the practice is, and how “portable” such documents are, as, presumably, one would need an extension to open such a PDF. Maybe Zinio is in fact an overgrown extension to PDF documents, with a fancy page-turning animation? Here is a thought. Link to the PPT presentation, which on page 10 says regarding the benefits of PDF: “Security. Allows multiple security settings from fully editable to print only access. Files can be set to expire (cannot open past expire date).”
One of the benefits of a personal subscription to Nature is the ability to download any article they publish as a PDF file once you authenticate and get a cookie loaded into the browser (if you are a student, you can probably check whether your university has a site license for Nature and Science, and whether you can access their web sites through your university’s web proxy). As Nature is a weekly publication, 52 issues a year, each 7 to 8 mm thick, start eating up space on the bookshelf rather rapidly, and weigh a good deal too. As a result, back when I had more time on my hands, I was actually downloading all the PDF pieces for each issue I’d receive, gluing them together into a single “issue” about 10 megs big, and dropping it as a single file on my hard drive. Eventually I stopped doing it, because it was taking a fair bit of my time, and attempts to convince my little brother to do it for me (I was offering up to $2 per issue) were not getting anywhere.
I still keep up my subscription, and occasionally do have time to both read Nature Science Update online and leaf through the actual magazine.
Image unavailable for copyright reasons
So I went and looked at the PDFs on the Nature site today (for the first time in a few months), and I noticed something that is new to me. In the PDF versions of some pages, most noticeably in the feature articles (which are 2 to 3 page articles that describe a particular aspect of science in depth), some images have been replaced with boxes saying “IMAGE UNAVAILABLE FOR COPYRIGHT REASONS”.
Before, the main difference between the PDF and printed versions was the lack of advertisements (well, most of the time; sometimes they would goof, and you’d get to see a half-page ad somewhere), as NPG was a firm believer that in order to advertise in the PDF version an advertiser should pay. Occasionally Nature would retract an article, and then its PDF would be removed, and on occasion the other pages, where the article began and ended, would be censored too.
Now some images are missing. Here is an example. (480K, sorry, I didn’t feel like cropping it).
Note that this is new: I went and double-checked the archives, and older stories don’t have this “feature”. Of course this can change; maybe they haven’t yet had enough manpower to go and look through the back archives.
Why? Why would they do something like that?
Maybe they got nailed by their stock image library. Maybe some photographer took them to court. It’s the larger images that seem to be unavailable, so maybe some kid took the image from some Nature issue, and used it in a high school project.
This kind of crap upsets me.
Spotlight and PDFs
Now, while talking about Spotlight, I started thinking about possibilities. Indexing the entire hard drive is evil (or, in my case, eats too many CPU/IO cycles), so for now I’ve disabled Spotlight in my /etc/hostconfig. However, suppose I were to create a single subdirectory (or partition, or, heck, external drive) for documents, turn off indexing for the boot volume, and turn off indexing for this new volume/subdirectory/partition (from now on: the PDF repository). Then, once I’ve added or modified enough PDF content, I could tell it to index the contents once (but not continuously) using mdimport(1). I’d get all the benefits of Spotlight/mdfind(1) for the documents in the repository (which I presumably want indexed), with none of the slowdowns. So I could have my cake and eat it too.
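As a rough sketch of that plan (not something tested here), the one-off indexing and the later searching could be wrapped in a few lines of Python. The repository path and the helper names are made up for illustration. It is also an assumption, carried over from the paragraph above, that a manual mdimport run still populates the index when continuous indexing is turned off.

```python
import subprocess

# Hypothetical location of the PDF repository; adjust to taste.
PDF_REPO = "/Volumes/Documents/pdf"


def index_once_cmd(repo):
    """Command that asks Spotlight to import everything under repo, once."""
    return ["mdimport", repo]


def search_cmd(repo, query):
    """Command that searches the Spotlight index, limited to repo."""
    return ["mdfind", "-onlyin", repo, query]


def run(cmd):
    """Run a command and return its output lines; raises if it fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

After adding a batch of PDFs, run(index_once_cmd(PDF_REPO)) would refresh the index by hand, and run(search_cmd(PDF_REPO, "copyright")) would list the files that match.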
So I started looking at various options. Dave’s pointer about using wget with cookies seems like the right step forward. I’ll have to make sure that I tell wget to send the same User-Agent string as my browser does, as I recall from older days that folks at Nature actually keep track of that.
I’ve not tried it yet, but I’ll be seeing what I can do.
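For what it’s worth, the cookie-plus-User-Agent idea can also be sketched with Python’s standard library instead of wget (where the matching options would be --load-cookies and --user-agent). This is only an illustration under assumptions: the cookie file is assumed to be a Netscape-format cookies.txt exported from the browser, and the function names, file names and User-Agent string are placeholders, not anything Nature’s site is known to require.

```python
import http.cookiejar
import urllib.request


def make_opener(cookie_file, user_agent):
    """Build a URL opener that sends the browser's cookies and User-Agent.

    cookie_file is a Netscape-format cookies.txt, or None for no cookies.
    """
    jar = http.cookiejar.MozillaCookieJar()
    if cookie_file is not None:
        # Keep session cookies too, since the login cookie may be one.
        jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" header with the browser's string.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch_pdf(opener, url, dest):
    """Download url through the opener and save the body to dest."""
    with opener.open(url) as response, open(dest, "wb") as out:
        out.write(response.read())
```

The opener would be built once per session, for example make_opener("cookies.txt", "the exact string my browser sends"), and then reused for every article PDF in an issue.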
So here are the questions that I don’t know the answers to:
What is the easy way to join a bunch of PDFs into a single PDF? Bonus points if it’s something that can be done from the command line, maybe as a batch.
Also, is there an easy way to screen out duplicate pages in a PDF, preferably not involving human interaction? Under Acrobat 4 (or 3) it was rather simple – I’d just generate previews for each page, and then click on whichever looked similar, and kill them, but that required at least glancing through the PDF document.
Zinio reader uses its own format, which is heavily DRMed, and, as far as I can tell, might actually be based on PDF, as they licensed the PDF library from Adobe. So, the question I have is: Is there a tool that will strip the DRM, and generate a normal PDF out of a Zinio file? The best solution that I have found so far was to print each page to PDF individually, and then basically merge the pieces together, but this is ugly as sin.
Lastly, any better suggestions on how to deal with spidering sites like Nature’s, and pulling down certain types of content? Maybe I shouldn’t try to re-invent the wheel.
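One possible starting point for the first question, offered as an untested sketch rather than a recommendation: Ghostscript’s pdfwrite device can concatenate PDFs from the command line, so a small Python wrapper could batch it per issue. The function names and the directory layout are made up, and it assumes that gs is installed and that the downloaded pieces are named so that sorting by filename gives the page order.

```python
import subprocess
from pathlib import Path


def merge_cmd(pieces, output):
    """Build a Ghostscript command that concatenates pieces into output."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output}",
        *[str(p) for p in pieces],
    ]


def merge_issue(issue_dir, output):
    """Merge every PDF in issue_dir, in filename order, into one file."""
    pieces = sorted(Path(issue_dir).glob("*.pdf"))
    if not pieces:
        raise ValueError(f"no PDFs found in {issue_dir}")
    # check=True makes a Ghostscript failure raise instead of passing silently.
    subprocess.run(merge_cmd(pieces, output), check=True)
```

Calling merge_issue on a directory of downloaded pieces would then produce the single “issue” file. It does nothing about duplicate pages, so that question stays open.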
As usual, feel free to post comments (all 3 people and 100 comment spam bots that occasionally look at the site) 🙂