Wednesday, February 11, 2009

Public Library of Science (PLoS) Automated Download Methods

Via email from PLoS Webmaster


(highlights & formatting added)

To retrieve an entire volume, we recommend using the PubMed Central FTP service. I’ve included two other options below as well: a PHP script for downloading XML files, and wget scripts for fetching PDF and XML files (just change the HTTP address to the location of the PLoS Biology articles).

 

PubMed Central FTP Service

The PMC FTP Service may be used to download the source files for any article in the PMC Open Access Subset, to associate PMC articles with identifiers such as PubMed IDs, DOIs, Manuscript IDs, and ISSNs, and as a source for data mining.

For more information: http://www.pubmedcentral.nih.gov/about/ftp.html

PMC also has a single archive file (~1 GB) that contains XML (and only XML) files for ALL PMC open access articles. This was created for users who need PMC XML for data mining and processing purposes, but do not need PDFs, images, or supplementary data.

For more information: http://www.pubmedcentral.nih.gov/about/ftp.html#XML_for_Data_Mining


Using a PHP Script

The XML and PDF files are available for each article from the PLoS journal websites. You’ll need to create a script to download these files. Fred Howell wrote a script (http://www.neurogems.org/fetchplos/) to download XML files. This script can be easily modified to also download the relevant PDF files.

PLoS articles are all tagged to parse against the NLM/NIH Journal Publishing DTD. The latest version is DTD 2.2.

Here’s the link to the Preview XSL Transform that the NLM/NIH group has written for their DTD: http://dtd.nlm.nih.gov/tools/

You can find the DTD in the DOCTYPE declaration of the XML article, for example:

 
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
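
If you have many downloaded articles, it can be handy to check which DTD each one declares. The following is a minimal sketch, not part of the PLoS email: the function name doctype_public_id is made up for illustration, and it assumes each file contains a single DOCTYPE declaration with a double-quoted public identifier, as in the example above. It prints nothing if no such declaration is found.

```shell
#!/bin/sh
# Hypothetical helper: print the public identifier from the DOCTYPE
# declaration of an XML file.
# $1 = path to an article XML file
doctype_public_id() {
    # Join all lines so a DOCTYPE split across lines still matches,
    # then capture the first double-quoted string after PUBLIC.
    tr '\n' ' ' < "$1" |
        sed -n 's/.*<!DOCTYPE[^>]*PUBLIC[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Demo on a small sample file that uses the DOCTYPE shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article></article>
EOF
doctype_public_id "$sample"
rm -f "$sample"
```

The version number (here v2.0) is part of the printed identifier, so running this over a directory of files shows at a glance which DTD versions are in use.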
 

All articles can be redistributed and reused according to the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/).


Using wget

Here’s an example script to use wget to fetch PDF files from PLoS ONE articles 0000001 - 0000009:

 
#!/bin/sh

for i in $(seq 1 9)
do
    echo "Fetching journal.pone.000000$i.pdf..."
    # -O saves the download under the name recorded in filelist
    wget -q -O "journal.pone.000000$i.pdf" "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.000000$i&representation=PDF"
    echo "journal.pone.000000$i.pdf" >> filelist
    # don't bash the servers!
    sleep 10
done

Here’s an example script to use wget to fetch XML files from PLoS ONE articles 0000001 - 0000009:

 
#!/bin/sh

for i in $(seq 1 9)
do
    echo "Fetching journal.pone.000000$i.xml..."
    # -O saves the download under the name recorded in filelist
    wget -q -O "journal.pone.000000$i.xml" "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.000000$i&representation=XML"
    echo "journal.pone.000000$i.xml" >> filelist
    # don't bash the servers!
    sleep 10
done
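
The two scripts above hardcode six zeros in front of $i, so they only cover article numbers 1 through 9. The sketch below shows one way to build the URL and local file name for larger numbers. It is not part of the PLoS email: the function names plos_one_url and plos_one_filename are made up for illustration, and it assumes that every PLoS ONE article number is zero-padded to seven digits, as in 0000001 - 0000009.

```shell
#!/bin/sh
# Hypothetical helpers, assuming seven-digit zero-padded article numbers.

# Print the download URL for a PLoS ONE article.
# $1 = article number without leading zeros (e.g. 9 or 123)
# $2 = representation, PDF or XML
plos_one_url() {
    # %07d pads the number to seven digits; %% is a literal percent sign
    printf 'http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%%3Adoi%%2F10.1371%%2Fjournal.pone.%07d&representation=%s\n' "$1" "$2"
}

# Print the local file name in the style used by the scripts above.
plos_one_filename() {
    # Lower-case the representation to get the file extension
    ext=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    printf 'journal.pone.%07d.%s\n' "$1" "$ext"
}

# Demo: print the URL and file name for article 9, without downloading.
plos_one_url 9 PDF
plos_one_filename 9 PDF
```

Inside a download loop these could be combined as wget -q -O "$(plos_one_filename "$i" PDF)" "$(plos_one_url "$i" PDF)", keeping the sleep between requests so the servers are not hammered. Pass the article number without leading zeros, since printf may read a zero-prefixed number as octal.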