Via email from PLoS Webmaster
(highlights & formatting added)
To retrieve an entire volume, we recommend using the PubMed Central FTP service. I’ve included other options below as well: a PHP script for downloading XML files, and wget for fetching PDF and XML files (just change the HTTP address to the location of the PLoS Biology articles).
PubMed Central FTP Service
The PMC FTP Service may be used to download the source files for any article in the PMC Open Access Subset, to associate PMC articles with identifiers such as PubMed IDs, DOIs, Manuscript IDs, and ISSNs, and as a source for data mining.
For more information: http://www.pubmedcentral.nih.gov/about/ftp.html
PMC also has a single archive file (~1 GB) that contains XML (and only XML) files for ALL PMC open access articles. This was created for users who need PMC XML for data mining and processing purposes, but do not need PDFs, images, or supplementary data.
For more information: http://www.pubmedcentral.nih.gov/about/ftp.html#XML_for_Data_Mining
Using a PHP Script
The XML and PDF files are available for each article from the PLoS journal websites. You’ll need to create a script to download these files. Fred Howell wrote a script (http://www.neurogems.org/fetchplos/) to download XML files. This script can be easily modified to also download the relevant PDF files.
PLoS articles are all tagged to parse against the NLM/NIH journal publishing DTD. The latest version is DTD 2.2.
Here’s the link to the Preview XSL Transform that the NLM/NIH group has written for their DTD: http://dtd.nlm.nih.gov/tools/
You can find the DTD referenced in the DOCTYPE declaration of each XML article:
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
All articles can be redistributed and reused according to the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/).
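To see which DTD version a downloaded article declares, you can pull the DOCTYPE declaration out of the file. The sketch below is only an illustration: the function names show_doctype and dtd_version are hypothetical, it assumes the declaration sits on a single line as in the example above, and it relies on grep -o, which GNU and BSD grep support but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: print the first DOCTYPE declaration found in a file.
# Assumption: the whole declaration is on one line.
show_doctype() {
    grep -o '<!DOCTYPE[^>]*>' "$1" | head -n 1
}

# Hypothetical helper: print only the DTD version number (e.g. 2.0),
# taken from the "DTD v<version>" part of the public identifier.
dtd_version() {
    show_doctype "$1" | sed -n 's/.*DTD v\([0-9.]*\).*/\1/p'
}
```

For example, dtd_version journal.pone.0000001.xml would print the version named in that article's DOCTYPE, which lets you check it against the DTD you plan to parse with.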
Using wget
Here’s an example script to use wget to fetch PDF files from PLoS ONE articles 0000001 - 0000009:
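#!/bin/sh
# Fetch the PDF files for PLoS ONE articles 0000001 through 0000009.
for i in $(seq 1 9)
do
    echo "Fetching journal.pone.000000$i.pdf..."
    # -O saves the download under the same name that is recorded in filelist
    wget -q -O "journal.pone.000000$i.pdf" "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.000000$i&representation=PDF"
    echo "journal.pone.000000$i.pdf" >> filelist
    # don't bash the servers!
    sleep 10
done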
Here’s an example script to use wget to fetch XML files from PLoS ONE articles 0000001 - 0000009:
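#!/bin/sh
# Fetch the XML files for PLoS ONE articles 0000001 through 0000009.
for i in $(seq 1 9)
do
    echo "Fetching journal.pone.000000$i.xml..."
    # -O saves the download under the same name that is recorded in filelist
    wget -q -O "journal.pone.000000$i.xml" "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.000000$i&representation=XML"
    echo "journal.pone.000000$i.xml" >> filelist
    # don't bash the servers!
    sleep 10
done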
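The two scripts above differ only in the representation parameter and the file extension, and the hard-coded 000000 prefix limits them to single-digit article numbers. A minimal sketch of one way to generalize them follows. The function names plos_one_url, plos_one_filename, and fetch_range are hypothetical, and it assumes that article numbers above 9 keep the same seven-digit, zero-padded pattern in the DOI.

```shell
#!/bin/sh
# Hypothetical helper: build the fetch URL for one PLoS ONE article.
# $1 = article number without leading zeros (e.g. 1), $2 = PDF or XML.
plos_one_url() {
    # %07d zero-pads to seven digits (1 -> 0000001); %% is a literal %.
    printf 'http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%%3Adoi%%2F10.1371%%2Fjournal.pone.%07d&representation=%s\n' "$1" "$2"
}

# Hypothetical helper: local file name for the same article,
# with the representation lowercased to form the extension.
plos_one_filename() {
    ext=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    printf 'journal.pone.%07d.%s\n' "$1" "$ext"
}

# Hypothetical helper: fetch articles $1 through $2 in representation $3.
fetch_range() {
    for n in $(seq "$1" "$2")
    do
        name=$(plos_one_filename "$n" "$3")
        echo "Fetching $name..."
        wget -q -O "$name" "$(plos_one_url "$n" "$3")"
        echo "$name" >> filelist
        # don't bash the servers!
        sleep 10
    done
}
```

With these definitions, fetch_range 1 9 PDF would reproduce the first script and fetch_range 1 9 XML the second. Pass the article numbers without leading zeros, since printf reads a number such as 08 as invalid octal.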