How to find a DOI and create a permanent link using curl

DOI (Digital Object Identifier) is a system for identifying and locating an object (a document) on a network. The object may be an audio file, an image, a paper, a video, and so on. The International DOI Foundation assigns a unique identifier (DOI) to an object so that it can still be retrieved even if its location (URI) changes.

DOI
DOI translator	Naming Authority (prefix)	suffix
http://dx.doi.org/	10.1002	asi.23179
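
The table above can be reproduced in the shell. This is only a small sketch of how I would split a DOI into its two parts with plain parameter expansion; the variable names (doi, prefix, suffix, resolver) are my own and not part of any DOI tool.

```shell
# Split a DOI into its naming-authority prefix and its suffix.
doi="10.1002/asi.23179"
prefix="${doi%%/*}"   # everything before the first slash
suffix="${doi#*/}"    # everything after the first slash
resolver="http://dx.doi.org/"

echo "prefix: $prefix"
echo "suffix: $suffix"
echo "link:   ${resolver}${doi}"
```

Prepending the resolver address to the DOI is all it takes to build the permanent link.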

Prerequisite

A command-line interface (shell/terminal). On Windows, download curl and Cygwin.

A DOI. I will use this one: 10.1002/asi.23179

Curl is a program available on pretty much every operating system and Linux distro (Windows, Unix, Mac OS X ...). What does it really do?

In your terminal, type the following (note: do not type the '-> %' prompt):

 -> %  whatis curl
    curl (1)             - transfer a URL

That doesn't say much :) but curl actually does a lot: it transfers data to or from a URL.

 -> %  man curl | less

 curl  is  a tool to transfer data from or to a server, using one of the
 supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS,  IMAP,
 IMAPS,  LDAP,  LDAPS,  POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS,
 TELNET and TFTP).

(Press 'q' to exit.) We will use this tool to query the URL. As the man page explains further down, we can pass switches or parameters to the curl command. We will pass one parameter: '--head'. It makes curl show only the HTTP response headers that the web server returns for the requested URL.

-> %  curl --head http://dx.doi.org/10.1002/asi.23179

HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://doi.wiley.com/10.1002/asi.23179
Expires: Fri, 30 Jan 2015 20:49:19 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 162
Date: Fri, 30 Jan 2015 04:46:21 GMT

Curl requested this address and the server sent a '303' response code.

303: 'The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource.' Source: w3.org
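
If you want a script to react to that response code, the status can be read from the first header line. Here is a minimal sketch that works on a copy of the headers shown above, so it needs no network; the variable names and the three-way classification are my own assumptions.

```shell
# Sample headers copied from the response shown above (no network needed).
headers='HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://doi.wiley.com/10.1002/asi.23179'

# The status code is the second field of the first line.
status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')

# Classify the code by its first digit.
case "$status" in
  3??) kind="redirect" ;;
  2??) kind="success" ;;
  *)   kind="other" ;;
esac

echo "$status is a $kind"
```

With a live request, the headers variable could be filled from 'curl --head -s' instead of the pasted text.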

Now that you have the URI, let's strip the unnecessary lines from the output, because we only need to print the URI. We will use grep to keep only the line that starts with 'Location'. We will also use another parameter: '-s'.

-s, --silent
        Silent  or  quiet  mode. Don't show progress meter or error mes‐
        sages.  Makes Curl mute. It will still output the data  you  ask
        for, potentially even to the terminal/stdout unless you redirect
        it.

Now you can type the following:

-> % curl --head -s http://dx.doi.org/10.1002/asi.23179  | grep -i "^Location"

Location: http://doi.wiley.com/10.1002/asi.23179

Not bad, but we can do better. We can remove the 'Location:' label from the output using a regular expression.

-> % curl --head -s http://dx.doi.org/10.1002/asi.23179  | grep -Po  '(\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))))'
 
 http://doi.wiley.com/10.1002/asi.23179
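
The regular expression above is quite a mouthful. As an alternative sketch (my own variation, not the only way to do it), sed can simply delete the 'Location:' label and print what is left. The sample headers below are pasted in with printf so the example runs offline; HTTP header lines end with a carriage return, which tr removes.

```shell
# Sample headers with the CRLF line endings that HTTP uses.
headers=$(printf 'HTTP/1.1 303 See Other\r\nVary: Accept\r\nLocation: http://doi.wiley.com/10.1002/asi.23179\r\n')

# Drop the carriage returns, then print only the value of the Location header.
target=$(printf '%s\n' "$headers" | tr -d '\r' | sed -n 's/^[Ll]ocation: *//p')

echo "$target"

# With a live request the same filter would be used like this:
#   curl --head -s http://dx.doi.org/10.1002/asi.23179 | tr -d '\r' | sed -n 's/^[Ll]ocation: *//p'
```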

Let's take this a little further. Say we want to retrieve the title of this paper, along with the authors and so on, as if it were styled in APA style. For that I will dump the HTTP response headers (-D), which may contain cookies. I also want curl to follow the redirection link (-L). I also want the information in plain text (-H "Accept: text/plain"), and within this text I want to keep only the lines that contain 'doi:'.

-D, --dump-header 
    This  option  is  handy to use when you want to store
    the headers that an HTTP site sends to  you.  Cookies
    from  the headers could then be read in a second curl
    ... 

-H, --header 
    (HTTP) Extra header to use when getting a  web  page.
    You  may  specify  any  number of extra headers....

-L, --location
    (HTTP/HTTPS) If the server reports that the requested
    page has moved to  a  different  location  (indicated
    with  a  Location:  header  and a 3XX response code),
    this option will make curl redo the  request  on  the
    new place.

-> % curl -s -D - -L -H "Accept: text/plain" "http://dx.doi.org/10.1002/asi.23179" | grep  "doi:"

Zhu, X., Turney, P., Lemire, D., & Vellino, A. (2014). Measuring academic influence: Not all citations are equal. J Assn Inf Sci Tec, 66(2), 408–427. doi:10.1002/asi.23179
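
If you look up citations often, the command above can be wrapped in a small shell function. This is a hypothetical helper of my own (the name cite and the rough DOI check are assumptions); it only verifies that the argument looks like 'prefix/suffix' before calling curl.

```shell
# Hypothetical helper: print a formatted citation for the DOI given as $1.
cite() {
  case "$1" in
    10.*/?*) ;;                                  # looks like prefix/suffix
    *) echo "not a DOI: $1" >&2; return 1 ;;     # refuse anything else
  esac
  curl -s -L -H "Accept: text/plain" "http://dx.doi.org/$1"
}

# Usage (needs network access):
#   cite 10.1002/asi.23179

# The validation part can be checked offline:
cite "asi.23179" 2>/dev/null || echo "rejected"
```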

Let's say we want to view the web page locally, as it is supposed to look on Wiley's website. We will dump the web page into a file called doi.html and then open it in Firefox. Note: as I'm not quite sure which command opens a browser on Mac OS X and Windows, you can run the first command below, then go to your directory and open the file named doi.html in your browser.

-> % #IMPORTANT for Mac OS X and Windows users:
curl -s -L -H "Accept: text/html;charset=UTF-8" "http://dx.doi.org/10.1002/asi.23179" > doi.html


-> % #Linux distro
curl -s -L -H "Accept: text/html;charset=UTF-8" "http://dx.doi.org/10.1002/asi.23179" > /tmp/doi.html && firefox /tmp/doi.html

Let's retrieve the Turtle (RDF) version instead of plain text or an HTML file.

-> % curl -D - -L -H  "Accept: text/turtle" "http://dx.doi.org/10.1002/asi.23179"

Wait a minute! Something really interesting shows up here: within the header section, the location has changed to Location: http://data.crossref.org/10.1002%2Fasi.23179. This is not the location we had in the previous HTTP response header, Location: http://doi.wiley.com/10.1002/asi.23179. So the redirection has been processed by http://data.crossref.org/. By following this link I realized that crosscite.org provides an API to GET a DOI resource, and their documentation is pretty neat.

So all the steps I did above to retrieve the resource associated with the DOI were done from scratch. I realized that I could have gotten the result faster if I had known about crosscite. For the last example I will follow their documentation instead and, as I don't know much about RDF, I prefer retrieving the resource in JSON instead of Turtle :)
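
Note the '%2F' in the new location: it is simply the slash of the DOI in percent-encoded form. A quick sketch to turn it back into a readable address (it only handles '%2F', which is all this example needs; the variable names are mine):

```shell
# The Location value returned for the Turtle request.
encoded="http://data.crossref.org/10.1002%2Fasi.23179"

# Replace every percent-encoded slash (%2F or %2f) with a real one.
decoded=$(printf '%s\n' "$encoded" | sed 's|%2[Ff]|/|g')

echo "$decoded"
```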

-> % curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1002/asi.23179

#IMPORTANT: if you have Python, you can format the output as in this example.
#With Node.js, if you have the npm json package installed, use '| json' instead of '| python -mjson.tool'.
-> % curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1002/asi.23179 | python -mjson.tool

{  
   "subtitle":[  
      "Measuring Academic Influence"
   ],
   "issued":{  
      "date-parts":[  
         [  
            2014,
            5,
            21
         ]
      ]
   },
   "score":1.0,
   "prefix":"http:\/\/id.crossref.org\/prefix\/10.1002",
   "author":[  
      {  
         "family":"Zhu",
         "given":"Xiaodan"
      },
      {  
         "family":"Turney",
         "given":"Peter"
      },
      {  
         "family":"Lemire",
         "given":"Daniel"
      },
      {  
         "family":"Vellino",
         "given":"Andr\u00e9"
      }
   ],
   "container-title":"Journal of the Association for Information Science and Technology",
   "reference-count":79,
   "page":"408-427",
   "deposited":{  
      "date-parts":[  
         [  
            2015,
            1,
            20
         ]
      ],
      "timestamp":1421712000000
   },
   "issue":"2",
   "funder":[  
      {  
         "award":[  
            "26143"
         ],
         "name":"Natural Sciences and Engineering Research Council of Canada",
         "DOI":"10.13039\/501100000038"
      }
   ],
   "title":"Measuring academic influence: Not all citations are equal",
   "type":"journal-article",
   "DOI":"10.1002\/asi.23179",
   "ISSN":[  
      "2330-1635"
   ],
   "URL":"http:\/\/dx.doi.org\/10.1002\/asi.23179",
   "source":"CrossRef",
   "publisher":"Wiley-Blackwell",
   "indexed":{  
      "date-parts":[  
         [  
            2015,
            1,
            23
         ]
      ],
      "timestamp":1421971998759
   },
   "volume":"66",
   "member":"http:\/\/id.crossref.org\/member\/311"
}
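
Once the output is formatted one key per line, a single field can be pulled out with sed. This is a quick hack under the assumption that the layout matches the sample above; it is not real JSON parsing, and a dedicated JSON tool is a safer choice for anything serious. The three sample lines are copied from the output above.

```shell
# Three lines copied from the formatted output above.
sample='   "subtitle":[
   "container-title":"Journal of the Association for Information Science and Technology",
   "title":"Measuring academic influence: Not all citations are equal",'

# Keep only the value of the line whose key is exactly "title".
title=$(printf '%s\n' "$sample" | sed -n 's/^ *"title": *"\(.*\)",\{0,1\}$/\1/p')

echo "$title"
```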

Voilà!!! This was my long journey to find out how to retrieve a DOI resource.

Challenge:

Starting from the previous example, try to retrieve the XML version instead. You must output only the lines that contain the name 'Vellino' or 'vellino', along with their line numbers.

Hack safe! :)

References:

Guinslym:: My Blog

Guinsly Mondésir Ottawa

Web developer - Pythonist - Rubyist