DOI (Digital Object Identifier) is a system for identifying and locating an object (document) on a network. The object may be an audio file, an image, a paper, a video, and so on. Under the system governed by the International DOI Foundation, each object is assigned a unique identifier (a DOI) so that it can be retrieved even if its location (URI) changes.
DOI translator | Naming Authority (prefix) | suffix
http://dx.doi.org/ | 10.1002 | asi.23179
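The prefix/suffix split can also be reproduced in the shell itself. This is only a minimal sketch using POSIX parameter expansion; the variable names are mine and are not part of any DOI tooling.

```shell
#!/bin/sh
# Split a DOI into its naming-authority prefix and its suffix.
doi="10.1002/asi.23179"

prefix="${doi%%/*}"   # everything before the first "/"
suffix="${doi#*/}"    # everything after the first "/"

echo "prefix: $prefix"
echo "suffix: $suffix"
echo "url:    http://dx.doi.org/$doi"
```

Splitting on the first "/" only matters because a suffix may itself contain slashes.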
Prerequisites
A command line interface (shell/terminal) (For Windows: download curl and Cygwin)
A DOI. I will use this one: 10.1002/asi.23179
Curl is a program available on pretty much every operating system and Linux distribution (Windows, Unix, Mac OS X ...). What does it really do?
In your terminal type the following: (Note: please don't type '-> %')
-> % whatis curl
curl (1) - transfer a URL
It doesn't say much :) but curl actually does a lot: it transfers data to or from a URL.
-> % man curl | less
curl is a tool to transfer data from or to a server, using one of the supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP).
(Press 'q' to exit.) So we will use this tool to retrieve the URL. As described further down in the man page, we can pass switches or parameters to the curl command. We will pass one parameter, '--head'. With it, curl sends a HEAD request, so you see only the HTTP response headers that the web server returns for the URL.
-> % curl --head http://dx.doi.org/10.1002/asi.23179
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://doi.wiley.com/10.1002/asi.23179
Expires: Fri, 30 Jan 2015 20:49:19 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 162
Date: Fri, 30 Jan 2015 04:46:21 GMT
Curl went to this address and the server sent a '303' response code.
303: 'The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource.' (source: w3.org)
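To check the response code in a script rather than by eye, the first line of the headers can be picked apart. The sketch below is only one way to do it: status_code and sample_headers are helper names I made up, and the sample is a shortened copy of the headers from this post, so nothing is sent over the network.

```shell
#!/bin/sh
# Print the status code from the first line of an HTTP header block on stdin.
status_code() {
    awk 'NR == 1 { print $2 }'
}

# Shortened copy of the response headers shown in this post.
sample_headers() {
    printf 'HTTP/1.1 303 See Other\r\n'
    printf 'Server: Apache-Coyote/1.1\r\n'
    printf 'Location: http://doi.wiley.com/10.1002/asi.23179\r\n'
}

code=$(sample_headers | status_code)
case "$code" in
    3??) echo "$code: redirect, look at the Location header" ;;
    *)   echo "$code: not a redirect" ;;
esac
```

Against the live server, the same helper would presumably be used as: curl --head -s http://dx.doi.org/10.1002/asi.23179 | status_code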
Now that you have the URI, let's remove all the unnecessary lines from the output, because we only need to print the URI. We will use grep to keep only the line we need: the one that contains 'Location'. We will also use another parameter: '-s'
-s, --silent
Silent or quiet mode. Don't show progress meter or error messages. Makes Curl mute. It will still output the data you ask for, potentially even to the terminal/stdout unless you redirect it.
Now you can type the following:
-> % curl --head -s http://dx.doi.org/10.1002/asi.23179 | grep "Location"
Location: http://doi.wiley.com/10.1002/asi.23179
Not bad, but we can do better. We can remove the word 'Location' from the output using a regular expression (the -P switch, for Perl-style patterns, is specific to GNU grep).
-> % curl --head -s http://dx.doi.org/10.1002/asi.23179 | grep -Po '(\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))))'
http://doi.wiley.com/10.1002/asi.23179
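If the URL regular expression feels heavy, the header value can also be taken by field position. The following is a sketch of that alternative, not a replacement for the command above; location_of is a name I made up, and the input is a canned copy of the headers so it runs without a network.

```shell
#!/bin/sh
# Print the value of the Location header from an HTTP header block on stdin.
# Header names are matched case-insensitively, and the trailing carriage
# return that HTTP puts at the end of each line is stripped first.
location_of() {
    awk '
        { sub(/\r$/, "") }
        tolower($1) == "location:" { print $2 }
    '
}

printf 'HTTP/1.1 303 See Other\r\nVary: Accept\r\nLocation: http://doi.wiley.com/10.1002/asi.23179\r\n' | location_of
```

With the real server this would be: curl --head -s http://dx.doi.org/10.1002/asi.23179 | location_of. curl also has a --write-out variable, %{redirect_url}, which may make the filtering unnecessary; check the man page of your curl version.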
Let's take this a little further. Say we want to retrieve the title of this paper, along with the authors and so on, styled as an APA citation. For that I will dump the HTTP response headers (-D), which may contain cookies. I also want curl to follow the redirection link (-L) and to ask for the information in plain text (-H). Within this text, I want to keep only the lines that contain 'doi:'.
-D, --dump-header
This option is handy to use when you want to store the headers that an HTTP site sends to you. Cookies from the headers could then be read in a second curl ...
-H, --header
(HTTP) Extra header to use when getting a web page. You may specify any number of extra headers....
-L, --location
(HTTP/HTTPS) If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code), this option will make curl redo the request on the new place.
-> % curl -s -D - -L -H "Accept: text/plain" "http://dx.doi.org/10.1002/asi.23179" | grep "doi:"
Zhu, X., Turney, P., Lemire, D., & Vellino, A. (2014). Measuring academic influence: Not all citations are equal. J Assn Inf Sci Tec, 66(2), 408–427. doi:10.1002/asi.23179
Let's say that we want to see the web page locally, the way it is supposed to look on Wiley's website. We will dump the web page into a file called doi.html and then open it in Firefox. Note: as I'm not quite sure which command opens a browser on Mac OS X and Windows, you can run the first command below, then go to your directory and open the file named doi.html in your browser.
# IMPORTANT for Mac OS X and Windows users:
-> % curl -s -L -H "Accept: text/html;charset=UTF-8" "http://dx.doi.org/10.1002/asi.23179" > doi.html
# Linux distros:
-> % curl -s -L -H "Accept: text/html;charset=UTF-8" "http://dx.doi.org/10.1002/asi.23179" > /tmp/doi.html && firefox /tmp/doi.html
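For readers who want a one-liner on other systems too, here is a hedged sketch that guesses an "open this file" command from the output of uname -s. The function name opener_for is made up, the mapping is my assumption about common setups (open on Mac OS X, cygstart under Cygwin, xdg-open from xdg-utils on Linux desktops), and it opens whatever the default browser is rather than Firefox specifically.

```shell
#!/bin/sh
# Guess a file-opening command from the kernel name reported by `uname -s`.
opener_for() {
    case "$1" in
        Darwin)  echo "open" ;;       # Mac OS X
        CYGWIN*) echo "cygstart" ;;   # Cygwin on Windows
        *)       echo "xdg-open" ;;   # assumption: a Linux desktop with xdg-utils
    esac
}

opener=$(opener_for "$(uname -s)")
echo "On this machine the page could be opened with: $opener doi.html"
```

The script only prints the suggested command, so it is safe to run before doi.html exists.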
Let's retrieve the Turtle (an RDF format) instead of plain text or an HTML file.
-> % curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1002/asi.23179"
Wait a minute! Something really interesting is happening here: within the header section, the location has changed to Location: http://data.crossref.org/10.1002%2Fasi.23179. This wasn't the location that we had in the previous HTTP response header, Location: http://doi.wiley.com/10.1002/asi.23179. So the redirection has been processed by http://data.crossref.org/. By following this link I realized that crosscite.org provides an API to GET a DOI resource, and their documentation is pretty neat.
So all the steps I did above to retrieve the resource associated with the DOI were done from scratch. I realized that I could have gotten the result faster if I had known about crosscite. For the last example I will use their documentation instead and, as I don't know much about RDF, I prefer retrieving the resource in JSON instead of Turtle :)
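The requests so far differ only in the Accept header, so they can be folded into one small helper. This is a sketch under my own assumptions: accept_for, doi_get and the short format names are invented for illustration, and only the header values come from this post. The demo loop exercises the mapping alone, so it makes no network request.

```shell
#!/bin/sh
# Map a short, made-up format name to an Accept header value used in this post.
accept_for() {
    case "$1" in
        text)   echo "text/plain" ;;
        html)   echo "text/html;charset=UTF-8" ;;
        turtle) echo "text/turtle" ;;
        *)      echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Fetch a DOI in the requested format (hypothetical helper; needs network access).
doi_get() {
    accept=$(accept_for "$2") || return 1
    curl -s -L -H "Accept: $accept" "http://dx.doi.org/$1"
}

# Show the mapping without contacting any server.
for fmt in text html turtle; do
    echo "$fmt -> $(accept_for "$fmt")"
done
```

A call would then look like: doi_get 10.1002/asi.23179 turtle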
-> % curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1002/asi.23179
#IMPORTANT if you have Python and you want to format the output like in this example
#for Node.js if you have npm json installed instead of 'python -mjson.tool' use | json
-> % curl -LH "Accept: application/rdf+xml;q=0.5, application/vnd.citationstyles.csl+json;q=1.0" http://dx.doi.org/10.1002/asi.23179 | python -mjson.tool
{
    "subtitle": [
        "Measuring Academic Influence"
    ],
    "issued": {
        "date-parts": [
            [2014, 5, 21]
        ]
    },
    "score": 1.0,
    "prefix": "http:\/\/id.crossref.org\/prefix\/10.1002",
    "author": [
        {"family": "Zhu", "given": "Xiaodan"},
        {"family": "Turney", "given": "Peter"},
        {"family": "Lemire", "given": "Daniel"},
        {"family": "Vellino", "given": "Andr\u00e9"}
    ],
    "container-title": "Journal of the Association for Information Science and Technology",
    "reference-count": 79,
    "page": "408-427",
    "deposited": {
        "date-parts": [
            [2015, 1, 20]
        ],
        "timestamp": 1421712000000
    },
    "issue": "2",
    "funder": [
        {
            "award": ["26143"],
            "name": "Natural Sciences and Engineering Research Council of Canada",
            "DOI": "10.13039\/501100000038"
        }
    ],
    "title": "Measuring academic influence: Not all citations are equal",
    "type": "journal-article",
    "DOI": "10.1002\/asi.23179",
    "ISSN": ["2330-1635"],
    "URL": "http:\/\/dx.doi.org\/10.1002\/asi.23179",
    "source": "CrossRef",
    "publisher": "Wiley-Blackwell",
    "indexed": {
        "date-parts": [
            [2015, 1, 23]
        ],
        "timestamp": 1421971998759
    },
    "volume": "66",
    "member": "http:\/\/id.crossref.org\/member\/311"
}
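Once the JSON is there, single fields can be pulled out with the same grep habit used earlier. The sketch below lists the authors' family names; family_names is a name I made up, the sample is cut from the JSON response in this post, and the approach is plain text matching rather than a real JSON parser, so treat it as a quick hack. It assumes a grep that supports -o (GNU, BSD and BusyBox grep do).

```shell
#!/bin/sh
# Print every "family" value found in CSL JSON text on stdin.
# Plain text matching only: fine for a quick look, not a JSON parser.
family_names() {
    grep -o '"family": *"[^"]*"' | sed 's/^"family": *"//; s/"$//'
}

# Sample cut from the JSON response shown in this post.
sample='{"author":[{"family":"Zhu","given":"Xiaodan"},{"family":"Turney","given":"Peter"},{"family":"Lemire","given":"Daniel"},{"family":"Vellino","given":"Andr\u00e9"}]}'

printf '%s\n' "$sample" | family_names
```

The optional space in the pattern lets it match both the compact response from the server and the pretty-printed output of python -mjson.tool.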
Voilà!!! This was my long journey to find out how to retrieve a DOI resource.
Challenge:
Starting from the previous example, try to retrieve the XML version instead. But you must output only the lines that include the name 'Vellino' or 'vellino', along with their line numbers.