Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Harvesting scanned books from HathiTrust
I recently mentioned in an older thread that the HathiTrust provides access to many books that Google has scanned in partnership with universities.
They are more aggressive about legally clearing material for the public domain, so you can get complete text and page images for some books that
Google has scanned but does not offer for download through the Google Books site. For example, Google does not provide complete access for many US
government publications or post-1923 books that are actually in the public domain according to US copyright law.
The problem is that the HathiTrust does not offer a convenient way to download full books. The closest they come is letting you download a 10 page
segment from a book in reduced-resolution PDF format.
But the HathiTrust does offer an HTTP data API that can fetch page images and text for public domain volumes. I have written a 'HathiHelper' program
that will use this data API to retrieve the complete, high-resolution page data for a public domain HathiTrust volume. It automatically handles
retries in case of bad downloads, automatically names volumes and pages, and allows you to stop and restart downloads at any time.
It is only a command line program, and it requires a Python interpreter to run, which means it will be a little unfamiliar to many computer users. I
have put together a download and tutorial page showing how to use it in Windows. It should be usable under Windows and any Unix-like environment including Mac OS X.
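For readers curious how the retry and stop/restart behaviour described above can work, here is a minimal sketch. It is not the HathiHelper source: the function names, the file naming scheme, and the list of page URLs are assumptions made for illustration, and the real Data API endpoints are not shown in this thread.

```python
import os
import time
import urllib.request


def http_get(url):
    """Fetch one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def page_filename(volume_id, seq):
    """Hypothetical naming scheme: volume id plus zero-padded page number."""
    return "{}_{:04d}.png".format(volume_id.replace("/", "_"), seq)


def fetch_with_retries(fetch_once, url, attempts=3, delay=2.0):
    """Call fetch_once(url) until it returns non-empty data or attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            data = fetch_once(url)
            if data:
                return data
            last_error = ValueError("empty response")
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError("giving up on {}".format(url)) from last_error


def download_volume(volume_id, page_urls, out_dir,
                    fetch_once=http_get, attempts=3, delay=2.0):
    """Download every page, skipping files that already exist (restartable).

    Returns the number of pages fetched during this run.
    """
    os.makedirs(out_dir, exist_ok=True)
    fetched = 0
    for seq, url in enumerate(page_urls, start=1):
        path = os.path.join(out_dir, page_filename(volume_id, seq))
        if os.path.exists(path):
            continue  # finished in an earlier run
        data = fetch_with_retries(fetch_once, url, attempts, delay)
        tmp = path + ".part"
        with open(tmp, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)  # only complete files get the final name
        fetched += 1
    return fetched
```

Writing to a `.part` file and renaming it afterwards means an interrupted run never leaves a half-written page under its final name, so rerunning the same command simply continues where it stopped.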
PGP Key and corresponding e-mail address
|
|
kmno4
International Hazard
Posts: 1496
Registered: 1-6-2005
Location: Silly, stupid country
Member Is Offline
Mood: No Mood
|
|
If somebody is a complete layman in programming (as I am) but wants to install a Python interpreter (plus a few more things), go to:
http://niche.uwo.ca/programming-historian/index.php/Getting_...
Caution: an installed Firefox is needed.
P.S. In my case, version 2.5 works well; higher versions of Python (3.0) do not want to work. [I do not know why; I am not a programmer (pity).]
[Edited on 28-6-2009 by kmno4]
|
|
pantone159
National Hazard
Posts: 590
Registered: 27-6-2006
Location: Austin, TX, USA
Member Is Online
Mood: desperate for shade
|
|
Cool, I tried this and it seemed to work ok.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
kmno4, do you have public domain access to the book I used on the example page? I have not checked if/how the HathiTrust is changing access
permissions based on IP address. On one of their pages I read that they may narrow public domain materials using IP address geolocation since
copyright restrictions may be greater outside the US.
Under a Unix-like environment, the program will automatically pick up HTTP proxy environment variables and use the proxy for downloads, in case you
should need to use a US-located IP address. I do not know what the equivalent setting would be in Windows, though.
EDIT: I just reread the urllib2 docs and found that it will get proxy settings from the Internet Options specified in the Windows registry. If you set
up a proxy using the internet options dialog of Internet Explorer it should affect HathiHelper too.
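A quick way to check which proxy settings Python will pick up is sketched below. It uses `urllib.request`, the Python 3 successor to `urllib2`; the helper name `opener_with_system_proxies` is made up for this example and is not part of HathiHelper.

```python
import urllib.request


def opener_with_system_proxies():
    """Build an opener that uses whatever proxies the system advertises.

    getproxies() reads the *_proxy environment variables and, on Windows,
    falls back to the Internet Options stored in the registry.
    """
    proxies = urllib.request.getproxies()
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), proxies


if __name__ == "__main__":
    _, detected = opener_with_system_proxies()
    print("Proxies Python will use:", detected or "none")
```

Running the script before and after changing the proxy settings shows whether the change is visible to Python.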
[Edited on 6-28-2009 by Polverone]
|
|
kmno4
International Hazard
Posts: 1496
Registered: 1-6-2005
Location: Silly, stupid country
Member Is Offline
Mood: No Mood
|
|
If you mean "Hydrogenation of fatty oils, 1951", then yes, I have access from my IP. For books from Google Books, whether I have access depends on the
proxy. For HathiTrust it is hard to say (for now) whether it works the same way.
|
|
pantone159
National Hazard
Posts: 590
Registered: 27-6-2006
Location: Austin, TX, USA
Member Is Online
Mood: desperate for shade
|
|
I didn't need any of the extra stuff besides Python itself; I use a 2.x version.
http://www.python.org/
This is OT, but I find Python very useful, I like it a lot.
|
|
raaz
Harmless
Posts: 4
Registered: 19-6-2011
Member Is Offline
Mood: No Mood
|
|
HathiHelper is not working; an error comes up that mentions line 52.
Please help me.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Be more specific. What operating system, Python version, and HathiHelper version are you using? Show how you are trying to use the tool and the exact
error message that you get.
I thought that maybe the APIs had changed so the software needed updating, but I just tried it and successfully completed a book download.
|
|
raaz
Harmless
Posts: 4
Registered: 19-6-2011
Member Is Offline
Mood: No Mood
|
|
Hello Sir,
I am on Windows XP Pro with Python 3.2 installed on the C: drive, and I followed all your instructions from the tutorial. It worked fine for one day, but since yesterday
an error comes up.
At the command prompt:
c:\python32>python hathihelper30.py -m -i mdp.39015002013368
Traceback (most recent call last):
File "hathihelper30.py", line 52, in <module>
message = ''.join(identify_proc.communicate())
TypeError: sequence item 0: expected str instance, bytes found
c:\python32>
This is what appears at the command prompt. Please help.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Quote: Originally posted by raaz | Hello Sir,
I am on Windows XP Pro with Python 3.2 installed on the C: drive, and I followed all your instructions from the tutorial. It worked fine for one day, but since yesterday
an error comes up.
At the command prompt:
c:\python32>python hathihelper30.py -m -i mdp.39015002013368
Traceback (most recent call last):
File "hathihelper30.py", line 52, in <module>
message = ''.join(identify_proc.communicate())
TypeError: sequence item 0: expected str instance, bytes found
c:\python32>
This is what appears at the command prompt. Please help. |
I'm not sure why it worked before and then stopped working. I have added explicit type conversion that should fix the error you saw. Try downloading
hathihelper30.py again now that it has my changes.
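For anyone hitting the same TypeError elsewhere: in Python 3, `communicate()` returns bytes, and `''.join()` only accepts str. The sketch below shows one way to do the explicit conversion. It is an illustration only; the actual change made to hathihelper30.py is not shown in this thread, and `run_and_capture` is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return its stdout and stderr joined as one str."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    parts = proc.communicate()  # a (stdout, stderr) tuple of bytes in Python 3
    # Decode each part explicitly before joining; ''.join() on raw bytes
    # raises "expected str instance, bytes found".
    return "".join(part.decode("utf-8", errors="replace") for part in parts)


if __name__ == "__main__":
    print(run_and_capture([sys.executable, "-c", "print('hello')"]))
```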
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
According to their March 2012 updates, HathiTrust data access is going to get more restrictive:
Quote: | Over the next several months HathiTrust will be implementing security enhancements to the Data API. The enhancements will require developers using the
API to acquire an OAuth 1.0 access key that identifies them, and a secret key that must be used to “sign” URLs to retrieve HathiTrust resources
via the Data API. HathiTrust will also provide a Web client that employs a user’s login credentials as a proxy for these keys to facilitate
non-programmatic uses. In March, staff at the University of Michigan integrated 2-legged OAuth into the Data API and began to develop the Data API
client. Once OAuth is released, there will be an approximately 6-month transition period, ending October 1, 2012, during which signed access to the
Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key retrieved
from HathiTrust. Complete documentation of the security enhancements and methods of obtaining keys and accessing the Web client is forthcoming. OAuth
is planned for release in April 2012. |
You may want to complete any archival activities sooner rather than later. I will update my tools if there is an option for the hoi polloi to obtain
access keys. I suspect that there won't be and my next iteration will have to be a rewrite that scrapes PDF from the web viewer instead of using a
proper API.
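To give an idea of what "signing" a URL with two-legged OAuth 1.0 involves, here is a generic HMAC-SHA1 sketch that follows the OAuth 1.0 specification using only the standard library. The announcement does not say which parameters or signature method HathiTrust will require, so the URL, keys, and query parameters below are placeholders, not the real Data API.

```python
import base64
import hashlib
import hmac
import time
import uuid
from urllib.parse import quote, urlencode


def _enc(value):
    """Percent-encode as OAuth 1.0 requires (only unreserved characters kept)."""
    return quote(str(value), safe="~")


def sign_url(base_url, params, consumer_key, consumer_secret,
             nonce=None, timestamp=None):
    """Return base_url with query parameters plus a two-legged OAuth signature.

    base_url must not already contain a query string.
    """
    oauth = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": nonce or uuid.uuid4().hex,
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(int(time.time()) if timestamp is None else timestamp),
        "oauth_version": "1.0",
    }
    all_params = {**params, **oauth}
    # Normalised parameter string: encoded pairs, sorted, joined with "&".
    pairs = sorted((_enc(k), _enc(v)) for k, v in all_params.items())
    normalized = "&".join("{}={}".format(k, v) for k, v in pairs)
    base_string = "&".join(["GET", _enc(base_url), _enc(normalized)])
    # Two-legged OAuth has no token secret, so the key ends with a bare "&".
    key = _enc(consumer_secret) + "&"
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"),
                      hashlib.sha1).digest()
    all_params["oauth_signature"] = base64.b64encode(digest).decode("ascii")
    return base_url + "?" + urlencode(all_params, quote_via=quote)


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    print(sign_url("https://example.org/api/page", {"seq": "1"},
                   "my-access-key", "my-secret-key"))
```

The server recomputes the same signature from the request and its copy of the secret key, so a request without a valid key pair can be rejected.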
|
|
Bot0nist
International Hazard
Posts: 1559
Registered: 15-2-2011
Location: Right behind you.
Member Is Offline
Mood: Streching my cotyledons.
|
|
Damn, that sucks. I wonder what prompted this new tightening of policy.
U.T.F.S.E. and learn the joys of autodidacticism!
Don't judge each day only by the harvest you reap, but also by the seeds you sow.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Quite likely, the very existence and use of hathihelper.py. According to the HathiTrust librarian I talked with a while ago, Google has imposed an
asinine condition on the HathiTrust that Google-scanned books can be read but not downloaded. This is ridiculous on a few levels:
1, that after making this huge effort to make public-domain works more accessible they're trying to lock down distribution.
2, that Google Books itself allows users (at least in the US) to download public domain books as full PDFs with one click.
3, that they think it is possible even in principle to make books "visible but not copyable."
But having signed this stupid agreement, HathiTrust most likely has to make the stupid effort to impede use if their logs show people downloading full
books through the API.
|
|
bbartlog
International Hazard
Posts: 1139
Registered: 27-8-2009
Location: Unmoored in time
Member Is Offline
Mood: No Mood
|
|
Interesting. Given that the books in question really *are* public domain works, i.e. there is no legal reason whatever that I shouldn't be able to
download entire copies, it should be allowable to do some mass use of hathihelper before the deadline.
The less you bet, the more you lose when you win.
|
|
iHME
Harmless
Posts: 30
Registered: 29-10-2008
Location: the arctic circle
Member Is Offline
Mood: No Mood
|
|
The script fails to work for me. I get this when attempting to run it:
D:\Python27>python hathihelper30.py -m -i uva.x004816338
Traceback (most recent call last):
File "hathihelper30.py", line 6, in <module>
import urllib.request, urllib.error, urllib.parse
ImportError: No module named request
I installed Python 2.7.3 from the link provided on the page.
I'll install Python 3.0 and edit if anything improves.
Ran it on Python 3.0.1 and it works like a charm now.
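The ImportError above comes from the Python 3 module layout: `urllib.request` does not exist in Python 2.7, where the equivalent functions live in `urllib2`. A script that needs to run on both could use a guarded import like the sketch below. This is a general pattern, not something hathihelper30.py is known to do, and the alias `urlrequest` is just a name chosen for the example.

```python
try:
    # Python 3 layout
    import urllib.request as urlrequest
except ImportError:
    # Python 2 fallback: urllib2 provides urlopen, build_opener, etc.
    import urllib2 as urlrequest

# Either way, the same name is now usable, e.g. urlrequest.urlopen(url)
```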
[Edited on 17-4-2012 by iHME]
It's a catastrophic success.
|
|
AJKOER
Radically Dubious
Posts: 3026
Registered: 7-5-2011
Member Is Offline
Mood: No Mood
|
|
Thank you, Google, for clearing up this matter; just a few confused little people spouting some nonsense about accessing books in the public domain
without your blessing as to the form of access. By the way, exactly who paid to have those books scanned, or is that something we shouldn't be talking
about?
For the little people: on all those Google books for sale with censored pages, try using the book search feature and you can still gather a sentence
or two of valuable info for free (but don't overdo it; Google may not approve).
[Edited on 2-5-2012 by AJKOER]
|
|
cooljoebay
Harmless
Posts: 1
Registered: 30-1-2013
Member Is Offline
Mood: No Mood
|
|
I am seeing an HTTP 403 Forbidden error. Is it no longer possible to capture the images?
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
The API has changed to require authentication. The HathiHelper no longer works as written. You can request an API key but you probably won't get it if
your stated reason is "so I can download complete books." At some point, hopefully before I die of old age, I'll write a similar tool to scrape the
page turner content that's delivered in your browser.
|
|
Dr.Bob
International Hazard
Posts: 2732
Registered: 26-1-2011
Location: USA - NC
Member Is Offline
Mood: No Mood
|
|
It is not easy, but it is possible to take screenshots of the page text in a large window, dump them into a file, and then convert that back into
a PDF. But that takes an enormous amount of effort compared to just copying a PDF file. You could also copy screenshots into a Word document,
then print it and scan it in. Again, this requires a large monitor or a large virtual window to have enough pixel resolution to make it work, but it
can work.
The other alternative is to ask each person on Sci Mad to find one useful book and scan it in, and share it, especially when they are in the public
domain. Even the rarest books are available in some library somewhere and thus able to be copied in some way. With phone cameras improving, I
don't think it will be long before a simple photo will be good enough resolution to make a useful PDF file. But most of mine are not yet there.
|
|
Maniax
Harmless
Posts: 1
Registered: 3-6-2013
Member Is Offline
Mood: No Mood
|
|
Hi all,
I've found another tool for downloading books from hathitrust.org which works fine for me. You can download either pdf files or images. At the very
end the tool creates a single pdf file. It's called "Hathi Download Helper". At the moment there is the source code as well as an
installer for Windows.
Here is the link:
http://qt-apps.org/content/show.php/Hathi+Download+Helper?content=158702
|
|
gsd
National Hazard
Posts: 847
Registered: 18-8-2005
Member Is Offline
Mood: No Mood
|
|
Thanks for sharing. It works like a dream.
Very nice!
gsd
|
|
Salmo
Harmless
Posts: 42
Registered: 20-9-2012
Member Is Offline
Mood: No Mood
|
|
thanks maniax! crazy program!
|
|
hyfalcon
International Hazard
Posts: 1003
Registered: 29-3-2012
Member Is Offline
Mood: No Mood
|
|
I've been at it now for 24hrs. I didn't know about this resource. Great little app.
-------
Only complaint I've got is, the file sizes are HUGE!
[Edited on 5-6-2013 by hyfalcon]
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Great app. Glad I don't need to write it. Expect HathiTrust to play more cat-and-mouse games with access if it becomes popular. Scripting Firefox to
visit every page in a volume, in random order and with delays if necessary, and running requests through a caching proxy should be pretty much
unstoppable, if it comes to that.
To make compact joined PDFs you will need to post-process the images. All the images downloaded are continuous-tone, but there is very little extra
visual information provided by continuous-tone representation for pages that just have text and/or line diagrams. Text-like images, the ones saved as
PNG in the download images directory, need to be converted to bitonal (black and white) images, with JBIG2 or G3/G4 compression in the PDF. This will
save a lot of space. You will notice that the PDF file produced by this program is larger than the sum of the input images; it should actually be
smaller. But if you just want to quickly create a file that is a little easier to read, the built in conversion is OK.
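To show why bitonal conversion saves so much space, here is a minimal pure-Python sketch of the thresholding step. It assumes the page is already available as rows of 8-bit grayscale values and writes a binary PBM image at 1 bit per pixel; the function name and the fixed threshold of 128 are assumptions for the example. A real workflow would use an image tool for the conversion and then apply JBIG2 or G4 compression when building the PDF, which this sketch does not do.

```python
def to_bitonal_pbm(gray_rows, threshold=128):
    """Pack 8-bit grayscale rows into a binary PBM (P4) image.

    Pixels darker than the threshold become black (bit 1), all others white
    (bit 0). Each row is padded to a whole number of bytes, as PBM requires.
    """
    height = len(gray_rows)
    width = len(gray_rows[0]) if height else 0
    out = bytearray("P4\n{} {}\n".format(width, height).encode("ascii"))
    for row in gray_rows:
        byte, nbits = 0, 0
        for value in row:
            byte = (byte << 1) | (1 if value < threshold else 0)
            nbits += 1
            if nbits == 8:
                out.append(byte)
                byte, nbits = 0, 0
        if nbits:
            out.append(byte << (8 - nbits))  # pad the last byte with zeros
    return bytes(out)


if __name__ == "__main__":
    # A tiny 9x1 "page": alternating black and white pixels.
    row = [0, 255, 0, 255, 0, 255, 0, 255, 0]
    print(to_bitonal_pbm([row]))
```

Even before any compression, each pixel drops from 8 bits (or 24 for colour) to 1 bit, which is why text-only pages shrink so dramatically.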
|
|
bfesser
|
Thread Topped 23-7-2013 at 21:14 |
German
Harmless
Posts: 44
Registered: 13-5-2009
Member Is Offline
Mood: No Mood
|
|
Dude just use libgen.info
It's a Russian site I've been using for years. They have all the google scans.
|
|