Sciencemadness Discussion Board - Harvesting scanned books from HathiTrust



Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 27-6-2009 at 13:29

Harvesting scanned books from HathiTrust

I recently mentioned in an older thread that the HathiTrust provides access to many books that Google has scanned in partnership with universities. They are more aggressive about legally clearing material for the public domain, so you can get complete text and page images for some books that Google has scanned but does not offer for download through the Google Books site. For example, Google does not provide complete access for many US government publications or post-1923 books that are actually in the public domain according to US copyright law.

The problem is that the HathiTrust does not offer a convenient way to download full books. The closest they come is letting you download a 10 page segment from a book in reduced-resolution PDF format.

But the HathiTrust does offer an HTTP data API that can fetch page images and text for public domain volumes. I have written a 'HathiHelper' program that uses this data API to retrieve the complete, high-resolution page data for a public domain HathiTrust volume. It automatically handles retries in case of bad downloads, automatically names volumes and pages, and allows you to stop and restart downloads at any time.
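As a rough illustration of the retry and stop/restart behaviour described above, here is a minimal sketch. It is not HathiHelper's actual code. The function names, the file naming scheme, and the `.png` extension are assumptions made for the example, and `fetch` stands in for whatever call actually retrieves a page.

```python
import os
import time


def page_filename(volume_id, seq):
    """Build a predictable file name (hypothetical naming scheme)."""
    return "%s_%05d.png" % (volume_id.replace("/", "_"), seq)


def fetch_page(fetch, dest_path, retries=3, delay=0.0):
    """Save one page unless it already exists; return True if it was fetched.

    `fetch` is any zero-argument callable returning the page bytes.
    """
    if os.path.exists(dest_path):
        return False  # finished on an earlier run, so a restart skips it
    last_error = None
    for _ in range(retries):
        try:
            data = fetch()
            if not data:
                raise OSError("empty response")
            tmp_path = dest_path + ".part"
            with open(tmp_path, "wb") as handle:
                handle.write(data)
            # Rename only after a complete write, so an interrupted run
            # never leaves a half-written page under the final name.
            os.replace(tmp_path, dest_path)
            return True
        except OSError as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError("gave up on %s: %s" % (dest_path, last_error))
```

Because finished pages are detected by file name, interrupting the program and running it again only downloads what is still missing.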

It is only a command line program, and it requires a Python interpreter to run, which means it will be a little unfamiliar to many computer users. I have put together a download and tutorial page showing how to use it in Windows. It should be usable under Windows and any Unix-like environment including Mac OS X.

PGP Key and corresponding e-mail address

kmno4

International Hazard

Posts: 1496
Registered: 1-6-2005
Location: Silly, stupid country
Member Is Offline

Mood: No Mood

posted on 28-6-2009 at 03:11

If you are a complete layman in programming (as I am) but want to install a Python interpreter (plus a few more things), go to:
http://niche.uwo.ca/programming-historian/index.php/Getting_...
Caution: Firefox must be installed.

P.S. In my case, version 2.5 works well; higher versions of Python (3.0) do not work [I do not know why; I am not a programmer, sadly].

[Edited on 28-6-2009 by kmno4]

pantone159

National Hazard

Posts: 590
Registered: 27-6-2006
Location: Austin, TX, USA
Member Is Offline

Mood: desperate for shade

posted on 28-6-2009 at 09:52

Cool, I tried this and it seemed to work ok.

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 28-6-2009 at 10:39

kmno4, do you have public domain access to the book I used on the example page? I have not checked if/how the HathiTrust is changing access permissions based on IP address. On one of their pages I read that they may narrow public domain materials using IP address geolocation since copyright restrictions may be greater outside the US.

Under a Unix-like environment, the program will automatically pick up HTTP proxy environment variables and use the proxy for downloads, in case you need to use a US-located IP address. I do not know what the equivalent setting would be in Windows, though.

EDIT: I just reread the urllib2 docs and found that it will get proxy settings from the Internet Options specified in the Windows registry. If you set up a proxy using the internet options dialog of Internet Explorer it should affect HathiHelper too.
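To see which proxy settings Python would pick up, something like the following sketch can help. It uses the Python 3 module `urllib.request` (the successor to `urllib2`). The helper name `proxy_opener` and the example proxy address in the usage note are made up for illustration.

```python
import urllib.request


def proxy_opener(proxies=None):
    """Return (opener, proxies) using the given or auto-detected proxies.

    With proxies=None, urllib.request.getproxies() looks at environment
    variables such as http_proxy and, on Windows and macOS, at the
    system proxy configuration as well.
    """
    if proxies is None:
        proxies = urllib.request.getproxies()
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), proxies
```

Calling `proxy_opener()` with no argument and printing the returned dictionary shows what was detected. Passing an explicit mapping such as `{"http": "http://127.0.0.1:8080"}` forces a particular proxy instead.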

[Edited on 6-28-2009 by Polverone]

PGP Key and corresponding e-mail address

kmno4

International Hazard

Posts: 1496
Registered: 1-6-2005
Location: Silly, stupid country
Member Is Offline

Mood: No Mood

posted on 28-6-2009 at 11:13

If you mean "Hydrogenation of fatty oils, 1951" - yes, I have access from my IP. For books from Google Books, whether I have access depends on the proxy. For HathiTrust it is hard to say (for now) whether it works the same way....

pantone159

National Hazard

Posts: 590
Registered: 27-6-2006
Location: Austin, TX, USA
Member Is Offline

Mood: desperate for shade

posted on 28-6-2009 at 13:06

Quote: Originally posted by kmno4

I didn't need any of the extra stuff besides Python itself; I use a 2.x version.
http://www.python.org/

This is OT, but I find Python very useful and like it a lot.

raaz

Harmless

Posts: 4
Registered: 19-6-2011
Member Is Offline

Mood: No Mood

posted on 19-6-2011 at 07:23

HathiHelper is not working; an error comes up at line 52.

please help me

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 19-6-2011 at 15:54

Quote: Originally posted by raaz

HathiHelper is not working; an error comes up at line 52.

please help me

Be more specific. What operating system, Python version, and HathiHelper version are you using? Show how you are trying to use the tool and the exact error message that you get.

I thought that maybe the APIs had changed so the software needed updating, but I just tried it and successfully completed a book download.

PGP Key and corresponding e-mail address

raaz

Harmless

Posts: 4
Registered: 19-6-2011
Member Is Offline

Mood: No Mood

posted on 19-6-2011 at 16:44

Hello Sir,
I am on Windows XP Pro, with Python 3.2 installed on the C: drive. I followed all your instructions from the tutorial. It worked fine for one day, but since yesterday an error has been coming up.
In the Command Prompt:

c:\python32>python hathihelper30.py -m -i mdp.39015002013368

Traceback (most recent call last):
  File "hathihelper30.py", line 52, in <module>
    message = ''.join(identify_proc.communicate())
TypeError: sequence item 0: expected str instance, bytes found

c:\python32>

This is what appears in the Command Prompt. Please help.

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 19-6-2011 at 17:08

Quote: Originally posted by raaz

I'm not sure why it worked before and then stopped working. I have added explicit type conversion that should fix the error you saw. Try downloading hathihelper30.py again now that it has my changes.
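For context on that kind of error: in Python 3, `communicate()` on a subprocess returns bytes by default, and `''.join(...)` only accepts str. The exact change made to hathihelper30.py is not shown in this thread, so the following is only a sketch of one way to do the explicit conversion. The function name `run_and_collect` is invented for the example.

```python
import subprocess
import sys


def run_and_collect(cmd):
    """Run a command and return its stdout and stderr as one str."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()  # a pair of bytes objects in Python 3
    # Decode each part before joining; joining bytes with '' raises the
    # "expected str instance, bytes found" TypeError.
    return ''.join(part.decode('utf-8', errors='replace')
                   for part in (out, err))


if __name__ == "__main__":
    print(run_and_collect([sys.executable, "--version"]))
```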

PGP Key and corresponding e-mail address

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 16-4-2012 at 15:24

According to their March 2012 updates, HathiTrust data access is going to get more restrictive:

Quote:

Over the next several months HathiTrust will be implementing security enhancements to the Data API. The enhancements will require developers using the API to acquire an OAuth 1.0 access key that identifies them, and a secret key that must be used to “sign” URLs to retrieve HathiTrust resources via the Data API. HathiTrust will also provide a Web client that employs a user’s login credentials as a proxy for these keys to facilitate non-programmatic uses. In March, staff at the University of Michigan integrated 2-legged OAuth into the Data API and began to develop the Data API client. Once OAuth is released, there will be an approximately 6-month transition period, ending October 1, 2012, during which signed access to the Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key retrieved from HathiTrust. Complete documentation of the security enhancements and methods of obtaining keys and accessing the Web client is forthcoming. OAuth is planned for release in April 2012.

You may want to complete any archival activities sooner rather than later. I will update my tools if there is an option for the hoi polloi to obtain access keys. I suspect that there won't be and my next iteration will have to be a rewrite that scrapes PDF from the web viewer instead of using a proper API.
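For anyone wondering what "signing" a request means in the announcement quoted above, here is a sketch of the generic OAuth 1.0 HMAC-SHA1 signature calculation (RFC 5849) for the 2-legged case, where the token secret is empty. It is not based on HathiTrust's documentation, which had not been published at the time of this post. The function name, URL, key, and secret are placeholders, and the URL is assumed to be already normalized with no query string.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def _enc(value):
    """Percent-encode as OAuth 1.0 requires (only A-Z a-z 0-9 - . _ ~ kept)."""
    return quote(str(value), safe='~')


def sign_request(method, url, params, consumer_secret, token_secret=''):
    """Return the base64 HMAC-SHA1 signature for the given request."""
    # Encode names and values, then sort by name and then by value.
    pairs = sorted((_enc(k), _enc(v)) for k, v in params.items())
    normalized = '&'.join('%s=%s' % pair for pair in pairs)
    base_string = '&'.join([method.upper(), _enc(url), _enc(normalized)])
    key = '%s&%s' % (_enc(consumer_secret), _enc(token_secret))
    digest = hmac.new(key.encode('ascii'), base_string.encode('ascii'),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode('ascii')
```

The result would normally be sent as the `oauth_signature` parameter alongside the others. The server repeats the same calculation with its copy of the secret and rejects the request if the two values differ.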

PGP Key and corresponding e-mail address

Bot0nist

International Hazard

Posts: 1559
Registered: 15-2-2011
Location: Right behind you.
Member Is Offline

Mood: Stretching my cotyledons.

posted on 16-4-2012 at 15:27

Damn, that sucks. I wonder what prompted this new tightening of policy.

U.T.F.S.E. and learn the joys of autodidacticism!

Don't judge each day only by the harvest you reap, but also by the seeds you sow.

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 16-4-2012 at 15:38

Quote: Originally posted by Bot0nist

Damn, that sucks. I wonder what prompted this new tightening of policy.

Quite likely, the very existence and use of hathihelper.py. According to the HathiTrust librarian I talked with a while ago, Google has imposed an asinine condition on the HathiTrust that Google-scanned books can be read but not downloaded. This is ridiculous on a few levels:

1, that after making this huge effort to make public-domain works more accessible they're trying to lock down distribution.

2, that Google Books itself allows users (at least in the US) to download public domain books as full PDFs with one click.

3, that they think it is possible even in principle to make books "visible but not copyable."

But having signed this stupid agreement, HathiTrust most likely has to make the stupid effort to impede use if their logs show people downloading full books through the API.

PGP Key and corresponding e-mail address

bbartlog

International Hazard

Posts: 1139
Registered: 27-8-2009
Location: Unmoored in time
Member Is Offline

Mood: No Mood

posted on 17-4-2012 at 05:35

Interesting. Given that the books in question really *are* public domain works, i.e. there is no legal reason whatever that I shouldn't be able to download entire copies, it should be allowable to do some mass use of hathihelper before the deadline.

The less you bet, the more you lose when you win.

iHME

Harmless

Posts: 30
Registered: 29-10-2008
Location: the arctic circle
Member Is Offline

Mood: No Mood

posted on 17-4-2012 at 09:17

The script fails to work for me. I get this when attempting to run the script:

D:\Python27>python hathihelper30.py -m -i uva.x004816338
Traceback (most recent call last):
  File "hathihelper30.py", line 6, in <module>
    import urllib.request, urllib.error, urllib.parse
ImportError: No module named request

I installed Python 2.7.3 from the link provided on the page.
I'll install Python 3.0 and edit if anything improves.

Ran it on Python 3.0.1 and it works like a charm now.
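The ImportError above is what Python 2 reports when a script written for Python 3 tries to import `urllib.request`, which only exists in Python 3. A script that needs to run under both can fall back to the older module name. This is a generic sketch, not something taken from HathiHelper, and `interpreter_label` is just an illustrative helper.

```python
import sys

try:
    import urllib.request as urlrequest  # Python 3 name
except ImportError:
    import urllib2 as urlrequest  # Python 2 fallback


def interpreter_label():
    """Describe the running interpreter, e.g. for a bug report."""
    return "Python %d.%d" % sys.version_info[:2]


if __name__ == "__main__":
    print(interpreter_label(), "using", urlrequest.__name__)
```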

[Edited on 17-4-2012 by iHME]

It's a catastrophic success.

AJKOER

Radically Dubious

Posts: 3026
Registered: 7-5-2011
Member Is Offline

Mood: No Mood

posted on 2-5-2012 at 05:32

Thank you, Google, for clearing up this matter: just a few confused little people spouting some nonsense about accessing books in the public domain without your blessing as to the form of access. By the way, exactly who paid to have those books scanned, or is that something we shouldn't be talking about?

For the little people: on all those Google books for sale with censored pages, try using the book search feature and you can still gather a sentence or two of valuable info for free (but don't overdo it, Google may not approve).

[Edited on 2-5-2012 by AJKOER]

cooljoebay

Harmless

Posts: 1
Registered: 30-1-2013
Member Is Offline

Mood: No Mood

posted on 30-1-2013 at 05:57

I am seeing an HTTP error 403 Forbidden. Is it no longer possible to capture the images?

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 30-1-2013 at 15:44

The API has changed to require authentication. The HathiHelper no longer works as written. You can request an API key but you probably won't get it if your stated reason is "so I can download complete books." At some point, hopefully before I die of old age, I'll write a similar tool to scrape the page turner content that's delivered in your browser.
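A script can at least tell this situation apart from an ordinary failed download by checking the HTTP status code. This is a generic sketch using Python 3's `urllib`. The function name `fetch_or_explain` and the wording of the message are invented for the example.

```python
import urllib.error
import urllib.request


def fetch_or_explain(url, opener=urllib.request.urlopen):
    """Return the response body, or raise a clearer error on 401/403."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            # Retrying will not help: the server wants credentials.
            raise PermissionError(
                "server refused %s with HTTP %d; the request probably "
                "needs authentication" % (url, err.code)) from err
        raise  # other HTTP errors are passed through unchanged
```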

PGP Key and corresponding e-mail address

Dr.Bob

International Hazard

Posts: 2732
Registered: 26-1-2011
Location: USA - NC
Member Is Offline

Mood: No Mood

posted on 31-1-2013 at 14:17

It is not easy, but it is possible to take screenshots of the text in a large window, dump them into files, and then convert them back into a PDF. But that takes an enormous amount of effort compared to just copying a PDF file. You could also copy screenshots into a Word document, then print it and scan it in. Again, this requires a large monitor or a large virtual window to have enough pixel resolution to make it work, but it can work.

The other alternative is to ask each person on Sci Mad to find one useful book and scan it in, and share it, especially when they are in the public domain. Even the rarest books are available in some library somewhere and thus able to be copied in some way. With phone cameras improving, I don't think it will be long before a simple photo will be good enough resolution to make a useful PDF file. But most of mine are not yet there.

Maniax

Harmless

Posts: 1
Registered: 3-6-2013
Member Is Offline

Mood: No Mood

posted on 3-6-2013 at 13:12

Hi all,

I've found another tool for downloading books from hathitrust.org which works fine for me. You can download either PDF files or images. At the very end the tool creates a single PDF file. It's called "Hathi Download Helper". At the moment both the source code and an installer for Windows are available.

Here is the link:
http://qt-apps.org/content/show.php/Hathi+Download+Helper?content=158702

gsd

National Hazard

Posts: 847
Registered: 18-8-2005
Member Is Offline

Mood: No Mood

posted on 4-6-2013 at 08:05

Quote: Originally posted by Maniax

Thanks for sharing. It works like a dream.

Very nice!

gsd

Salmo

Harmless

Posts: 42
Registered: 20-9-2012
Member Is Offline

Mood: No Mood

posted on 4-6-2013 at 14:57

thanks maniax! crazy program!

hyfalcon

International Hazard

Posts: 1003
Registered: 29-3-2012
Member Is Offline

Mood: No Mood

posted on 5-6-2013 at 07:09

I've been at it now for 24hrs. I didn't know about this resource. Great little app.

-------

Only complaint I've got is, the file sizes are HUGE!

[Edited on 5-6-2013 by hyfalcon]

Polverone

Now celebrating 21 years of madness

Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

posted on 5-6-2013 at 12:44

Great app. Glad I don't need to write it. Expect HathiTrust to play more cat-and-mouse games with access if it becomes popular. Scripting Firefox to visit every page in a volume, in random order and with delays if necessary, and running requests through a caching proxy should be pretty much unstoppable, if it comes to that.

To make compact joined PDFs you will need to post-process the images. All the images downloaded are continuous-tone, but there is very little extra visual information provided by continuous-tone representation for pages that just have text and/or line diagrams. Text-like images, the ones saved as PNG in the download images directory, need to be converted to bitonal (black and white) images, with JBIG2 or G3/G4 compression in the PDF. This will save a lot of space. You will notice that the PDF file produced by this program is larger than the sum of the input images; it should actually be smaller. But if you just want to quickly create a file that is a little easier to read, the built-in conversion is OK.
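To show why bitonal conversion saves so much space even before JBIG2 or G4 compression is applied, here is a toy sketch in plain Python. It assumes the page has already been decoded into rows of 8-bit gray values. The fixed threshold of 128 is an arbitrary choice for the example, and real post-processing would use an image library and a proper encoder, neither of which is shown.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit gray pixels to 0 (black) or 1 (white)."""
    return [[1 if px >= threshold else 0 for px in row]
            for row in gray_rows]


def pack_bits(bit_rows):
    """Pack each row 8 pixels per byte, MSB first, zero-padded at the end."""
    packed = bytearray()
    for row in bit_rows:
        for start in range(0, len(row), 8):
            chunk = row[start:start + 8]
            chunk = chunk + [0] * (8 - len(chunk))
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            packed.append(byte)
    return bytes(packed)
```

One byte per pixel becomes one bit per pixel, an eightfold reduction before any compression. Bitonal codecs then shrink the long runs of white found on text pages much further.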

PGP Key and corresponding e-mail address

bfesser

Resident Wikipedian

Thread Topped
23-7-2013 at 21:14

German

Harmless

Posts: 44
Registered: 13-5-2009
Member Is Offline

Mood: No Mood

posted on 26-6-2014 at 17:34

Dude, just use libgen.info

It's a Russian site I've been using for years. They have all the Google scans.
