Sciencemadness Discussion Board

Harvesting scanned books from HathiTrust

Polverone - 27-6-2009 at 13:29

I recently mentioned in an older thread that the HathiTrust provides access to many books that Google has scanned in partnership with universities. They are more aggressive about legally clearing material for the public domain, so you can get complete text and page images for some books that Google has scanned but does not offer for download through the Google Books site. For example, Google does not provide complete access for many US government publications or post-1923 books that are actually in the public domain according to US copyright law.

The problem is that the HathiTrust does not offer a convenient way to download full books. The closest they come is letting you download a 10 page segment from a book in reduced-resolution PDF format.

But the HathiTrust does offer an HTTP data API that can fetch page images and text for public domain volumes. I have written a 'HathiHelper' program that uses this data API to retrieve the complete, high-resolution page data for a public domain HathiTrust volume. It automatically handles retries in case of bad downloads, automatically names volumes and pages, and allows you to stop and restart downloads at any time.
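The retry and stop/restart behaviour described above can be sketched roughly as follows. This is not the HathiHelper source, only a minimal illustration: the function name `download_pages`, the `.png` extension, and the `fetch` callable (a stand-in for the actual HTTP request) are all assumptions made for the example.

```python
import os
import time


def download_pages(page_ids, fetch, out_dir, max_retries=3, delay=0.0):
    """Save each page to out_dir, skipping pages that are already on disk.

    `fetch` is any callable that takes a page id and returns bytes; it is a
    hypothetical stand-in for the real HTTP request, which is not shown here.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page_id in page_ids:
        target = os.path.join(out_dir, "%s.png" % page_id)
        if os.path.exists(target):
            # Finished in an earlier run, so a restart skips it.
            continue
        for attempt in range(1, max_retries + 1):
            try:
                data = fetch(page_id)
                if not data:
                    raise IOError("empty response for %s" % page_id)
                # Write under a temporary name first so an interrupted run
                # never leaves a half-written file that looks complete.
                partial = target + ".part"
                with open(partial, "wb") as handle:
                    handle.write(data)
                os.replace(partial, target)
                saved.append(page_id)
                break
            except (IOError, OSError):
                if attempt == max_retries:
                    raise
                time.sleep(delay)
    return saved
```

Because finished pages are detected by file name, stopping the script and running it again only fetches what is still missing.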

It is only a command line program, and it requires a Python interpreter to run, which means it will be a little unfamiliar to many computer users. I have put together a download and tutorial page showing how to use it in Windows. It should be usable under Windows and any Unix-like environment including Mac OS X.

kmno4 - 28-6-2009 at 03:11

If you are a complete layman in programming (as I am) but want to install a Python interpreter (plus a few more things), go to:
http://niche.uwo.ca/programming-historian/index.php/Getting_...
Caution: Firefox must be installed.

P.S. In my case, version 2.5 works well; higher versions of Python (3.0) do not work [I do not know why, I am not a programmer (pity)]

[Edited on 28-6-2009 by kmno4]

pantone159 - 28-6-2009 at 09:52

Cool, I tried this and it seemed to work ok.


Polverone - 28-6-2009 at 10:39

kmno4, do you have public domain access to the book I used on the example page? I have not checked if/how the HathiTrust is changing access permissions based on IP address. On one of their pages I read that they may narrow public domain materials using IP address geolocation since copyright restrictions may be greater outside the US.

Under a Unix-like environment, the program will automatically pick up http proxy environment variables and use the proxy for downloads, in case you should need to use a US-located IP address. I do not know what the equivalent setting would be in Windows though :(

EDIT: I just reread the urllib2 docs and found that it will get proxy settings from the Internet Options specified in the Windows registry. If you set up a proxy using the internet options dialog of Internet Explorer it should affect HathiHelper too.
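For anyone who wants to check which proxy settings Python will actually pick up, here is a small sketch. It uses `urllib.request`, the Python 3 successor to the `urllib2` module mentioned above; the proxy address is a made-up placeholder, not a real server.

```python
import os
import urllib.request

# Hypothetical proxy address, used only for illustration.
os.environ["http_proxy"] = "http://proxy.example.com:3128"

# Proxy settings read from *_proxy environment variables only.
env_proxies = urllib.request.getproxies_environment()
print("from environment:", env_proxies.get("http"))

# Environment variables first; on Windows and Mac OS X this falls back to
# the system settings when no such variables are set.
all_proxies = urllib.request.getproxies()
print("effective:", all_proxies.get("http"))

# An opener that uses an explicit set of proxies instead of auto-detection.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(env_proxies))
```

Running this before a download is a quick way to confirm whether the proxy you configured is the one that will be used.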

[Edited on 6-28-2009 by Polverone]

kmno4 - 28-6-2009 at 11:13

If you mean "Hydrogenation of fatty oils, 1951", then yes, I have access from my IP. For books from Google Books, whether I have access depends on the proxy. For HathiTrust it is hard to say (for now) whether it works the same way....

pantone159 - 28-6-2009 at 13:06

Quote: Originally posted by kmno4  
If you are a complete layman in programming (as I am) but want to install a Python interpreter (plus a few more things), go to:
http://niche.uwo.ca/programming-historian/index.php/Getting_...
Caution: Firefox must be installed.

P.S. In my case, version 2.5 works well; higher versions of Python (3.0) do not work [I do not know why, I am not a programmer (pity)]

I didn't need any of the extra stuff besides just Python, I use a 2.x version.
http://www.python.org/

This is OT, but I find Python very useful, I like it a lot.

raaz - 19-6-2011 at 07:23

hathihelper not working, some error coming, called line 52

please help me

Polverone - 19-6-2011 at 15:54

Quote: Originally posted by raaz  
hathihelper not working, some error coming, called line 52

please help me


Be more specific. What operating system, Python version, and HathiHelper version are you using? Show how you are trying to use the tool and the exact error message that you get.

I thought that maybe the APIs had changed so the software needed updating, but I just tried it and successfully completed a book download.

raaz - 19-6-2011 at 16:44

Hello Sir,
I am on Windows XP Pro with Python 3.2 installed on the C: drive. I followed all your instructions from the tutorial. It worked fine for one day, but since yesterday I get an error.
In the command prompt:

c:\python32>python hathihelper30.py -m -i mdp.39015002013368

Traceback (most recent call last):
  File "hathihelper30.py", line 52, in <module>
    message = ''.join(identify_proc.communicate())
TypeError: sequence item 0: expected str instance, bytes found

c:\python32>


This is what comes up in the command prompt. Please help.

Polverone - 19-6-2011 at 17:08

Quote: Originally posted by raaz  
Hello Sir,
I am on Windows XP Pro with Python 3.2 installed on the C: drive. I followed all your instructions from the tutorial. It worked fine for one day, but since yesterday I get an error.
In the command prompt:

c:\python32>python hathihelper30.py -m -i mdp.39015002013368

Traceback (most recent call last):
  File "hathihelper30.py", line 52, in <module>
    message = ''.join(identify_proc.communicate())
TypeError: sequence item 0: expected str instance, bytes found

c:\python32>


This is what comes up in the command prompt. Please help.


I'm not sure why it worked before and then stopped working. I have added explicit type conversion that should fix the error you saw. Try downloading hathihelper30.py again now that it has my changes.
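For the curious, the error can be reproduced and fixed in a few lines. This is only a sketch of the general problem, assuming `identify_proc` is a `subprocess.Popen` object (its `communicate()` call suggests so); the child command below is a harmless stand-in, not the command the script actually runs.

```python
import subprocess
import sys

# Stand-in child process; the real script's external command is not shown here.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# In Python 3, communicate() returns a (stdout, stderr) tuple of bytes objects.
output = proc.communicate()

join_error = None
try:
    # Joining bytes with a str separator is what raises the TypeError.
    message = "".join(output)
except TypeError as error:
    join_error = str(error)

# Explicit conversion: decode each piece to str before joining.
message = "".join(part.decode("utf-8", "replace") for part in output)
print(join_error)
print(message)
```

Under Python 2 the same `''.join(...)` works because the pipe output is already a `str`, which is why the problem only shows up on 3.x.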

Polverone - 16-4-2012 at 15:24

According to their March 2012 updates, HathiTrust data access is going to get more restrictive:

Quote:
Over the next several months HathiTrust will be implementing security enhancements to the Data API. The enhancements will require developers using the API to acquire an OAuth 1.0 access key that identifies them, and a secret key that must be used to “sign” URLs to retrieve HathiTrust resources via the Data API. HathiTrust will also provide a Web client that employs a user’s login credentials as a proxy for these keys to facilitate non-programmatic uses. In March, staff at the University of Michigan integrated 2-legged OAuth into the Data API and began to develop the Data API client. Once OAuth is released, there will be an approximately 6-month transition period, ending October 1, 2012, during which signed access to the Data API will be possible but not required. After October 1, all requests to the Data API will need to be properly signed with an access key retrieved from HathiTrust. Complete documentation of the security enhancements and methods of obtaining keys and accessing the Web client is forthcoming. OAuth is planned for release in April 2012.


You may want to complete any archival activities sooner rather than later. I will update my tools if there is an option for the hoi polloi to obtain access keys. I suspect that there won't be and my next iteration will have to be a rewrite that scrapes PDF from the web viewer instead of using a proper API.
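For reference, "signing" a URL with 2-legged OAuth 1.0 generally looks like the sketch below, which follows the generic HMAC-SHA1 scheme from the OAuth 1.0 specification. The endpoint, the parameter names `id` and `seq`, and the choice to put the OAuth fields in the query string are assumptions for illustration; HathiTrust's own documentation was not yet published and may differ.

```python
import base64
import hashlib
import hmac
import time
import uuid
from urllib.parse import quote, urlencode


def _enc(value):
    """Percent-encode as OAuth 1.0 requires (only unreserved characters stay)."""
    return quote(str(value), safe="-._~")


def sign_url(base_url, params, access_key, secret_key, nonce=None, timestamp=None):
    """Return base_url with its parameters plus an HMAC-SHA1 OAuth 1.0 signature."""
    oauth = {
        "oauth_consumer_key": access_key,
        "oauth_nonce": nonce or uuid.uuid4().hex,
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(timestamp or int(time.time())),
        "oauth_version": "1.0",
    }
    all_params = dict(params, **oauth)
    # Normalised parameter string: encoded pairs sorted by name, then value.
    normalised = "&".join(
        "%s=%s" % pair
        for pair in sorted((_enc(k), _enc(v)) for k, v in all_params.items())
    )
    base_string = "&".join(["GET", _enc(base_url), _enc(normalised)])
    # Two-legged OAuth has no token secret, so the key ends with a bare "&".
    key = _enc(secret_key) + "&"
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    all_params["oauth_signature"] = base64.b64encode(digest).decode("ascii")
    return base_url + "?" + urlencode(all_params, quote_via=quote)


# Hypothetical endpoint and placeholder keys, for illustration only.
print(sign_url("https://example.org/api/page", {"id": "mdp.39015002013368", "seq": 1},
               "my-access-key", "my-secret-key"))
```

The access key travels with the request and identifies the caller, while the secret key never leaves your machine; it is only used to compute the signature.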

Bot0nist - 16-4-2012 at 15:27

Damn, that sucks. I wonder what prompted this new tightening of policy.

Polverone - 16-4-2012 at 15:38

Quote: Originally posted by Bot0nist  
Damn, that sucks. I wonder what prompted this new tightening of policy.


Quite likely, the very existence and use of hathihelper.py. According to the HathiTrust librarian I talked with a while ago, Google has imposed an asinine condition on the HathiTrust that Google-scanned books can be read but not downloaded. This is ridiculous on a few levels:

1, that after making this huge effort to make public-domain works more accessible they're trying to lock down distribution.

2, that Google Books itself allows users (at least in the US) to download public domain books as full PDFs with one click.

3, that they think it is possible even in principle to make books "visible but not copyable."

But having signed this stupid agreement, HathiTrust most likely has to make the stupid effort to impede use if their logs show people downloading full books through the API.

bbartlog - 17-4-2012 at 05:35

Interesting. Given that the books in question really *are* public domain works, i.e. there is no legal reason whatever that I shouldn't be able to download entire copies, it should be allowable to do some mass use of hathihelper before the deadline.

iHME - 17-4-2012 at 09:17

The script fails to work for me. I get this when attempting to run the script:


D:\Python27>python hathihelper30.py -m -i uva.x004816338
Traceback (most recent call last):
  File "hathihelper30.py", line 6, in <module>
    import urllib.request, urllib.error, urllib.parse
ImportError: No module named request

I installed Python 2.7.3 from the link provided on the page.
I'll install Python 3.0 and edit this post if anything improves.

Ran it on Python 3.0.1 and it works like a charm now.
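For anyone hitting the same ImportError: the `urllib.request`, `urllib.error` and `urllib.parse` modules only exist in Python 3, while Python 2 keeps the same functionality in `urllib2` and `urlparse`. A script that should run on both can try one layout and fall back to the other. This is a generic sketch, not something hathihelper30.py is known to do, and the alias names are my own.

```python
import sys

try:
    # Python 3 layout, the one hathihelper30.py imports.
    import urllib.request as urlrequest
    import urllib.error as urlerror
    import urllib.parse as urlparse_mod
except ImportError:
    # Python 2 layout: urllib2 holds both the request and error classes.
    import urllib2 as urlrequest
    import urllib2 as urlerror
    import urlparse as urlparse_mod

print("Running on Python %d.%d" % sys.version_info[:2])
print("urlopen available:", hasattr(urlrequest, "urlopen"))
```

The rest of the script then refers to `urlrequest.urlopen`, `urlerror.HTTPError` and so on, regardless of interpreter version.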



[Edited on 17-4-2012 by iHME]

AJKOER - 2-5-2012 at 05:32

Thank you, Google, for clearing up this matter; just a few confused little people spouting some nonsense about accessing books in the public domain without your blessing as to the form of access. By the way, exactly who paid to have those books scanned, or is that something we shouldn't be talking about?

For the little people: on all those Google books for sale with censored pages, try using the book search feature and you can still gather a sentence or two of valuable info for free (but don't overdo it, Google may not approve).


[Edited on 2-5-2012 by AJKOER]

cooljoebay - 30-1-2013 at 05:57

I am seeing an HTTP 403 Forbidden error. Is it no longer possible to capture the images?

Polverone - 30-1-2013 at 15:44

The API has changed to require authentication. The HathiHelper no longer works as written. You can request an API key but you probably won't get it if your stated reason is "so I can download complete books." At some point, hopefully before I die of old age, I'll write a similar tool to scrape the page turner content that's delivered in your browser.

Dr.Bob - 31-1-2013 at 14:17

It is not easy, but it is possible to take screenshots of the page text in a large window, dump them into a file, and then convert that back into a PDF. That takes an enormous amount of effort compared to just copying a PDF file. You could also paste the screenshots into a Word document, then print it and scan it in. Again, this requires a large monitor or a large virtual window to get enough pixel resolution, but it can work.

The other alternative is to ask each person on Sci Mad to find one useful book, scan it, and share it, especially when it is in the public domain. Even the rarest books are available in some library somewhere and can thus be copied in some way. With phone cameras improving, I don't think it will be long before a simple photo has good enough resolution to make a useful PDF file. But most of mine are not there yet.

Maniax - 3-6-2013 at 13:12

Hi all,

I've found another tool for downloading books from hathitrust.org which works fine for me. You can download either pdf files or images. At the very end the tool creates a single pdf file. It's called "Hathi Download Helper". At the moment there is the source code as well as an installer for Windows.

Here is the link:
http://qt-apps.org/content/show.php/Hathi+Download+Helper?content=158702

gsd - 4-6-2013 at 08:05

Quote: Originally posted by Maniax  
Hi all,

I've found another tool for downloading books from hathitrust.org which works fine for me. You can download either pdf files or images. At the very end the tool creates a single pdf file. It's called "Hathi Download Helper". At the moment there is the source code as well as an installer for Windows.

Here is the link:
http://qt-apps.org/content/show.php/Hathi+Download+Helper?content=158702


Thanks for sharing. It works like a dream.

Very nice!

gsd

Salmo - 4-6-2013 at 14:57

thanks maniax! crazy program!

hyfalcon - 5-6-2013 at 07:09

I've been at it now for 24hrs. I didn't know about this resource. Great little app.

-------


Only complaint I've got is, the file sizes are HUGE!

[Edited on 5-6-2013 by hyfalcon]

Polverone - 5-6-2013 at 12:44

Great app. Glad I don't need to write it. Expect HathiTrust to play more cat-and-mouse games with access if it becomes popular. Scripting Firefox to visit every page in a volume, in random order and with delays if necessary, and running requests through a caching proxy should be pretty much unstoppable, if it comes to that.

To make compact joined PDFs you will need to post-process the images. All the images downloaded are continuous-tone, but there is very little extra visual information provided by continuous-tone representation for pages that just have text and/or line diagrams. Text-like images, the ones saved as PNG in the download images directory, need to be converted to bitonal (black and white) images, with JBIG2 or G3/G4 compression in the PDF. This will save a lot of space. You will notice that the PDF file produced by this program is larger than the sum of the input images; it should actually be smaller. But if you just want to quickly create a file that is a little easier to read, the built in conversion is OK.
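To show what the bitonal conversion amounts to, here is a dependency-free sketch that thresholds grey values and packs them 8 pixels to a byte, written out as a binary PBM (P4) image. It is only an illustration of the idea: the fixed threshold of 128 and the function name are assumptions, and a real workflow would use an image tool to read the PNGs and a JBIG2 or G4 encoder to compress the result further.

```python
def to_bitonal_pbm(gray_rows, threshold=128):
    """Pack rows of 0-255 grey values into a binary PBM (P4) image.

    Pixels darker than `threshold` become black (bit 1), the rest white (bit 0).
    """
    height = len(gray_rows)
    width = len(gray_rows[0]) if height else 0
    body = bytearray()
    for row in gray_rows:
        byte, bits = 0, 0
        for value in row:
            # Shift in one bit per pixel, most significant bit first.
            byte = (byte << 1) | (1 if value < threshold else 0)
            bits += 1
            if bits == 8:
                body.append(byte)
                byte, bits = 0, 0
        if bits:
            # Pad the last byte of each row with white pixels on the right.
            body.append(byte << (8 - bits))
    header = ("P4\n%d %d\n" % (width, height)).encode("ascii")
    return header + bytes(body)


# Tiny made-up "page": 2 rows of 10 pixels, dark text on a light background.
page = [[0] * 8 + [255] * 2, [255] * 10]
pbm = to_bitonal_pbm(page)
print(len(pbm), "bytes including header")
```

Even before any compression, one bit per pixel is an eighth of the size of 8-bit greyscale, which is where most of the space saving for text pages starts.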

German - 26-6-2014 at 17:34

Dude just use libgen.info

It's a Russian site I've been using for years. They have all the google scans.

arkoma - 26-6-2014 at 18:02

oh wow--libgen is cool. thanx

German - 26-6-2014 at 18:11

Yep, libgen.info has any book you could ever want, completely free. Millions of them. From brand new to very old. Even graduate level textbooks. Best site on the internet, bar none.

arkoma - 26-6-2014 at 18:23

shit homeboy--don't be such a stranger around here......you upped my game already LOL

chris893 - 24-9-2014 at 16:51

how do you use it? Every time I click on a book, I get an error message saying that it can't be found?

Denize1 - 24-4-2016 at 19:45

HathiTrust Downloader is not working. Is there a fix?

Mush - 9-12-2017 at 05:52

Quote: Originally posted by Denize1  
HathiTrust Downloader is not working. Is there a fix?


https://sourceforge.net/projects/hathidownloadhelper/


Description

*************************
2017-07-20 PLEASE NOTE:
Due to an update to hathitrust website Hathi Download Helper 1.1.3 is not operable anymore.
Please update to version 1.1.4
*************************

Mitigator - 25-6-2018 at 05:19

Yeah, libgen is cool and (in some places) illegal. But they have lost a few domains, and the current ones are:
https://libgen.pw/ (I used this last year; only here could I find the updated "CRC Handbook of Chemistry and Physics 2016-2017")
http://libgen.io/ (wow, this one looks better and offers more options, going to use this now instead of .pw)

The above version of the book has only the chapters bookmarked, but the 2015-2016 version has every subchapter bookmarked as well, which makes it much easier to browse.

Also, many of you have probably noticed that archive.org offers so many legally scanned books, but not as free downloads, only for borrowing (a 2-week online preview, or a temporary encrypted download viewable with Adobe Digital Editions).

So I figured out 2 workarounds for getting those books downloaded.

The first method is one I discovered myself, and it is only recommended if you want a high-quality book (high-resolution pages) or if the 2nd method doesn't work (Adobe Digital Editions doesn't work on some VPNs, IPs, proxies, or weird configurations).

For 1st method you simply browse borrowed book online and using nirsoft chromecacheview you can see all those images stored as cache jpg or jp2 files in cache folder. Just copy them using that program. Depending on your online preview size images quality will vary. So to get highest quality just use extension Resource Override, and let it automatically replace any letters showing resolution in url with it removed, something like "aaa-200x200.jpg" with "aaa.jpg" of course using pattern like "aaa-*.jpg" to be replaced with aaa.jpg. To find real images direct url just use inspect element - network tab and try loading next page and it will appear on list as jpg, or that cache viewer and search for something like name of book or domain archive.org, eventually you'll find one image sample and see its url. Only pattern for end of such urls have to be replaced. Something like that. But this is slow, you have to manually browse whole book to cache all those pages, and sometimes they may disappear. Huh...

For 2nd method you simply download encrypted pdf and decrypt using any pdf digital license removers, or adobe digital editions removers, or whatever they are called, like epubsoft. But these books take about 20 MB size, while same jp2 files downloaded manually take about 200 MB size for each book. Difference: resolution (quality).

Of course, an account is needed to be able to borrow anything from archive.org or openlibrary.org.

[Edited on 25-6-2018 by Mitigator]