Sciencemadness Discussion Board

Digital Library of India

Polverone - 11-12-2003 at 21:02

http://www.dli.gov.in
The website seems buggy, at least under Mozilla in Linux. However, there are some interesting books available, and the website has been set up such that you can find the directory where a book is stored and websuck all the scanned images. Of course the only images served are low-res GIF files, but they're sufficient to read the books.

A sample of the titles available:
F.H. Leeds & W.J. Atkinson Butterfield Acetylene The Principles of its Generation and Use

Chung Yu Wang Antimony

H.W. Webb Absorption Of Nitrous Gases

Jean Effront Biochemical Catalysts in Life and Industry

P.C.L. Thorne Chemistry From The Industrial Stand Point

Irving W. Fay The Chemistry of the Coal Tar Dyes

Walter Lob Electro Chemistry for Organic Compounds

Ludwig Gattermann The Practical Methods of Organic Chemistry

C.E. Parker Some Micro-Chemical Tests for Alkaloids

Arthur J. Hale The Synthetic Use of Metals in Organic Chemistry

a_bab - 12-12-2003 at 08:08

It seems to be a really good resource, pretty much like the Gallica library. It's piss easy to download a book as well: the path pointing to a TIFF page looks like http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=1, so http://www.dli.gov.in/data0/000/169/PTIFF/ is the folder with all the pages.
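A minimal Python sketch of the step from page URL to folder URL described above. It only assumes what the example shows: the folder is everything up to the last slash, with the query string dropped. The function name `book_folder` is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def book_folder(page_url):
    """Return the URL of the folder that holds a given page image."""
    parts = urlsplit(page_url)
    # Drop the file name, keep the directory, and discard the ?rs=... query.
    folder = posixpath.dirname(parts.path) + "/"
    return urlunsplit((parts.scheme, parts.netloc, folder, "", ""))


print(book_folder("http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=1"))
```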

Organikum - 12-12-2003 at 11:08

Somebody please suck this and throw it onto the FTP server?

thx
still digitally starving
ORG

Gattermann is a must have!

Polverone - 12-12-2003 at 11:21

Gattermann (this is an English translation, BTW) has been websucked and other titles are in progress. This will all show up on the FTP server eventually. I just wish the images that were available were higher-resolution. They have OCR-generated plaintext and HTML versions of the pages too, but the formatting is poor.

a_bab - 12-12-2003 at 11:33

Just solved the resolution issue!
Yes, you are right: the default resolution is barely good enough for reading.
BUT the path to a given image is http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=1, where rs=1 stands for a *given* resolution.

I've tried 0 (the basic resolution you can find in the folder), 2, 3, 4. While 4 is 694x1070, no. 3 seems to be the good choice: you can OCR the image (actually the texts are already there) and it is sharp and clear (approx. 80 KB in size, so 32 MB for a 400-page book). With Silx it should go under 20 MB.

The images could be downloaded manually, but I have another way: I generate the links via VB. I create a page with the list of links (actually, one can take the listing supplied by the browser for the images folder and modify it by adding "?rs=2" to every entry), then I upload the page somewhere and let Teleport Pro suck the pages (set to "1 address outside of the original server").

In other words, the books can be ripped ("sucked") very easily.
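The link-list trick above can be sketched in Python instead of VB. This is only an illustration under stated assumptions: that pages are numbered 00000001.tif, 00000002.tif and so on without gaps (the thread only shows the first file name), and that you already know the page count. `page_urls` and `link_page` are made-up names.

```python
from html import escape


def page_urls(folder_url, page_count, rs=3):
    """Build one image URL per page with the chosen rs value.

    Assumption: eight-digit, zero-padded, gap-free numbering starting at 1.
    """
    return [f"{folder_url}{n:08d}.tif?rs={rs}" for n in range(1, page_count + 1)]


def link_page(urls):
    """Return a bare HTML page of links that a site grabber can be pointed at."""
    items = "\n".join(f'<a href="{escape(u)}">{escape(u)}</a><br>' for u in urls)
    return f"<html><body>\n{items}\n</body></html>\n"


folder = "http://www.dli.gov.in/data0/000/169/PTIFF/"
urls = page_urls(folder, page_count=3, rs=3)
print(link_page(urls))
```

Save the output as an HTML file, upload it somewhere, and hand it to the grabber as described above.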

EDIT:
Just got the book Acetylene, in hi-res PDF. Where can I upload it? Anyone interested?

[Edited on 12-12-2003 by a_bab]

Excellent work!

Polverone - 12-12-2003 at 16:30

I would say you should upload them to Raistlin's FTP server when he's finished integrating E&W and Sciencemadness on it. I'm more of a Python kind of guy, but thanks for the hint about how to get higher-resolution images. There's probably at least a couple dozen books on that site that interest me, so hooray!

Edit: it seems that the image size keeps increasing with increasing rs parameters up to 7. 7 and above gives a black and white image that I would guess was scanned at about 600 DPI. I'd like to remind everybody who wants to try automated websucking that it'd be a shame if we hit the servers so hard that the administrators made it more difficult to do automated downloading. Use a program like wget (http://www.gnu.org/software/wget/wget.html, http://www.jensroesner.de/wgetgui/) and enable bandwidth throttling and random delay between grabbing the next file. This will make it less likely that the server will be overwhelmed or that automated logfile analysis will discover the large-scale downloading.
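For those who would rather script it than use wget (which has --limit-rate, --wait and --random-wait for this purpose), here is a rough Python sketch of the "random delay" half of the advice. It does not throttle bandwidth; it only pauses a random number of seconds between files. The function names and the 5 to 15 second range are assumptions for illustration, not tested values for the DLI servers.

```python
import random
import time
import urllib.request
from pathlib import Path


def fetch(url):
    """Download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def polite_download(urls, out_dir, min_delay=5.0, max_delay=15.0,
                    fetch=fetch, sleep=time.sleep):
    """Fetch URLs one at a time with a random pause between requests.

    Files already present in out_dir are skipped, so an interrupted run
    can be resumed without hitting the server again for the same pages.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    made_request = False
    for url in urls:
        # File name is the last path component with any ?rs=... removed.
        name = url.rsplit("/", 1)[-1].split("?", 1)[0]
        target = out / name
        if target.exists():
            continue
        if made_request:
            sleep(random.uniform(min_delay, max_delay))
        target.write_bytes(fetch(url))
        made_request = True
        saved.append(target)
    return saved
```

Call it with a list of page URLs and an output folder, for example `polite_download(urls, "acetylene")`.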

Another edit: From reading the goals and copyright disclaimer sections of their site, it seems that they vigorously support the free dissemination of information (unlike, say, the MOA project). Therefore I believe I will host the more interesting books either on my personal website or here in the Sciencemadness library, once suitable PDF files have been created.

[Edited on 12-13-2003 by Polverone]

a_bab - 13-12-2003 at 10:40

Well, I got tons of books from this site. I have a list of those worth downloading. I realised that some of the scans are crap, i.e. the edges are cropped too much (the acetylene book is a good example).

Another tip: the last HTML page located in the HTML folder of each book is actually the whole book rather than the last page of the book.

And a Xmas present for all of you (a little too early, I know): "TNT TRINITROTOLUENES AND MONO- AND DINITROTOLUENES - THEIR MANUFACTURE AND PROPERTIES" (1912) in rough HTML format (rar archive).

Attachment: TNT.rar (130kB)


the evolution of the DLI

Polverone - 3-1-2005 at 23:53

The Digital Library of India has a new home at http://www.dli.ernet.in/. In a year, their collection has greatly increased in size. I have found that the new shortcut to directly browsing their raw data is to go to http://www.dli.ernet.in/collections/. There one can find OCR-generated text, high-resolution and intended-for-display TIFF files, Perl scripts used to help the project process its books, and other things.

Nearly a year ago, I used scripts to download high-resolution image files for more than 50 scientific books from the old DLI site. A mere 5 of those have been placed online in the Sciencemadness library at http://www.sciencemadness.org/library/dlibooks.html. Since I recently received a Windows computer, I have been able to resume processing of the remaining books (cleaning up scans and applying OCR, then generating JBIG2 compressed PDFs). Unfortunately, I will probably be unable to upload the books that I am currently working on since Sciencemadness's web host has greatly oversold hard disk space and I cannot upload much more material even though I am officially well below my maximum disk quota. Further, the DLI has grown so much that I find the prospect of fetching and converting all the new books they have available daunting indeed.

Please post in this thread or contact me via U2U if you think you may be able to provide web hosting for the converted DLI books. I would also be interested in making contact with people who would be willing to coordinate work on turning the raw TIFF files into clean, OCRed PDF documents. I just want to avoid duplication of effort and maybe find some like-minded people willing to work on this.

IPN - 4-1-2005 at 10:57

I used DownThemAll! (an extension for Mozilla) to get the TIFF files of a few chemistry books from the DLI. I could process them into clean PDF files but I don't have any OCR software.
Any recommendations?
Also, are there any specific books other than the chemistry ones that should be downloaded and processed?

Digital Library of India

Sergei_Eisenstein - 4-1-2005 at 11:34

I have found that www.archive.org also has some books from the Digital Library of India (in djvu format).

http://www.archive.org/texts/textslisting-browse.php?collect...

Organikum - 4-1-2005 at 11:56

Go to
http://any2djvu.djvuzone.org
and run the TIFFs or the PDF through this server, which converts them to DjVu with OCR.

/ORG

Polverone - 4-1-2005 at 14:07

It's okay if you don't have OCR software as long as you have a decent network connection, since you can then upload the processed TIFF files for someone else to OCR or, as Organikum suggested, make a DjVu file and process it with any2djvu. I have found with any2djvu that you may have trouble uploading large files from the browser; in this case it's best to stick the file on some WWW server and tell any2djvu where it is.

The books provided by archive.org are nice if you don't want to wait for the cleaned-up versions to grow, but they are inferior in at least three ways:

1) They are too large. I tried downloading AluminiumAndItsAlloys.djvu and the file is nearly 17 MB in size for 226 pages. This is 3-4 times more bytes than what I'd consider reasonable for a book of this size.
2) Scanning garbage around the edges has not been cleaned up, nor have blotches and speckles been removed. The images therefore look inferior and would waste considerable ink/toner if you were to print the files.
3) No OCR has been applied.
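To put point 1 in numbers, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above (17 MB, 226 pages, "3-4 times" too large) and assumes 1 MB = 1024 KB.

```python
def kb_per_page(total_mb, pages):
    """Average size of one page in KB, taking 1 MB as 1024 KB."""
    return total_mb * 1024 / pages


archive_org = kb_per_page(17, 226)
print(f"archive.org copy: about {archive_org:.0f} KB per page")

# "3-4 times more bytes than reasonable" implies a target of 17/4 to 17/3 MB.
low, high = 17 / 4, 17 / 3
print(f"reasonable size for the same book: roughly {low:.1f} to {high:.1f} MB")
```

That works out to roughly 77 KB per page for the archive.org file, against a target of somewhere between 4 and 6 MB for the whole book.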

I believe that points 2 and 1 are related. Like JBIG2, DjVu's bitonal compression scheme is optimized for certain patterns (like repeated images of printed letters in scanned text). Noisy garbage like scanning artifacts does not exhibit these patterns and cannot be compressed nearly as efficiently.

Further, the archive.org books represent only a limited fraction of the books available on the DLI site itself.

As to what books apart from chemistry might be interesting, I think the biology, agriculture, physics, and engineering categories may all have some interesting titles in them. I have found in using the DLI web interface that it has some errors where books will appear to be not-found because the URL being used has "ons" instead of "collections" in it. Manually fixing the URL fixes this problem. If you want to grab books for touchup, though, you'll want to browse through the raw data instead of the web interface since that's the only way to get the higher-resolution TIFF files. Happily, the new DLI serves genuine CCITT compressed TIFF files instead of GIF files masquerading as TIFF. This saves considerable time/space when downloading.
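The manual URL fix mentioned above could be automated along these lines. This is a guess at the shape of the broken links: it assumes the bad URL has a path segment that is exactly "ons" where "collections" belongs. The example URL in the comment is hypothetical, not a real DLI address.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_collections_url(url):
    """Replace a path segment that is exactly 'ons' with 'collections'."""
    parts = urlsplit(url)
    segments = ["collections" if s == "ons" else s for s in parts.path.split("/")]
    return urlunsplit(
        (parts.scheme, parts.netloc, "/".join(segments), parts.query, parts.fragment)
    )


# Hypothetical broken link, for illustration only.
print(fix_collections_url("http://www.dli.ernet.in/ons/example/book.html"))
```

Matching whole segments rather than doing a plain text replace avoids mangling other words in the URL that happen to contain "ons".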

hellomynameisop - 5-1-2005 at 00:30

Quote:
Originally posted by Polverone
The Digital Library of India has a new home at http://www.dli.ernet.in/. In a year, their collection has greatly increased in size. I have found that the new shortcut to directly browsing their raw data is to go to http://www.dli.ernet.in/collections/.


At first I had assumed their site was no longer available, but then I realized your post was very recent. If anyone is having trouble getting the above links to work, try removing the "www" from the beginning, so they should be:

http://dli.ernet.in/
and
http://dli.ernet.in/collections/

I thought most browsers corrected this automatically. Evidently Netscape does not :(
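If you are fetching with a script rather than a browser, the same workaround can be applied in code. A small Python sketch, with `strip_www` as a made-up helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return the URL with a leading 'www.' removed from the host name."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_www("http://www.dli.ernet.in/collections/"))
```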

update

Polverone - 1-2-2005 at 19:56

I have touched up several books and created OCR'd PDF files from them. BromicAcid has generously agreed to host a large number of these books, saving the bandwidth and disk space of the main Sciencemadness web site. The Sciencemadness Library (http://www.sciencemadness.org/library/) has been updated to feature the complete books more prominently. Visit now to see the 16 books that have been added in the most recent batch. I will continue to add books in batches as I produce archival CDs for my own use. Artistically inclined members should feel free to submit alternate images that could be used as a banner over the library. I want something clean, but a little less stiff than the image that's there now.

another update

Polverone - 5-3-2005 at 20:37

The following DLI books have been added to the library (http://www.sciencemadness.org/library/):

Aluminium and its Alloys
Absorption of Nitrous Gases
An Introduction to the Chemistry of Plant Products
Animal Proteins
Ephedrine and Related Substances
Industrial Nitrogen Compounds and Explosives*
Industrial Fermentations
The Catalytic Oxidation of Organic Compounds in the Vapor Phase
The Chemistry and Literature of Beryllium
The Chemistry of the Coal Tar Dyes
The Silicates in Chemistry and Commerce
Treatise on General and Industrial Organic Chemistry II
What Industry Owes to Chemical Science

As always, thanks go to BromicAcid for providing server space and bandwidth for these books.

*It turns out that the production of cyanates and subsequent reduction to cyanides with hot carbon (my favorite home-cyanide process) was actually considered for industrial use and patented at one time. There is nothing new under the sun, not even in amateur chemistry.

Sorry....

BromicAcid - 14-3-2005 at 15:40

Due to Sciencemadness going down and redirects going to my site for the day, coupled with the new books and the extra use they have generated, I have used up 85% of my bandwidth. Since I don't want my site to go down for the rest of the month, I've temporarily disabled access to the books hosted on my site. Sorry for any inconvenience.