Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Digital Library of India
http://www.dli.gov.in
The website seems buggy, at least under Mozilla in Linux. However, there are some interesting books available, and the website has been set up such
that you can find the directory where a book is stored and websuck all the scanned images. Of course the only images served are low-res GIF files, but
they're sufficient to read the books.
A sample of the titles available:
F.H. Leeds & W.J. Atkinson Butterfield, Acetylene: The Principles of its Generation and Use
Chung Yu Wang, Antimony
H.W. Webb, Absorption Of Nitrous Gases
Jean Effront, Biochemical Catalysts in Life and Industry
P.C.L. Thorne, Chemistry From The Industrial Stand Point
Irving W. Fay, The Chemistry of the Coal Tar Dyes
Walter Lob, Electro Chemistry for Organic Compounds
Ludwig Gattermann, The Practical Methods of Organic Chemistry
C.E. Parker, Some Micro-Chemical Tests for Alkaloids
Arthur J. Hale, The Synthetic Use of Metals in Organic Chemistry
|
|
a_bab
Hazard to Others
Posts: 458
Registered: 15-9-2002
Member Is Offline
Mood: Angry !!!!!111111...2?!
|
|
It seems to be a really good resource, much like the Gallica library. It's piss easy to download a book as well: the path pointing to a
TIFF page looks like http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=1, so http://www.dli.gov.in/data0/000/169/PTIFF/ is the folder with all the pages.
|
|
Organikum
resurrected
Posts: 2337
Registered: 12-10-2002
Location: Europe
Member Is Offline
Mood: frustrated
|
|
Somebody please suck this and throw it onto the FTP server?
thx
still digitally starving
ORG
Gattermann is a must have!
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Gattermann (this is an English translation, BTW) has been websucked and other titles are in progress. This will all show up on the FTP server
eventually. I just wish the images that were available were higher-resolution. They have OCR-generated plaintext and HTML versions of the pages too,
but the formatting is poor.
|
|
a_bab
Hazard to Others
Posts: 458
Registered: 15-9-2002
Member Is Offline
Mood: Angry !!!!!111111...2?!
|
|
Just solved the resolution issue!
Yes, you are right: the default resolution is barely good enough for reading.
BUT the path to a given image is http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=1, where rs=1 stands for a *given* resolution.
I've tried 0 (the basic resolution you can find in the folder), 2, 3 and 4. While 4 is 694x1070, 3 seems to be the good choice: the image is
sharp and clear, and you can OCR it (actually the texts are already there). A page is approx. 80 KB in size, so 32 MB for a 400-page book. With Silx
it should go under 20 megs.
The images could be downloaded manually, but I have another way: I generate the links via VB. I create a list with the links (actually one can use the
list supplied by the browser in the images folder and modify it by adding "?rs=2" to every entry), then I upload the page somewhere
and let Teleport Pro suck the pages (set on "1 address outside of the original server").
In other words, the books can be ripped ("sucked") very easily.
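For anyone who would rather not use VB, here is a minimal Python sketch of the same link-list idea. It assumes pages are numbered consecutively from 1 and zero-padded to eight digits (inferred only from the 00000001.tif example above); the function name and the 400-page count are made up for illustration.

```python
def page_urls(book_folder, page_count, rs=3):
    """Build one image URL per page, e.g. .../00000001.tif?rs=3.

    Assumption: pages are numbered 1..page_count with eight-digit,
    zero-padded file names, as in the single example URL in this thread.
    """
    base = book_folder.rstrip("/")
    return [f"{base}/{page:08d}.tif?rs={rs}" for page in range(1, page_count + 1)]


# Hypothetical 400-page book stored in the folder quoted earlier in the thread.
urls = page_urls("http://www.dli.gov.in/data0/000/169/PTIFF/", page_count=400, rs=3)

# Write the list to a text file that a download tool can read.
with open("links.txt", "w") as handle:
    handle.write("\n".join(urls) + "\n")
```

The resulting links.txt can be fed to any downloader that accepts a list of URLs.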
EDIT:
Just got the book Acetylene, in hi-res PDF. Where can I upload it? Anyone interested?
[Edited on 12-12-2003 by a_bab]
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
Excellent work!
I would say you should upload them to Raistlin's FTP server when he's finished integrating E&W and Sciencemadness on it. I'm more
of a Python kind of guy, but thanks for the hint about how to get higher-resolution images. There's probably at least a couple dozen books on
that site that interest me, so hooray!
Edit: it seems that the image size keeps increasing with increasing rs parameters up to 7. 7 and above gives a black and white image that I would
guess was scanned at about 600 DPI. I'd like to remind everybody who wants to try automated websucking that it'd be a shame if we hit the
servers so hard that the administrators made it more difficult to do automated downloading. Use a program like wget (http://www.gnu.org/software/wget/wget.html, http://www.jensroesner.de/wgetgui/) and enable bandwidth throttling and a random delay between files. This will make it less likely
that the server will be overwhelmed or that automated logfile analysis will discover the large-scale downloading.
Another edit: From reading the goals and copyright disclaimer sections of their site, it seems that they vigorously support the free dissemination of
information (unlike, say, the MOA project). Therefore I believe I will host the more interesting books either on my personal website or here in the
Sciencemadness library, once suitable PDF files have been created.
[Edited on 12-13-2003 by Polverone]
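With wget itself, the relevant options are --limit-rate, --wait and --random-wait. For those scripting the download instead, the following is a rough Python sketch of the "random delay between files" part only; it does not throttle bandwidth. The function names, the delay range and the fetch/sleep parameters are my own assumptions, not anything the DLI site prescribes.

```python
import random
import time
import urllib.request
from pathlib import Path


def _http_get(url):
    """Download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def polite_fetch(urls, out_dir, min_delay=2.0, max_delay=8.0,
                 fetch=_http_get, sleep=time.sleep):
    """Save each URL into out_dir, pausing a random time between requests.

    Files that already exist are skipped, so an interrupted run can be
    resumed without asking the server for the same page twice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Local file name: last path component without the ?rs=N query.
        name = url.rsplit("/", 1)[-1].split("?", 1)[0]
        target = out / name
        if target.exists():
            continue
        if saved:
            # Wait only between real requests, never before the first one.
            sleep(random.uniform(min_delay, max_delay))
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved


# Example call (left commented out so nothing is downloaded by accident):
# polite_fetch(["http://www.dli.gov.in/data0/000/169/PTIFF/00000001.tif?rs=3"], "acetylene")
```

Passing fetch and sleep as parameters is only there so the loop can be tried out without touching the server.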
|
|
a_bab
Hazard to Others
Posts: 458
Registered: 15-9-2002
Member Is Offline
Mood: Angry !!!!!111111...2?!
|
|
Well, I got tons of books from this site. I have a list of those worth downloading. I realised that some of the scans are crap, i.e. the edges
are cut too much (the Acetylene book is a good example).
Another tip: the last HTML page located in the HTML folder of each book is actually the whole book rather than the last page of the book.
And an Xmas present for all of you (a little too early, I know): "TNT TRINITROTOLUENES AND
MONO- AND DINITROTOLUENES - THEIR MANUFACTURE AND PROPERTIES" (1912) in rough HTML format (rar archive).
Attachment: TNT.rar (130kB) This file has been downloaded 799 times
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
the evolution of the DLI
The Digital Library of India has a new home at http://www.dli.ernet.in/. In a year, their collection has greatly increased in size. I have found that the new shortcut to directly browsing their
raw data is to go to http://www.dli.ernet.in/collections/. There one can find OCR-generated text, high-resolution and intended-for-display TIFF files, Perl scripts
used to help the project process its books, and other things.
Nearly a year ago, I used scripts to download high-resolution image files for more than 50 scientific books from the old DLI site. A mere 5 of those
have been placed online in the Sciencemadness library at http://www.sciencemadness.org/library/dlibooks.html. Since I recently received a Windows computer, I have been able to resume processing of the
remaining books (cleaning up scans and applying OCR, then generating JBIG2 compressed PDFs). Unfortunately, I will probably be unable to upload the
books that I am currently working on since Sciencemadness's web host has greatly oversold hard disk space and I cannot upload much more material
even though I am officially well below my maximum disk quota. Further, the DLI has grown so much that I find the prospect of fetching and converting
all the new books they have available daunting indeed.
Please post in this thread or contact me via U2U if you think you may be able to provide web hosting for the converted DLI books. I would also be
interested in making contact with people who would be willing to coordinate work on turning the raw TIFF files into clean, OCRed PDF documents. I just
want to avoid duplication of effort and maybe find some like-minded people willing to work on this.
PGP Key and corresponding e-mail address
|
|
IPN
Hazard to Others
Posts: 156
Registered: 31-5-2003
Location: Finland
Member Is Offline
Mood: oxidized
|
|
I used DownThemAll! (an extension for Mozilla) to get the TIFF files of a few chemistry books from the DLI. I could process them into clean PDF files
but I don't have any OCR software.
Any recommendations?
Also, are there any specific books other than the chemistry ones that should be downloaded and processed?
|
|
Sergei_Eisenstein
Hazard to Others
Posts: 290
Registered: 13-12-2004
Location: Waziristan
Member Is Offline
Mood: training
|
|
Digital Library of India
I have found that www.archive.org also has some books from the Digital Library of India (in DjVu format).
http://www.archive.org/texts/textslisting-browse.php?collect...
|
|
Organikum
resurrected
Posts: 2337
Registered: 12-10-2002
Location: Europe
Member Is Offline
Mood: frustrated
|
|
Go to
http://any2djvu.djvuzone.org
and run the TIFFs or the PDF through this server, which converts them to DjVu with OCR.
/ORG
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
It's okay if you don't have OCR software as long as you have a decent network connection, since you can then upload the processed TIFF files
for someone else to OCR or, as Organikum suggested, make a DjVu file and process with any2djvu. I have found with any2djvu that you may have trouble
uploading large files from the browser; in this case it's best to stick the file on some WWW server and tell any2djvu where it is.
The books provided by archive.org are nice if you don't want to wait for the cleaned-up versions to grow, but they are inferior in at least three
ways:
1) They are too large. I tried downloading AluminiumAndItsAlloys.djvu and the file is nearly 17 MB in size for 226 pages. This is 3-4 times more bytes
than what I'd consider reasonable for a book of this size.
2) Scanning garbage around the edges has not been cleaned up, nor have blotches and speckles been removed. The images therefore look inferior and
would waste considerable ink/toner if you were to print the files.
3) No OCR has been applied.
I believe that points 1 and 2 are related. Like JBIG2, DjVu's bitonal compression scheme is optimized for certain patterns (like repeated images
of printed letters in scanned text). Noisy garbage like scanning artifacts does not exhibit these patterns and cannot be compressed nearly as
efficiently.
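To put rough numbers on point 1, here is a back-of-the-envelope calculation in Python. It only uses figures quoted in this thread (17 MB for 226 pages, the "3-4 times" estimate, and the earlier 80 KB per page at rs=3); it treats 1 MB as 1000 KB, and the variable names are mine.

```python
def per_page_kb(total_mb, pages):
    """Average size per page in KB, taking 1 MB = 1000 KB."""
    return total_mb * 1000 / pages


# archive.org DjVu: nearly 17 MB for 226 pages.
archive_org_kb = per_page_kb(17, 226)        # about 75 KB per page

# "3-4 times more than reasonable" implies a target of roughly this size.
reasonable_mb = (17 / 4, 17 / 3)             # about 4.3 to 5.7 MB

# Uncleaned rs=3 images from the old site: 80 KB per page, 400 pages.
rs3_book_mb = 80 * 400 / 1000                # 32 MB

print(f"{archive_org_kb:.0f} KB/page, target {reasonable_mb[0]:.1f}-{reasonable_mb[1]:.1f} MB")
```

So the archive.org file works out to about the same per-page size as the raw rs=3 images from the old site.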
Further, the archive.org books represent only a limited fraction of the books available on the DLI site itself.
As to what books apart from chemistry might be interesting, I think the biology, agriculture, physics, and engineering categories may all have some
interesting titles in them. I have found in using the DLI web interface that it has some errors where books will appear to be not-found because the
URL being used has "ons" instead of "collections" in it. Manually fixing the URL fixes this problem. If you want to grab books for
touchup, though, you'll want to browse through the raw data instead of the web interface since that's the only way to get the
higher-resolution TIFF files. Happily, the new DLI serves genuine CCITT compressed TIFF files instead of GIF files masquerading as TIFF. This saves
considerable time/space when downloading.
|
|
hellomynameisop
Harmless
Posts: 1
Registered: 5-1-2005
Member Is Offline
Mood: No Mood
|
|
At first I had assumed their site was no longer available, but then I realized your post was very recent. If anyone is having trouble getting the
above links to work, try removing the "www" from the beginning, so they should be:
http://dli.ernet.in/
and
http://dli.ernet.in/collections/
I thought most browsers corrected this automatically. Evidently Netscape does not.
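If the links are being fed to a script rather than a browser, the same fix can be applied automatically. A small sketch using Python's standard urllib.parse; the function name drop_www is made up here.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_www(url):
    """Return the URL with a leading "www." removed from the host name."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[len("www."):]
    return urlunsplit(parts._replace(netloc=host))


for link in ("http://www.dli.ernet.in/", "http://www.dli.ernet.in/collections/"):
    print(drop_www(link))
```

URLs that do not start with "www." pass through unchanged.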
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
update
I have touched up several books and created OCR'd PDF files from them. BromicAcid has generously agreed to host a large number of these books,
saving the bandwidth and disk space of the main sciencemadness web site. The Sciencemadness Library (http://www.sciencemadness.org/library/)
has been updated to feature the complete books more prominently. Visit
now to see the 16 books that have been added in the most recent batch. I will continue to add books in batches as I produce archival CDs for my own
use. Artistically inclined members should feel free to submit alternate images that could be used as a banner over the library. I want something
clean, but a little less stiff than the image that's there now.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
another update
The following DLI books have been added to the library (http://www.sciencemadness.org/library/):
Aluminium and its Alloys
Absorption of Nitrous Gases
An Introduction to the Chemistry of Plant Products
Animal Proteins
Ephedrine and Related Substances
Industrial Nitrogen Compounds and Explosives*
Industrial Fermentations
The Catalytic Oxidation of Organic Compounds in the Vapor Phase
The Chemistry and Literature of Beryllium
The Chemistry of the Coal Tar Dyes
The Silicates in Chemistry and Commerce
Treatise on General and Industrial Organic Chemistry II
What Industry Owes to Chemical Science
As always, thanks go to BromicAcid for providing server space and bandwidth for these books.
*It turns out that the production of cyanates and subsequent reduction to cyanides with hot carbon (my favorite home-cyanide process) was actually
considered for industrial use and patented at one time. There is nothing new under the sun, not even in amateur chemistry.
|
|
BromicAcid
International Hazard
Posts: 3246
Registered: 13-7-2003
Location: Wisconsin
Member Is Offline
Mood: Rock n' Roll
|
|
Sorry....
Due to sciencemadness going down and redirects going to my site for the day, coupled with the new books and the extra use they have generated, I have used
up 85% of my bandwidth. Since I don't want my site to go down for the rest of the month, I've temporarily disabled access to the
books hosted on my site. Sorry for any inconvenience.
|
|