Sciencemadness Discussion Board

scanners & ocrs-what's best for book scanning?

chemrox - 31-7-2007 at 17:22

Two books have fallen into my friend's hands and he would like to share them with the group. They are: Casy & Parfitt, Opioid Analgesics, and Cook, Enamines.

He started scanning with the HP flatbed he uses at work and was quickly frustrated. It had no multi-page TIF, BMP, etc. capability, and the OCR did a great job on some pages while others that looked very similar in the book were unrecognizable when interpreted. He had scanned forty-odd pages when the system froze and had to be re-initialized, which forced him to start a new file. He thinks the books should be one file each.

He hasn't bothered to send Polverone a promised sample because the quality was so disappointing.

He (and I) solicit your recommendations on software/hardware combinations that seem better for scanning articles and books.

We thank you,
CRX

[Edited on 31-7-2007 by chemrox]

Nicodem - 31-7-2007 at 22:23

There already is a thread on this topic.

solo - 1-8-2007 at 05:11

This may not be the thread Nicodem alluded to, but here is some help......solo

https://sciencemadness.org/talk/viewthread.php?tid=8104&...

Lambda - 1-8-2007 at 06:38

@chemrox,

Honorable Conrad @Solo has referred to a recent thread, but old bones can be found here:

Making E-books into "real" books:
https://sciencemadness.org/talk/viewthread.php?tid=4070&...

It's a pity, but this thread seems to have passed away prematurely. However, with your request on eBook scanning, maybe some life can be pumped into the brewery so that we can all again enjoy its sweet alcoholic beverages.

Oh,.. by the way, I assume you already have good OCR software, but if not, then it can be downloaded via Madhatter's FTP services in the folder:

UPLOAD / Lambda / Software / ABBYY Collection 2006 (FineReader v8.0 Pro - PDF Transformer v2.0 - Scan To Office v1.0) /

as

ABBYY Collection 2006 (FineReader v8.0 Pro - PDF Transformer v2.0 - Scan To Office v1.0) - By Helion Prime.iso (501 MB)

It's the best OCR Software I know for the Job !

Who the hell uploaded that Software to my folder ?;)

Regards,

Lambda.

Sauron - 1-8-2007 at 23:17

Very few of the books I have seen from the forum library or from MadHatter FTP have been OCR'd - rather they were scanned as page images and then assembled directly into pdf or djvu.

OCR in my experience still requires proofreading and if the book has a lot of illustrations (and what chemistry book does not?) then some human intervention is required.

When scanning in as page images, no such attention is required. All you need is a good flat page to plate contact and a judicious selection of resolution and mode and assuming you have a good scanner you will get great images that assemble easily into your choice of format.

OCR is best for pure text if and only if you need to be able to word-process (edit) later, or if you are going for absolutely minimum file size. In these days of cheap mass storage I see little motivation to fight for small files by investing about 4X the labor.

I just scanned two 600 page books - the sixth and seventh Harry Potters - using an elderly Canon flatbed with serial interface (not USB) and did so directly into Adobe Acrobat 8.0 Professional using the Create PDF From Scanner option.

No worries.

Polverone - 2-8-2007 at 08:22

Most if not all of the books in the forum library have OCR applied, but the OCR text is hidden beneath the page image and left uncorrected. This is the same sort of OCR that journal archives apply to their articles, and it's very useful for searching. It can also be useful for copy/paste quotation as long as the scan is of reasonable quality and the passage doesn't contain much mathematical or chemical notation.

Lambda - 2-8-2007 at 18:40

I regularly find interesting articles on the Internet that have been scanned and are full of dark spots and smudges. Often the margins of the article or book pages are just left unattended to, for instance when an A5 book page has been scanned to A4 format. I then use "PaintShop Pro" to erase the spots and to straighten the page by rotating it slightly, and save the image in TIFF format. Depending on how many pages are involved, I keep the number of digits in the page numbering the same. For instance, 99 pages or less would be 01 to XX, and 999 pages or less would be 001 to XXX. In this way they can easily be batch converted to PDF or DjVu format without having page 1 succeeded by page 10, 11, 12 and page 2 by page 20, 21, 22, etc. Then again, it's a matter of how you adjust your page sequence conversion. By just keeping the number of digits the same, you may play "Brain Dead", for things almost always work out just fine then.
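The fixed-width numbering described above can be sketched in a few lines of Python. This is only an illustration, not the workflow of any tool named in this thread: the function name padded_names and the file names like "page7.tif" are hypothetical, and the sketch assumes the page number is the last run of digits before the file extension.

```python
import re

# Hypothetical helper: pads the page number in each file name to a common
# width so that plain alphabetical sorting matches page order.
_PAGE_NAME = re.compile(r"^(.*?)(\d+)(\.[^.]+)$")

def padded_names(names):
    """Return a dict mapping each original name to its zero-padded name."""
    parsed = []
    for name in names:
        match = _PAGE_NAME.match(name)
        if match is None:
            raise ValueError(f"no page number found in {name!r}")
        prefix, digits, ext = match.groups()
        parsed.append((name, prefix, int(digits), ext))
    if not parsed:
        return {}
    # Width is set by the largest page number: 2 digits up to 99, 3 up to 999.
    width = max(len(str(number)) for _, _, number, _ in parsed)
    return {
        name: f"{prefix}{number:0{width}d}{ext}"
        for name, prefix, number, ext in parsed
    }

if __name__ == "__main__":
    for old, new in padded_names(["page1.tif", "page2.tif", "page10.tif"]).items():
        print(old, "->", new)
```

The mapping could then be applied with os.rename, or with a batch renamer, before the images are handed to the PDF or DjVu converter.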

A very handy batch converter and rename program that I love, and that has been of great benefit to me, is "Better File Rename v4.9.4". Another program which can also be used is "Total Commander v7.0 - Public Beta 1". They are both available via Madhatter's FTP services in the folder;

UPLOAD / Lambda / Software / Better File Rename v4.9.4 with Serial /

as

"Better File Rename v4.9.4 with Serial.rar" (912 KB)

and

UPLOAD / Lambda / Software / Total Commander v7.0 - Public Beta 1 /

as

"Total Commander v7.0 - Public Beta 1 + Key.rar" (2 MB)

It's really astonishing what programs I find in my "upload folder", thanks to nameless Samaritans.

Regards,

Lambda.

chemrox - 3-8-2007 at 15:29

OK- I want to followup with these last two posts. Isn't it the case that book scans work best when run through an ocr and presented as a wordprocessor file or pdf? Isn't it best for the conversion to pdf to have one filename for the whole lot?

BTW, nobody clued me in on how to get to MadHatter's FTP; are there instructions somewhere? You could send me a U2U on the last one....

Thanks for all this help .. I want to put in opioids, casy & parfitt; alkaloids of opium (?)(1930's) and Enamines

(to capitalize or not? i get lazy..)

Sauron - 3-8-2007 at 18:01

The short answer to your questions is NO. Just get yourself your choice of a PDF creator and you will see.

MadHatter is a member of this forum. PM him politely and request a PW and other details to his ftp site, it is his personal site and no one else can let you in. Be patient awaiting a reply.

Attached are the first sixteen pages of Bodanszky's Principles of Peptide Synthesis (2nd Ed.) as an example of what can be done in a single step with a single application - in this instance Adobe Acrobat 8 Professional.

The PDF was created directly from the scanner at 180 dpi in Text mode with OCR underneath, which is just a matter of checking a box on the setup page. I did not optimize the image area to eliminate white space but will do so in the book proper. That can also be done after the fact.

Adobe Acrobat 8 Pro requires XP Service Pack 2 to install, but I have found that once installed, you can uninstall SP2 if you wish and Adobe still runs without any apparent problems.

I would expect previous releases of Acrobat to give similar results, and I think Version 7 is on MadHatter. I got mine at Bangkok's IT Mall for about $5.

Unfortunately it is pretty necessary to disassemble the book to get really good scans.

[Edited on 4-8-2007 by Sauron]

Attachment: fmtocintro.pdf (545kB)


Lambda - 4-8-2007 at 14:07

Nice Scans @Sauron, a neat Job !

In "PaintShop Pro", what I do is select only the "Text Window" and then save it in TIFF-Format. In this way the Image will cover the whole Page-Format. A few Centimeters left free "above", "under", "left" and "right" can be obtained by copying and pasting the selection onto a fresh new page, and then saving it again as a TIFF-Image. A "Raster" (grid) can be used to align it all neatly when pasting the Image; just make sure that it is turned "off" when saving the new Image.

With "PaintShop Pro", removing spots and smudges, and even slightly rotating the image to align the whole neatly, can easily be done.

I "print out" and "Book Bind" my scans, so free space at the sides of the "Text-Window" becomes important, especially at the binding edges where two pages interleave. The Book will then not have to be forcefully bent open to read the adjacent interconnecting Text Parts. Also, after Binding all the Pages together, the Book is then cut on three sides, namely the "outer side", "above" and "under side". Only then do I "paste on" the Cover, which has already been cut to the appropriate size, slightly bigger (~0.5cm on all free sides) than the Format of the "Book-Page-Window".

Scanning the Book in;

This can be done in two ways: directly via a scanner device (there are many kinds and techniques), or by means of Photocopying the Book and feeding the copies through a Scanner. It is very important not to damage the original Book, for I personally consider this to be a crime (there are worse things, I know). In order not to damage the Book Spine, it should not be forcefully bent open, especially "Glued-Cover" Books, the ones that are not Sewn and have a Soft Cover (Paperbacks). Luckily, "smart-minded people" have thought out a way to do this while keeping the Book undamaged. There are Photocopying devices that take images from the edge, meaning that the Book may be bent over the side at an angle of about 90 Degrees or slightly more, so as not to force the Book open 180 Degrees. The Binding Edges and Cover (Spine) are thus not subjected to damaging stressful forces which may ruin the Book. A few examples of Scanners that can do this;


HP PlusTek3600

HP PlusTek3600 Scanner is Designed for Book Scanning for $318 (zero edge scanning);
http://gizmodo.com/gadgets/gadgets/copyright+violating-scann...

Plustek Opticbook 3600 Book Scanner for $319.99 (zero edge scanning);
http://www.tigerdirect.ca/applications/SearchTools/item-deta...

The Microtek ScanMaker s450 for only $99 (Zero Boundary Design);
http://www.photonewstoday.com/?p=4879

And here are a few links on how to digitize books;

How to Digitize a Book (Interesting, but he buggers up the book !!);
http://www.jaml.com/eBook/

How to scan a book (Again !!);
http://www.proportionalreading.com/scan.html

Book scanning;
http://en.wikipedia.org/wiki/Book_scanning

There is a lot more to be said about Book Scanning and Book Binding, and hopefully this thread will continue with for instance commentary and advice by @S.C. Wack and @Mephisto who both do a hell of a job. There are also others, but the two above mentioned which come to mind are known to produce quality work in this field.

Regards,

Lambda.

[Edited on 6-8-2007 by Lambda]

Sauron - 4-8-2007 at 17:46

I'm at a point with my eyesight where I have a hard time reading hardcopy of anything and it is far easier to read on a 19 inch flat panel, so scanning for me is no longer a luxury. It's a necessity.

I have now scanned 40% of the Bodanszky book, and all pages after the introduction were cropped to minimize white space.

Next I will scan the companion lab manual "Practice of Peptide Synthesis" by same author. This is along with "Principles" the primary text on the subject at the Oxbridge chemistry departments I think.

I'll be adding these to MadHatter and also putting them up on 4shared.

Lambda - 5-8-2007 at 17:07

I have been very fortunate today, for I have managed to buy the complete German 3 volume set of "Chemie Lexikon by Hermann Römpp" for half the Price. Each volume is nearly as thick as the biceps including triceps of @Arnold schwarzenegger, and weigh a "Ton". I already have it on CD-ROM (~500MB), but nothing beats the sexy feeling of a Book. And I now again feel like a Virgin.....:D



Anybody interested in the "Römpp CD-ROM" ?... in German Language though, and I guess that I have to split it up into RAR-Parts for you @Sauron, due to Big-File download problems, aye...?

About your bad eye sight;

Is it not possible to lay a Book on your Scanner and view it via your TV Set or Beamer by means of a Computer TV-Card? There are also Full-Vision Magnifying Loupes with a built-in Light Source, preferably with a Halide Lamp, for the Spectrum corresponds best to that of Sunlight. Often TL Lights are used, and these Magnifying Loupes have a shitty round form.

This looks like what may benefit you the most, and then by using your own Light Source;

Fresnel Book Magnifier:
http://www.narang.com/laboratory_equipments/magnifiers.php
http://www.lenseloptics.com/l103.html
http://www.3dlens.com/bookmagnifier.htm
http://process-equipment.globalspec.com/Industrial-Directory...


Book Stand with Fresnel Magnifier Plastic ($79.95):
http://www.lssproducts.com/product/6477/fresnel

Maybe it's better that we U2U each other about this.

In the meantime, I will dig up several good Books on Book Binding so that we can continue this thread.

Take care !

Regards,

Lambda.

[Edited on 6-8-2007 by Lambda]

Sauron - 6-8-2007 at 02:58

I finished scanning Principles of Peptide Synthesis into PDF, using Acrobat Pro 8.1.0

Redid the front matter, TOC and introduction to match the rest of the book.

Put it up on 4shared.com, link is now posted in References, q.v.

Am now working on the companion The Practice of Peptide Synthesis, 2nd Edition, and that should be up in a few days as well.

Enjoy!

Sauron - 7-8-2007 at 05:52

"Practice" is now 60% scanned in and will be completed within 24 hrs.

An important tip for wannabe book scanners: do not skimp on physical RAM. I found that 512 MB RAM gave a 60 sec per page scan time for a c. A4 page in text mode @ 180 dpi, very tedious. Merely adding another 512 MB RAM (total 1 GB) cut scan time for the same parameters roughly fivefold, to 11 to 12 seconds/page.

Upgrading from 8.0 to 8.1.0 version Acrobat Professional cut the OCR postprocessing time in half as well, but it was already fast.

Sauron - 20-11-2007 at 04:20

The Microtek ScanMaker s450 is vaporware in the Thailand market. There are three Microtek importers/distributors in the Kingdom and NONE stock this model. I have coerced the largest into importing it for me and am now awaiting a quotation.

This scanner is classified as a home/office product by its maker. But it is 4800 x 9600 optical resolution and 48 bit color depth. It is primarily intended for scanning continuous tone color prints, slides and film negatives. I do not quite understand why a book scanning feature like Zero Boundary was tacked on to this model. Scanning color plates (or grayscale plates) in books with a digital scanner is fraught with problems, and who needs 4800 dpi for text?

In general a home/office scanner is not intended to stand up to heavy duty use, but at $99 it is cheap enough to be a throwaway. So I do not worry about aftersales service. I'll just buy a spare one if I like it, and replace them as they die off.

Microtek

MadHatter - 20-11-2007 at 18:13

I've used the Microtek V6UPL flatbed scanner with good results. It's old, slow even
at low resolutions, but does a reasonable job. My associate, TMP, used this to scan in
Hydrazine And Its Derivatives. He's still editing the individual pages. He was disappointed
with his Lexmark all-in-one for some reason.

chemrox - 20-11-2007 at 22:19

I have an older Microtek ScanMaker too and just loaded a newer driver (still available). The OCR supplied was ABBYY. My friend Sauron recommends it. I want to scan Casy & Parfitt for the forum but I'm still confused about a few things including the following:

Quote:
Originally posted by Polverone
Most if not all of the books in the forum library have OCR applied, but the OCR text is hidden beneath the page image and left uncorrected. This is the same sort of OCR that journal archives apply to their articles, and it's very useful for searching. It can also be useful for copy/paste quotation ...


I don't see the advantage in having the OCR text under the graphic files or how to make a continuous graphic file from the individual pages. The OCR will make the continuous text file as I understand the process but I'm lost about retaining the tif or bmp's.

Sauron - 20-11-2007 at 23:11

To make the document you have choice of PDF format or DjVu format. PDF scrolls, DjVu uses a navigation popup. I prefer PDF and create mine in Acrobat Pro 8.

What you see normally are the page images, as you scroll through. The OCR is there in the background, but OCR is not all that useful for books and articles chock full of chemical drawings etc., it is mostly useful for searching which I rarely do.

Adobe Acrobat Pro takes you directly and seamlessly from scanning into the final document. If you scan chapter by chapter as I do, you will end up with a group of smaller PDFs that in order, make up the book, typically like this:

Front Matter
Chapter 1
Chapter 2
and so on
Index (if any)

You then assemble these quite easily using Create PDF-> From Multiple Files

The advantage of organizing this way is that you automatically get bookmarks that allow the reader to jump to particular chapters without scrolling through.

Therefore if the chapters have meaningful names, use those as file names and the bookmark pane will be easier to use than if it says "Chapter Five" for example.
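To illustrate how chapter file names turn into bookmarks, here is a rough Python sketch. It is not Acrobat's own mechanism, only the bookkeeping behind it: each chapter file becomes one bookmark whose title is the file name without its extension and whose target is the first page that chapter occupies in the merged book. The function name build_outline, the file names and the page counts are all made up for the example.

```python
from pathlib import PurePath

def build_outline(parts):
    """parts: list of (file_name, page_count) tuples in reading order.

    Returns a list of (bookmark_title, first_page) tuples, with pages
    numbered from 1 in the merged document.
    """
    outline = []
    next_page = 1
    for file_name, page_count in parts:
        title = PurePath(file_name).stem  # "Index.pdf" -> "Index"
        outline.append((title, next_page))
        next_page += page_count
    return outline

if __name__ == "__main__":
    # Hypothetical chapter files and page counts.
    book = [
        ("Front Matter.pdf", 12),
        ("Activation and Coupling.pdf", 30),
        ("Index.pdf", 8),
    ]
    for title, page in build_outline(book):
        print(f"{title}: starts on page {page}")
```

A file called "Activation and Coupling.pdf" gives a bookmark the reader can recognize at a glance, which is the point of using meaningful file names.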

Regarding Microtek: once upon a time they were top of the heap in flatbed scanners, HP being their only real competition. Then some of their own people broke away and started Umax, Canon got into the scanner game, and Microtek's market share dwindled. It was actually Lambda who recommended this particular newer model Microtek with a feature that supposedly makes book scanning easier. And the price is $99. The equivalent HP zero-edge book scanner is over $300. I suppose Microtek is doing a loss-leader trying to regain market share. This model is unknown in the Thai market but the importer is ordering for me.

I want it so that when I scan thicker books, I will no longer have to disassemble (read: destroy) them to scan properly. Thinner books usually scan OK, but big fat ones have too much curl near the spine and text there goes out of focus if not flat on the glass. The s450 is supposed to have licked this problem with patented Zero Boundary technology. Whatever that is.

The HP book scanner by comparison brings the scan pane right to the edge of the scanner so you can lay the page flat on the glass right up to the spine. The rest of the book hangs off the scanner as in the photo Lambda posted above.

I had to destroy this Kosolapoff book to scan it which saddened me; Topics in Phosphorus Chemistry Vol.1 was thin enough that I didn't have to tear it apart.

Actually the two books are not so different in length, but the 1950 Wiley was printed on thick paper while the 1964 Wiley was on thin glossy stock. One is about 1.25 inches thick and the other half an inch.

[Edited on 21-11-2007 by Sauron]

Sauron - 21-11-2007 at 09:44

The Microtek ScanMaker s450 is on amazon.com for $89 new, knocked down from $99

http://www.amazon.com/Scanmaker-S450-Clr-4800X9600-48BIT/dp/...

Customer reviews kvetch about poor performance of driver/software under Mac OS and Vista, FWIW, I'm an XP user so not too worried.

The Plustek OpticBook 3600 is also on amazon for $239, knocked down from $319, but gets really BAD reviews, for both image quality (or its lack) and instability of driver/software. The image quality is described as on a par with bottom end low budget scanners, only.

About the only person who was wildly enthusiastic about the Plustek was someone who sells "books" meaning scanned pdf's on eBay, and he apparently is not concerned with piss poor images. I give my scans away and I AM concerned.

So I think I will risk $89 and try the Microtek and pass on the $240 PlusTek.

PS I finally got local quote for Microtek s450 here in Thailand, $150 and I have to wait 45 days for delivery from Taiwan.



[Edited on 22-11-2007 by Sauron]

[Edited on 22-11-2007 by Sauron]

41OcudceZOL._SS500_.jpg - 26kB

Sauron - 15-1-2008 at 04:30

The Microtek S450 shown above, which I special ordered back in November, has shown up finally and will be delivered Friday morning 18 January.

I will report on its worth as a book scanner. Microtek does not market it as a book scanner, but as a home/office general purpose high resolution scanner. It does have a feature that is supposed to make it very suitable for book scanning. We will see.

chemrox - 15-1-2008 at 23:47

I have found Microtek drivers to be the biggest limitation on the equipment. I like a lot more image control. HP drivers are just as bad... maybe worse. I'd like to find a scanner run on separate driver-image software combinations. I find it ironic that I could get better image output with a 400 dpi Epson than with a 2400 dpi HP or Microtek.

Sauron - 16-1-2008 at 04:10

Fine, but as the Epson, Canon etc do not let you get a decent scan from a bound book, without destroying it, then what the fuck use are they?

THAT is the point of the Microtek S450. Supposedly anyway. I will report on this after delivery on Friday morning.

HP has nothing to do with it. The Plustek pictured upthread and sometimes marketed by HP is an utter piece of rubbish from Taiwan and 3X the money of the Microtek, also from Taiwan. But as far as I know there are no other alleged book scanners. The Microtek goes for $89 from amazon.com, the Plustek $200+. What price does your Epson fetch? And gee, there's that black blob down the spine and the print is all out of focus for the last inch and a half on both sides nearest the spine where the pages are not quite flat on the glass.

THAT is what the Microtek is supposed to eliminate with its Zero Boundary feature. If it works, not bad for $89.

Image control? After the fact with photoshop.

Sauron - 9-2-2008 at 00:17

I have gotten around to testing the ScanMaker S450 as a book scanner. Thus far it has performed with flying colors.

The salient feature of this scanner pointed out by Lambda upthread is something called Zero Boundary technology. This allows printed and bound material to be scanned without distortion of the text closest to the spine. With my elderly Canon scanner, if I placed a book to scan two facing pages, particularly in mid volume, I would get a black blob down the spine and the text left and right of the spine, slightly off the glass, would be curled and out of focus.

Thus I was having to resort to tearing books apart to scan them. I am happy to report NO LONGER.

The Microtek S450 turns in facing pages of a normal bound volume perfectly legible on both sides of the spine and only a faint shadow along the spine itself.

I scanned the test image using default settings at 600 dpi and postprocessed the jpg in Photoshop 7 with merely an auto contrast tweak. Controls inside Microtek's provided software allow user adjustment of brightness, contrast, sharpness, and color/hue/saturation (irrelevant here) so obviously, things can be bettered with a little trial and error. But the image is quite clear as is.

Considering that this scanner sells on Amazon for $89 in USA I believe it is well worth a purchase. I had to special order mine from Taiwan and wait two months to receive it, paying altogether about twice the US discount price. But, what's the choice? I don't wish to go on having to destroy every old and new chemistry book I acquire.

So this inexpensive S450 does indeed appear to fit the bill. I commend it to all of you.

I did not make any effort to hold the book flatter on the glass.

The test scan from middle of this hardbound 400+ page Wiley volume is attached as a pdf. When I first attached it as a jpg the forum displayed it and distorted the thread. So I made a pdf in Acrobat Pro 7.

[Edited on 9-2-2008 by Sauron]

Attachment: test.pdf (702kB)


Sauron - 10-2-2008 at 02:34

Apologies for the double post. A second test of facing pages with the S450, this time in black & white rather than grayscale. This maximizes contrast but there is a black stripe down the spine, easily removed in Photoshop by a simple select and clear operation.

600 dpi. No speckling, a very clean image. Sharpening not required.

Note that along with the max contrast you get a much smaller file size. By default bmp but you can change to jpg when you get rid of the black spine in Photoshop.
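The size difference between black & white and grayscale can be estimated with a little arithmetic. The Python sketch below is only a back-of-envelope estimate under stated assumptions: an A4-sized scan area (8.27 x 11.69 inches, which is an assumption and not the size of the test scan), 1 bit per pixel for black & white, 8 bits per pixel for grayscale, and no compression or file headers. Real jpg or pdf files are compressed, so actual sizes will differ.

```python
import math

# Assumed scan area: A4 in inches (not the actual size of the test scan).
A4_INCHES = (8.27, 11.69)

def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes, with each pixel row rounded up
    to a whole number of bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px

if __name__ == "__main__":
    for label, bits in (("black & white", 1), ("grayscale", 8)):
        size = raw_scan_bytes(*A4_INCHES, dpi=600, bits_per_pixel=bits)
        print(f"{label}: {size / 1_000_000:.1f} MB uncompressed")
```

Before compression the grayscale image carries about eight times the data of the black & white one, which is consistent with the much smaller black & white file.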

[Edited on 11-2-2008 by Sauron]

Attachment: testbw001.pdf (140kB)