Sciencemadness Discussion Board

A massive project

Oxydro - 1-10-2004 at 09:19

I have a plan for a project, which if successful will be enough to keep me busy for a very long time. My concept is to make a huge chemical database, sort of on the same lines as the handbooks, CRC, Merck, etc. It would be searchable by every possible property, and it would have synthesis data for as many of the chemicals as possible.

Now, all that is fairly basic and probably exists elsewhere as well. My big thing is going to be an expert system that, given a desired product and a list of available starting point chemicals, creates a chain of steps to take in order to get to that product. If it failed, it would also tell you why, and what reagents you could add in order to make it possible. Say you were making some compound that contained bromine, but you hadn’t listed anything with bromine in it. The system would then ask, “Would you like to see a list of chemicals that would allow completion of the procedure?” Then when you said yes, it would find possible precursors that would go with the ones you already entered and display the list. Then you would have an option to display only those with possible OTC sources.
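The backward search described above could be sketched roughly as follows. This is only a minimal illustration, assuming a hypothetical `REACTIONS` table that maps each product to alternative sets of precursors; the table entries, the `plan` function, and its return shape are all invented for the example and are not vetted procedures.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Toy reaction table (illustrative entries only, not vetted procedures):
# each product maps to one or more alternative sets of precursors.
REACTIONS: Dict[str, List[Set[str]]] = {
    "bromobenzene": [{"benzene", "bromine"}],
    "bromine": [{"sodium bromide", "sulfuric acid", "manganese dioxide"}],
}

Step = Tuple[str, List[str]]


def plan(target: str, available: Set[str],
         reactions: Dict[str, List[Set[str]]] = REACTIONS,
         seen: FrozenSet[str] = frozenset()) -> Tuple[List[Step], Set[str]]:
    """Work backwards from target and return (steps, missing).

    steps lists (product, precursors) in the order they would be carried out.
    missing is the set of chemicals that are neither available nor makeable.
    """
    if target in available:
        return [], set()
    if target in seen or not reactions.get(target):
        return [], {target}  # dead end or cycle: report it as missing
    best = None
    for route in reactions[target]:
        steps: List[Step] = []
        missing: Set[str] = set()
        for precursor in sorted(route):
            sub_steps, sub_missing = plan(
                precursor, available, reactions, seen | {target})
            steps += sub_steps
            missing |= sub_missing
        steps.append((target, sorted(route)))
        # Prefer the route that leaves the fewest chemicals to acquire.
        if best is None or len(missing) < len(best[1]):
            best = (steps, missing)
    return best


# With only benzene on hand, the plan fails and names what to add.
steps, missing = plan("bromobenzene", {"benzene"})
```

An empty `missing` set means the chain in `steps` is complete; otherwise `missing` is the list the system would offer when asked which chemicals would allow completion. Filtering that list to OTC sources would need an extra lookup table, which is left out here.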

Now it would also be capable of finding you products from a list of precursors. You would enter what you had available, and tell it, find something these can make which has property X and property Y. It would find all the products possible from those precursors, and see if any of them had those properties. If so, it would list them. If not, it would then ask, “Would you like to go a level deeper?”. Supposing you say yes, it would take the list of possible end products, add it to the list of precursors, and repeat the process.
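The level-by-level forward search could look something like the sketch below. It assumes a hypothetical `REACTIONS` table keyed by reagent sets and a hypothetical `PROPERTIES` table of simple flags; both are toy data. The interactive "go a level deeper?" prompt is replaced by a `max_depth` parameter for brevity.

```python
from typing import Callable, Dict, FrozenSet, Set, Tuple

# Toy data (illustrative only): each reagent set maps to one product.
REACTIONS: Dict[FrozenSet[str], str] = {
    frozenset({"sodium chloride", "sulfuric acid"}): "hydrogen chloride",
    frozenset({"hydrogen chloride", "ethanol"}): "chloroethane",
}
PROPERTIES: Dict[str, Dict[str, bool]] = {
    "hydrogen chloride": {"organic": False, "halogenated": True},
    "chloroethane": {"organic": True, "halogenated": True},
}


def one_level(pool: Set[str],
              reactions: Dict[FrozenSet[str], str]) -> Set[str]:
    """Products reachable in one step from the pool, excluding the pool itself."""
    return {product for reagents, product in reactions.items()
            if reagents <= pool and product not in pool}


def search(stock: Set[str],
           wanted: Callable[[Dict[str, bool]], bool],
           reactions: Dict[FrozenSet[str], str] = REACTIONS,
           properties: Dict[str, Dict[str, bool]] = PROPERTIES,
           max_depth: int = 3) -> Tuple[Set[str], int]:
    """Return (matching products, depth at which the search stopped)."""
    pool = set(stock)
    for depth in range(1, max_depth + 1):
        new = one_level(pool, reactions)
        hits = {p for p in new if wanted(properties.get(p, {}))}
        if hits or not new:
            return hits, depth
        pool |= new  # "go a level deeper": products become precursors
    return set(), max_depth


hits, depth = search(
    {"sodium chloride", "sulfuric acid", "ethanol"},
    lambda props: bool(props.get("organic") and props.get("halogenated")),
)
```

Each pass adds the newly found products to the pool, which is the "add it to the list of precursors, and repeat" step. On real data the pool would grow very quickly, so some depth limit or pruning would likely be needed.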

I realise that this system is incredibly ambitious, and there are still many more features that I haven’t mentioned. I do believe, though, with dedicated effort, it would produce a system that would be an incredible tool for the amateur, the student, and the professional chemist. What think you folks? Possible, impossible? Useful, useless? Should I get started or should I forget about it? Anyone want to help? Etc etc.

sorry about how long it is, didn't really realize how much I typed...

Dodoman - 1-10-2004 at 10:53

I think it's a good idea.

But don't start from the beginning. Get a good source of info and add the features you talked about to it.

This way you'd probably save a decade.:cool:

A truly massive undertaking

Hermes_Trismegistus - 1-10-2004 at 11:03

The idea is a noble one.

However, you must realise that such an undertaking would be your life's work.

Your whole life.

Webster... Merck... These were once actual people.

Such a work could only be the result of OBSESSION.

If completed, you would be remembered for centuries.

If half-completed, you would be forgotten before your corpse cooled.

It isn't the idea that you have to question. It is yourself.

Ask yourself "Am I a hard man? Do I have the singlemindedness of purpose of an obsessive-compulsive, tenacity of a stone bulldog and true grit of a dyed-in-the-wool sonofabitch?"

If the answer is yes, we will not be hearing from you again, though we may hear of you, anecdotally.

Polverone - 1-10-2004 at 13:38

This is one of those projects that (in a somewhat reduced form) has been rattling around in my head for some time. Facts themselves are not copyrightable. Arrangements of facts - compilations like the Merck Index or CRC Handbook of Chemistry and Physics - are copyrightable. At least in the US, courts have ruled that there must be a certain degree of creativity present for a compilation to be protected by copyright. I imagine it would be an interesting murky area if you were to scan, OCR, and repackage (say) the physical properties tables from the CRC Handbook and several other handbooks in a database. Regardless of legality it appears that, practically speaking, movie and music companies are the only entities that frequently initiate legal action against people for non-commercial copyright infringement.

Or, here's another clever idea to avoid copyright trouble: instead of distributing a database compiled from several references, distribute the program that creates that database from electronic source materials. Major data handbooks are already available electronically, and in the future more will no doubt be available in electronic format. The end database would be more useful than the sum of its parts, and you could leave it up to individuals to collect the "parts" that it takes as input.

There is no doubt tremendous potential for "data mining" useful facts out of the vast volumes of digitized journals, as well, but those databases are not available in their entirety to mere mortals and their providers take a dim view of mass downloading. However, given a large number of (say) PDF documents, some of which are known to contain interesting facts, it would be an interesting problem to try to automate or semi-automate aggregation of facts from them. A similar project of even larger scale would be to do the same with the www. With journals you have peer review on your side, with the www you have easier-to-parse data and openness on your side.

Mmmm, but I'm wandering too far afield in my thoughts. Basically, given some compilations of chemical data already in electronic format, I think it's a project of reasonable scope to extract common properties like melting point and density, tag the origin of each fact, weed out duplicate entries, and provide an easily-queried database that is far more useful than the original compilations.
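The extract, tag, and de-duplicate steps above could be sketched as below. This is a minimal illustration, assuming each source has already been reduced to `(compound, property, value)` tuples; the source names `handbook_a` and `handbook_b`, the property keys, and the numeric values are placeholders, not data taken from any real handbook.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Fact = Tuple[str, str, float]  # (compound, property, value)
Merged = DefaultDict[Tuple[str, str], DefaultDict[float, List[str]]]


def merge(sources: Dict[str, List[Fact]]) -> Merged:
    """Merge facts from all sources, keeping the origin of every value.

    Result maps (compound, property) -> {value: [sources reporting it]}.
    Identical values from different sources collapse into one entry;
    conflicting values stay side by side so they can be reviewed.
    """
    merged: Merged = defaultdict(lambda: defaultdict(list))
    for source, facts in sources.items():
        for compound, prop, value in facts:
            key = (compound.strip().lower(), prop)  # crude name normalisation
            if source not in merged[key][value]:
                merged[key][value].append(source)
    return merged


# Placeholder inputs; names and numbers are for illustration only.
SOURCES: Dict[str, List[Fact]] = {
    "handbook_a": [("Benzene", "mp_c", 5.5),
                   ("Benzene", "density_g_cm3", 0.8765)],
    "handbook_b": [("benzene ", "mp_c", 5.5),
                   ("benzene", "mp_c", 5.53)],
}
db = merge(SOURCES)
```

Exact float matching is used here only to keep the sketch short. Real handbook values would probably need a tolerance, unit handling, and a much better way of recognising that two names refer to the same compound.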

Providing more comprehensive information would be far more ambitious and difficult, and deserving of another suite of specialized search programs to aid you. The more sources you draw from, and the less trivial the information you collect, the more danger that you will never be able to make your project completely public, usable, and legal.

There has already been substantial work on automatic synthesis planning. See for example http://www.risc.uni-linz.ac.at/people/blurock/courses/caos/c....

Marvin - 1-10-2004 at 21:43

I think it's a very nice fantasy, but a terrible idea.

The problem 90%+ of chemists (let alone home chemists) have is not that they have so much information they need a single search engine to sift the data for them; it's that they don't have access, or in the case of professional chemists easy access, to the information in the first place.

The simplest conceptual solution is to put all the data man has into a single place and get a computer to archive it. Unfortunately, it's also infeasible and illegal.

I see 2 main problems really at the root of the idea,

1. Data in books not in electronic format in or out of copyright.

2. Data in electronic form with restricted access, most often requiring paid subscription, but also those that only cater to non profit institutions and/or certain catchment areas.

Computer-aided synthesis programs are available but are of questionable use, so I don't see this as a problem, unless someone is expecting to be able to do at home what commercial companies have done, only better.

Scanning will help solve problem 1. Legality isn't the issue; plenty of useful stuff is coming out of copyright. It's a matter of knowing the law very accurately and being happy with material 50 or 70 or 100 years old. The out-of-copyright material will continue to swamp anyone wanting to scan books; the only real problem here apart from manpower is that people want to scan the new books first.

Attempting to recycle copyrighted material into non-copyrightable material is going nowhere without massive qualified human resources that all understand the way the law works. Otherwise you risk someone proving, at any time in the future, that you broke someone's copyright, and the whole database could be ordered to be destroyed.

Solving problem 2 legally requires a lot of money, but one way would be to form a virtual library, with members giving a certain amount of money a month, and access the sites through a proxy server.

Of more questionable legality would be to set up a proxy server inside the zone of an existing subscriber, say a university, and use that.

I'm sorry to rant like I should end it with 'The great Oz has spoken', and if anything the most important point has already been made by Hermes. If you spent an entire year scanning well-chosen books you could massively contribute to the free online chemistry information. If you put the same effort into your database the result at the end of the year would still be useless. I'd estimate it would take one person 10 years+ to assemble a database of equivalent usefulness to the Dictionary of Organic Compounds, and all that has is a molecular structure, a melting point, maybe a boiling point, a few derivatives and some references to the real information you actually want: prep, IR, NMR, etc.

My advice: pick something that will be useful after a few weeks' work and doesn't rely on the availability of anything else. Also, tell people what you are doing as you are doing it, so no one else attempts the same thing.

tokat - 1-10-2004 at 23:45

Mirror a site like orgsyn. You might want to think about adding encryption?

Polverone - 2-10-2004 at 00:42

It's true that scanning the right old books can be very useful. Unfortunately, copyright is a moving target that keeps growing in duration. I believe the Project Gutenberg rule of thumb is that only works from before 1923 are guaranteed to be in the public domain... it can be difficult to establish the status of later works that may or may not have had copyrights renewed. Limiting oneself to pre-1923 works eliminates a lot of potentially valuable material.

At the universities I've attended, the subscription electronic services were available to anybody who cared to wander into the library and sit down at a computer. You needed to be student/staff only to access those resources from non-library computers. Documents can be saved to disks or emailed to oneself from the library. Assuming proximity to a university, many people actually have access to expensive subscription services even if they're not students.

There's also always the option of ignoring the law, and distributing your database/software under cloak of anonymity. The history of movie and MP3 distribution, software piracy since the days of the Apple ][, etc. shows that this is a viable if not elegant way to go.

I would agree that the grander version of this database (compiling as much information as possible, including descriptions of syntheses) would be a life's work. I think that a reduced version that aims mainly for numerically-defined values would be much easier to achieve. The push for electronic access means that publishers themselves are providing digitized versions of more and more works. Once a single person sympathetic to the project has obtained a copy, it's effectively as if everybody has obtained a copy. See Axehandle's FTP site for evidence.

So, our Plucky Pirate collects ebooks or database screen-scrapings (originally accessed by himself or just someone willing to share) and writes a little parsing program to extract the desired data from each source. Extracted data is sorted, duplicate-cleared, and merged to form the database. Updated versions of the database are periodically made (anonymously) available to Plucky Pirate's friends, who can then spread it as they see fit. It may still be a decade-long project, but very little of that decade is taken up with PP expending effort on the project. He waits for a work to become available in electronic format, then himself spends only an hour or so writing another interface program to extract the desired pieces and add it to the growing database. Apart from the initial overhead of creating database merge and search programs, relatively little effort needs to be expended to distill each new book into database-friendly format.

I actually began work on a precursor to this very concept around January. I took the physical constants of organic compounds table from the CRC Handbook of Chemistry and Physics and tried to break it up into fields that I could store and search. Unfortunately, all I had to parse the PDF with was the free program pdftotext, which of course obliterates most formatting information when it converts to plain text. That ultimately derailed my efforts since I had no way to tell if a property entry had been left blank (and hence couldn't keep every entry in its proper column).
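The blank-field problem described above goes away if the text conversion keeps each column at a fixed character offset (for instance via a layout-preserving mode such as pdftotext's `-layout` option, if the installed version has it). The sketch below assumes that alignment survives, which is exactly what was missing in the conversion described here. The `COLUMNS` layout, the field names, and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical column layout: (field name, start offset, end offset).
COLUMNS: List[Tuple[str, int, int]] = [
    ("name", 0, 16),
    ("formula", 16, 26),
    ("mp_c", 26, 34),
    ("bp_c", 34, 42),
]


def parse_row(line: str) -> Dict[str, Optional[str]]:
    """Slice a fixed-width line into fields; blank cells become None."""
    row: Dict[str, Optional[str]] = {}
    for field, start, end in COLUMNS:
        text = line[start:end].strip()
        row[field] = text or None
    return row


def format_row(*cells: str) -> str:
    """Build an aligned sample line (stands in for converter output)."""
    return "".join(cell.ljust(end - start)
                   for cell, (_, start, end) in zip(cells, COLUMNS))


# Illustrative rows; the second has no melting point entry.
SAMPLE = [
    format_row("Benzene", "C6H6", "5.5", "80.1"),
    format_row("Ethanol", "C2H6O", "", "78.3"),
]
rows = [parse_row(line) for line in SAMPLE]
```

Splitting the second sample line on whitespace yields only three tokens, so the boiling point would slide into the melting point column. Slicing by offset keeps `mp_c` empty and `bp_c` in its proper place. Real tables would also need the offsets detected per page rather than hard-coded.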

If I had a more effective way to parse PDF available to me, a searchable personal CRC database would already be in my possession. Simple full-text searches like you can do in Acrobat Reader are much less powerful than searches on a database. Right now, I find it about as easy to flip through my paper CRC handbook as it is to connect to the library (or even open my locally downloaded copy) and search in Acrobat. I think that would change if I had a dedicated database/frontend for searching out substance properties.

If I find a better way to parse PDF (I know commercial libraries exist, but I don't have them) I will take this project off the back burner and see how far my programming can take my imagination.

[Edited on 10-2-2004 by Polverone]

JohnWW - 2-10-2004 at 01:37

A good idea - if you are independently wealthy, living off investments, and can afford to do nothing else for a year or more. I would start with getting the International Critical Tables (only recently released in electronic form) into an easily searchable form, and add the other information databases like CRC Handbook of Chemistry & Physics, other CRC handbooks, Beilstein, Perry, etc. to it.

And you would have to come up with something sufficiently original to avoid copyright limitations. BTW, because the International Critical Tables were originally published in the 1920s, the copyright in them should by now have expired in nearly all countries.

John W.

vulture - 2-10-2004 at 02:15

I'd say limit yourself.

A good one to start with would be to combine all formation enthalpies and Gibbs energies of as many compounds as possible, because it's usually troublesome to find that data for non-trivial chemicals.

Oxydro - 2-10-2004 at 03:34

I have, much as I hate to admit it, never looked at any of those handbooks... up until about a month ago I had been living far (150km) from any university or indeed any significant library. Can those who have tell me how much of the volume is taken up by real chemical data, and how much is taken up by formatting, other information, etc?

If you think about it, there are only so many pages in each of the references (the Merck Index, for example, is 2564 pages and the CRC Handbook is 2608). A fair typist should be able to enter at least several pages of data per hour, one would think. Assuming a rate of 5 pages an hour, that works out to about 520 hours of data entry for the CRC (assuming all of every page is relevant!). That is quite a bit less than everyone where I lived worked to make their hours (to collect unemployment), and they managed to do that in 2 months. I may be being optimistic with my pages/hour, but if OCRing it were possible, the rate of data entry would be much greater.
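The estimate above is simple enough to check directly. The page counts and the 5 pages per hour rate come from the post; the 40-hour working week is an added assumption, used only to turn hours into weeks.

```python
def entry_hours(pages: int, pages_per_hour: float) -> float:
    """Hours of typing needed if every page is entered at a steady rate."""
    return pages / pages_per_hour


PAGE_COUNTS = {"CRC Handbook": 2608, "Merck Index": 2564}  # from the post
PAGES_PER_HOUR = 5       # rate assumed in the post
HOURS_PER_WEEK = 40      # assumption: full-time data entry

for title, pages in PAGE_COUNTS.items():
    hours = entry_hours(pages, PAGES_PER_HOUR)
    weeks = hours / HOURS_PER_WEEK
    print(f"{title}: {hours:.1f} hours, about {weeks:.1f} weeks")
```

For the CRC Handbook this gives 2608 / 5 = 521.6 hours, which matches the "about 520 hours" figure, or roughly 13 full-time weeks. The result scales inversely with the rate, so halving the pages per hour doubles the time.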

I'm trying to think of a useful, innovative way to represent chemical data online, and I'm going through ideas at a great rate.

One thought is that a refinement of this system would be to create a database of syntheses alone. Each "node" in the database would be for either a single chemical, or for a class of chemicals. The node would also contain the precursors necessary, links to references if available (locally hosted copies if possible) and maybe a rating on suitability for industry, the amateur lab, professionals. This would then be "very searchable" so you could look up "product X which uses Z and Y and a maximum of 2 other chemicals, not counting organic solvents," for instance.
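One way such a node and the example query might be represented is sketched below. The `Synthesis` record, its field names, and the `find` helper are hypothetical; solvents are stored separately so that they do not count towards the limit on other chemicals. The letters X, Y, Z, A, B, C are placeholders, as in the post.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Synthesis:
    """One node: a product, what it is made from, and supporting details."""
    product: str
    precursors: FrozenSet[str]
    solvents: FrozenSet[str] = frozenset()   # kept apart from precursors
    references: Tuple[str, ...] = ()
    suitability: str = "amateur"             # e.g. industry/amateur/professional


def find(db: Iterable[Synthesis], product: str,
         must_use: Iterable[str], max_others: int = 2) -> List[Synthesis]:
    """Syntheses of product that use all of must_use and at most
    max_others further precursors (solvents are not counted)."""
    required = set(must_use)
    return [s for s in db
            if s.product == product
            and required <= s.precursors
            and len(s.precursors - required) <= max_others]


# Placeholder entries only.
DB = [
    Synthesis("X", frozenset({"Y", "Z", "A"}),
              solvents=frozenset({"diethyl ether"})),
    Synthesis("X", frozenset({"Y", "Z", "A", "B", "C"})),
    Synthesis("X", frozenset({"Y", "A"})),
]
matches = find(DB, "X", {"Y", "Z"})
```

Here only the first entry matches: the second needs three other chemicals and the third does not use Z. Nodes for a class of chemicals, rather than a single compound, would need some way of matching a compound to its class, which this sketch does not attempt.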

The idea here isn't to make it complete, it's to be just like other references which detail a large number of syntheses, except larger and easier to access.

This is irrelevant trivia, but Hermes mentioned Webster, the person.... I found out yesterday that the whole Webster family (of lexicographic fame) own summer houses around the town where I live... they were having a wedding recently at a resort within sight of my house, and they booked up the entire (5 star, they claim) resort for a week.

I think that I'll (for now) go with the suggestions of something smaller, and try to make something at least a bit useful, as fast as possible.

Hmmm

Organikum - 2-10-2004 at 05:01

That's a secret project of mine... LOL

No time to go into details now, but I believe I have worked out an interesting way to solve acquisition and copyright problems.

Based on a system of mutual help and storing the results.


Later more.
ORG

Eliteforum - 2-10-2004 at 15:09

Oh! You tease!

sarcosuchus - 3-10-2004 at 12:39

I for one think that the idea has some merit, but the point about it taking years to complete is, if anything, understated. The rapid growth in science would require near-daily updating of the data; just to keep up would take a lifetime, and a dozen T-1 lines.:D But the idea of a program to digest, say, PDF files is I think much more practical in the long term. What would be nice is if, say, we had 4-5 code geeks, caffeine IVs and a couple of real hot babes that would give it their all in the name of science; in two or three months we would be set:D:D:D

One thing that we all can do till there is this superchem9000 is collect all the ebooks/programs and any useful web-related stuff we can find to feed into this database... I for one can't wait till there are 50 gig DVD burners, if nothing else to clean up the clutter.

solo - 25-4-2008 at 05:45

Oxydro ...........it's four years later ....what have you done or accomplished .........solo