Oxydro
Hazard to Others
Posts: 152
Registered: 24-5-2004
Location: NS, Canada
Member Is Offline
Mood: distracted
|
|
A massive project
I have a plan for a project which, if successful, will be enough to keep me busy for a very long time. My concept is to make a huge chemical database,
along the same lines as the handbooks: CRC, Merck, etc. It would be searchable by every possible property, and it would have synthesis data for
as many of the chemicals as possible.
Now, all that is fairly basic and probably exists elsewhere as well. My big thing is going to be an expert system that, given a desired product and a
list of available starting chemicals, creates a chain of steps to take in order to get to that product. If it failed, it would also tell you
why, and what reagents you could add in order to make it possible. Say you were making some compound that contained bromine, but you
hadn’t listed anything with bromine in it. The system would then ask, “Would you like to see a list of chemicals that would allow completion of
the procedure?” Then when you said yes, it would find possible precursors that would go with the ones you already entered, and display the list. Then
you would have an option to display only those with possible OTC sources.
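To make the idea concrete, here is a minimal sketch of the "work backwards from the product" search. Everything in it is an assumption for illustration: the `REACTIONS` table, the compound names ("A", "B", "Br-source", ...) and the `plan` function are hypothetical placeholders, not real chemistry or an existing program. It returns the chain of steps it found plus the set of chemicals it could not reach, which is what the "would you like to see what is missing?" prompt would be built on.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Step = Tuple[str, List[str]]  # (product, precursors used to make it)

# Hypothetical reaction table: product -> alternative sets of precursors.
REACTIONS: Dict[str, List[Set[str]]] = {
    "D": [{"B", "C"}],
    "C": [{"A", "Br-source"}],
}

def plan(target: str, available: Set[str],
         reactions: Dict[str, List[Set[str]]],
         _in_progress: FrozenSet[str] = frozenset()) -> Tuple[List[Step], Set[str]]:
    """Return (ordered steps, missing chemicals) for making `target`."""
    if target in available:
        return [], set()
    # No known route (or a circular route): report the target itself as missing.
    if not reactions.get(target) or target in _in_progress:
        return [], {target}
    best = None
    for precursors in reactions[target]:
        steps: List[Step] = []
        missing: Set[str] = set()
        for p in sorted(precursors):
            sub_steps, sub_missing = plan(p, available, reactions,
                                          _in_progress | {target})
            steps += [s for s in sub_steps if s not in steps]
            missing |= sub_missing
        steps.append((target, sorted(precursors)))
        # Prefer the route that leaves the fewest chemicals unaccounted for.
        if best is None or len(missing) < len(best[1]):
            best = (steps, missing)
    return best

steps, missing = plan("D", {"A", "B"}, REACTIONS)
print(steps)    # [('C', ['A', 'Br-source']), ('D', ['B', 'C'])]
print(missing)  # {'Br-source'}
```

An empty `missing` set means the chain can be completed from what was entered; a non-empty one is the list of things to go shopping for. Filtering that list down to OTC sources would need an extra availability field per chemical, which is left out here.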
Now it would also be capable of finding you products from a list of precursors. You would enter what you had available, and tell it, find something
these can make which has property X and property Y. It would find all the products possible from those precursors, and see if any of them had those
properties. If so, it would list them. If not, it would then ask, “Would you like to go a level deeper?”. Supposing you say yes, it would take the
list of possible end products, add it to the list of precursors, and repeat the process.
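The "go a level deeper" loop could look roughly like the sketch below. It is only an illustration under assumptions: the reaction list, the property table and the names `expand_once` and `find_products` are all hypothetical, and the values are made-up placeholders.

```python
from typing import Callable, Dict, List, Set, Tuple

Reaction = Tuple[frozenset, str]  # (required precursors, product)

def expand_once(stock: Set[str], reactions: List[Reaction]) -> Set[str]:
    """Products makeable in one step from `stock` that are not already in it."""
    return {prod for needs, prod in reactions
            if needs <= stock and prod not in stock}

def find_products(stock: Set[str], reactions: List[Reaction],
                  properties: Dict[str, dict],
                  wanted: Callable[[dict], bool],
                  max_levels: int) -> Tuple[int, List[str]]:
    """Return (level searched, matching products); the list is empty if none found."""
    stock = set(stock)
    level = 0
    for level in range(1, max_levels + 1):
        new = expand_once(stock, reactions)
        if not new:
            break  # nothing further can be made
        hits = sorted(p for p in new if wanted(properties.get(p, {})))
        if hits:
            return level, hits
        stock |= new  # treat this level's products as precursors for the next
    return level, []

# Placeholder data, not real chemistry.
reactions = [(frozenset({"A", "B"}), "C"), (frozenset({"A", "C"}), "D")]
properties = {"C": {"mp_c": 10}, "D": {"mp_c": 150}}
print(find_products({"A", "B"}, reactions, properties,
                    lambda p: p.get("mp_c", 0) > 100, max_levels=3))
# (2, ['D'])
```

In an interactive version `max_levels` would be replaced by the yes/no prompt after each level. The stock grows quickly with each level, so some limit is probably needed either way.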
I realise that this system is incredibly ambitious, and there are still many more features that I haven’t mentioned. I do believe, though, with
dedicated effort, it would produce a system that would be an incredible tool for the amateur, the student, and the professional chemist. What think
you folks? Possible, impossible? Useful, useless? Should I get started or should I forget about it? Anyone want to help? Etc etc.
sorry about how long it is, didn't really realize how much I typed...
|
|
Dodoman
Hazard to Self
Posts: 79
Registered: 2-8-2004
Member Is Offline
Mood: No Mood
|
|
I think it's a good idea.
But don't start from the beginning. Get a good source of info and add the features you talked about to it.
This way you'd probably save a decade.
|
|
Hermes_Trismegistus
National Hazard
Posts: 602
Registered: 27-11-2003
Location: Greece, Ancient
Member Is Offline
Mood: conformation:ga
|
|
A truly massive undertaking
The idea is a noble one.
However, you must realise that such an undertaking would be your life's work.
Your whole life.
Webster..... Merck.... These were once actual people.
Such a work could only be the result of OBSESSION.
if completed, you would be remembered for centuries.
If half-completed, you would be forgotten before your corpse cooled.
It isn't the idea that you have to question. It is yourself.
Ask yourself, "Am I a hard man? Do I have the singlemindedness of purpose of an obsessive-compulsive, the tenacity of a stone bulldog and the true grit of
a dyed-in-the-wool sonofabitch?"
If the answer is yes, we will not be hearing from you again, though we may hear of you, anecdotally.
Arguing on the internet is like running in the special olympics; even if you win: you're still retarded.
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
This is one of those projects that (in a somewhat reduced form) has been rattling around in my head for some time. Facts themselves are not
copyrightable. Arrangements of facts - compilations like the Merck Index or CRC Handbook of Chemistry and Physics - are copyrightable. At least in the
US, courts have ruled that there must be a certain degree of creativity present for a compilation to be protected by copyright. I imagine it would be
an interesting murky area if you were to scan, OCR, and repackage (say) the physical properties tables from the CRC Handbook and several other
handbooks in a database. Regardless of legality it appears that, practically speaking, movie and music companies are the only entities that frequently
initiate legal action against people for non-commercial copyright infringement.
Or, here's another clever idea to avoid copyright trouble: instead of distributing a database compiled from several references, distribute the
program that creates that database from electronic source materials. Major data handbooks are already available electronically, and
in the future more will no doubt be available in electronic format. The end database would be more useful than the sum of its parts, and you could
leave it up to individuals to collect the "parts" that it takes as input.
There is no doubt tremendous potential for "data mining" useful facts out of the vast volumes of digitized journals, as well, but those
databases are not available in their entirety to mere mortals and their providers take a dim view of mass downloading. However, given a large number
of (say) PDF documents, some of which are known to contain interesting facts, it would be an interesting problem to try to automate or semi-automate
aggregation of facts from them. A similar project of even larger scale would be to do the same with the www. With journals you have peer review on
your side, with the www you have easier-to-parse data and openness on your side.
Mmmm, but I'm wandering too far afield in my thoughts. Basically, given some compilations of chemical data already in electronic format, I think
it's a project of reasonable scope to extract common properties like melting point and density, tag the origin of each fact, weed out duplicate
entries, and provide an easily-queried database that is far more useful than the original compilations.
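A rough sketch of the "tag the origin, weed out duplicates" step, just to show the shape of it. The `Fact` record, the `merge_facts` function, the tolerance and the sample values are all assumptions made up for illustration; the source names are placeholders, not claims about what any handbook actually lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

class Fact(NamedTuple):
    compound: str
    prop: str      # e.g. "mp_c" for melting point in degrees C
    value: float
    source: str    # where the fact came from

def merge_facts(facts: Iterable[Fact], tol: float = 0.05) -> Dict[Tuple[str, str], List[list]]:
    """Group facts by (compound, property).

    Values within `tol` of an existing entry count as duplicates and only
    add their source to that entry; disagreeing values are kept side by side.
    """
    merged = defaultdict(list)  # key -> list of [value, [sources]]
    for f in facts:
        key = (f.compound.strip().lower(), f.prop)
        for entry in merged[key]:
            if abs(entry[0] - f.value) <= tol:
                if f.source not in entry[1]:
                    entry[1].append(f.source)
                break
        else:
            merged[key].append([f.value, [f.source]])
    return dict(merged)

facts = [
    Fact("Compound-X", "mp_c", 5.5, "source_a"),
    Fact("compound-x ", "mp_c", 5.5, "source_b"),
    Fact("compound-x", "mp_c", 6.1, "source_c"),
]
print(merge_facts(facts))
# {('compound-x', 'mp_c'): [[5.5, ['source_a', 'source_b']], [6.1, ['source_c']]]}
```

Keeping conflicting values with their sources, rather than silently picking one, is a design choice: it lets the reader see where the references disagree. Matching compound names across sources is much harder than the lowercase-and-strip shown here.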
Providing more comprehensive information would be far more ambitious and difficult, and deserving of another suite of specialized search programs to
aid you. The more sources you draw from, and the less trivial the information you collect, the more danger that you will never be able to make your
project completely public, usable, and legal.
There has already been substantial work on automatic synthesis planning. See for example http://www.risc.uni-linz.ac.at/people/blurock/courses/caos/c....
PGP Key and corresponding e-mail address
|
|
Marvin
National Hazard
Posts: 995
Registered: 13-10-2002
Member Is Offline
Mood: No Mood
|
|
I think it's a very nice fantasy, but a terrible idea.
The problem 90%+ of chemists (let alone home chemists) have is not that they have so much information they need a single search engine to sift the
data for them; it's that they don't have access, or in the case of professional chemists easy access, to the information in the first place.
The simplest conceptual solution is to put all the data man has into a single place and get a computer to archive it. Unfortunately, it's also
infeasible and illegal.
I see 2 main problems really at the root of the idea:
1. Data in books not in electronic format, in or out of copyright.
2. Data in electronic form with restricted access, most often requiring paid subscription, but also those that only cater to non-profit institutions
and/or certain catchment areas.
Computer-aided synthesis programs are available but are of questionable use, so I don't see this as a problem, unless someone is expecting to be able
to do at home what commercial companies have done, only better.
Scanning will help solve problem 1. Legality isn't the issue; plenty of useful stuff is coming out of copyright. It's a matter of knowing the law very
accurately and being happy with material 50 or 70 or 100 years old. The out-of-copyright material will continue to swamp anyone wanting to scan books;
the only real problem here apart from manpower is that people want to scan the new books first.
Attempting to recycle copyrighted material into non-copyrightable form is going nowhere without massive qualified human resources that all understand the
way the law works. Otherwise you risk at any time in the future someone proving you broke someone's copyright, and the whole database could be ordered
to be destroyed.
Problem 2 requires a lot of money to solve legally, but one way would be to form a virtual library, with each member giving a certain amount of money a month, and
access the sites through a proxy server.
Of more questionable legality would be to set up a proxy server inside the zone of an existing subscriber, say a university, and use that.
I'm sorry to rant like I should end it with 'The great Oz has spoken', and if anything the most important point has already been made
by Hermes. If you spent an entire year scanning well-chosen books you could massively contribute to the free online chemistry information. If you put
the same effort into your database the result at the end of the year would still be useless. I'd estimate it would take one person 10 years+ to
assemble a database of equivalent usefulness to the Dictionary of Organic Compounds, and all that has is a molecular structure, a melting point, maybe a
boiling point, a few derivatives and some references to the real information you actually want: prep, IR, NMR, etc.
My advice: pick something that will be useful after a few weeks' work and doesn't rely on the availability of anything else. Also, tell people what you
are doing, as you are doing it, so no one else attempts the same thing.
|
|
tokat
Hazard to Self
Posts: 64
Registered: 1-6-2004
Location: 6feet south
Member Is Offline
Mood: No Mood
|
|
Mirror a site like orgsyn. You might want to think about adding encryption?
|
|
Polverone
Now celebrating 21 years of madness
Posts: 3186
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline
Mood: Waiting for spring
|
|
It's true that scanning the right old books can be very useful. Unfortunately, copyright is a moving target that keeps growing in duration. I
believe the Project Gutenberg rule of thumb is that only works from before 1923 are guaranteed to be in the public domain... it can be difficult to
establish the status of later works that may or may not have had copyrights renewed. Limiting oneself to pre-1923 works eliminates a lot of
potentially valuable material.
At the universities I've attended, the subscription electronic services were available to anybody who cared to wander into the library and sit
down at a computer. You needed to be student/staff only to access those resources from non-library computers. Documents can be saved to disks or
emailed to oneself from the library. Assuming proximity to a university, many people actually have access to expensive subscription services even if
they're not students.
There's also always the option of ignoring the law, and distributing your database/software under cloak of anonymity. The history of movie and
MP3 distribution, software piracy since the days of the Apple ][, etc. shows that this is a viable if not elegant way to go.
I would agree that the grander version of this database (compiling as much information as possible, including descriptions of syntheses) would be a
life's work. I think that a reduced version that aims mainly for numerically-defined values would be much easier to achieve. The push for
electronic access means that publishers themselves are providing digitized versions of more and more works. Once a single person sympathetic to the
project has obtained a copy, it's effectively as if everybody has obtained a copy. See Axehandle's FTP site for evidence.
So, our Plucky Pirate collects ebooks or database screen-scrapings (originally accessed by himself or just someone willing to share) and writes a
little parsing program to extract the desired data from each source. Extracted data is sorted, duplicate-cleared, and merged to form the database.
Updated versions of the database are periodically made (anonymously) available to Plucky Pirate's friends, who can then spread it as they see
fit. It may still be a decade-long project, but very little of that decade is taken up with PP expending effort on the project. He waits for a work to
become available in electronic format, then himself spends only an hour or so writing another interface program to extract the desired pieces and add
it to the growing database. Apart from the initial overhead of creating database merge and search programs, relatively little effort needs to be
expended to distill each new book into database-friendly format.
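As a sketch of how little glue each new source might need once the merge and search parts exist: one small parser function per source, feeding a shared SQLite table. The two input formats, the `PARSERS` registry, the table layout and the sample values are all hypothetical, chosen only to show the pattern; `sqlite3` is the standard-library module.

```python
import sqlite3
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[str, str, float]  # (compound, property, value)

def parse_source_a(text: str) -> Iterator[Row]:
    """Hypothetical format: 'name;melting point;density' on each line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        name, mp, density = line.split(";")
        yield name, "mp_c", float(mp)
        yield name, "density", float(density)

def parse_source_b(text: str) -> Iterator[Row]:
    """Hypothetical format: 'name<TAB>property<TAB>value' on each line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        name, prop, value = line.split("\t")
        yield name, prop, float(value)

# Adding a new source means adding one function and one entry here.
PARSERS: Dict[str, Callable[[str], Iterator[Row]]] = {
    "source_a": parse_source_a,
    "source_b": parse_source_b,
}

def build_db(texts: Dict[str, str]) -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE facts (compound TEXT, prop TEXT, value REAL, "
               "source TEXT, UNIQUE (compound, prop, value, source))")
    for source, text in texts.items():
        rows = [(c.lower(), p, v, source) for c, p, v in PARSERS[source](text)]
        # The UNIQUE constraint plus OR IGNORE drops exact repeats on re-import.
        db.executemany("INSERT OR IGNORE INTO facts VALUES (?, ?, ?, ?)", rows)
    return db

def compounds_with(db: sqlite3.Connection, prop: str,
                   lo: float, hi: float) -> List[str]:
    """Names of compounds whose `prop` lies in [lo, hi] according to any source."""
    cur = db.execute("SELECT DISTINCT compound FROM facts "
                     "WHERE prop = ? AND value BETWEEN ? AND ? ORDER BY compound",
                     (prop, lo, hi))
    return [row[0] for row in cur]
```

A range query such as `compounds_with(db, "mp_c", 0, 50)` is the kind of search that a plain full-text search cannot do.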
I actually began work on a precursor to this very concept around January. I took the physical constants of organic compounds table from the CRC
Handbook of Chemistry and Physics and tried to break it up into fields that I could store and search. Unfortunately, all I had to parse the PDF with
was the free program pdftotext, which of course obliterates most formatting information when it converts to plain text. That ultimately derailed my
efforts since I had no way to tell if a property entry had been left blank (and hence couldn't keep every entry in its proper column).
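For what it's worth, the blank-field problem goes away if the text dump keeps its column positions (some versions of pdftotext have a `-layout` option intended for this; whether it copes with the CRC tables is untested here). The sketch below assumes such a dump. The column offsets and the sample row are invented for illustration; real offsets would have to be measured from the actual output.

```python
from typing import Dict, List, Optional

# Hypothetical column layout: (field name, start offset, end offset).
COLUMNS = [("name", 0, 12), ("mp_c", 12, 20), ("bp_c", 20, 28), ("density", 28, 36)]

def parse_fixed_width(line: str) -> Dict[str, Optional[str]]:
    """Slice a line by character position; an empty slice means a blank field."""
    line = line.ljust(COLUMNS[-1][2])  # pad short lines so every slice exists
    record: Dict[str, Optional[str]] = {}
    for field_name, start, end in COLUMNS:
        cell = line[start:end].strip()
        record[field_name] = cell or None
    return record

def parse_by_whitespace(line: str) -> List[str]:
    """What you get once the layout is gone: values with no column identity."""
    return line.split()

# A row with the boiling point left blank (placeholder values).
row = f"{'compound-x':<12}{'5.5':<8}{'':<8}{'0.88':<8}"

print(parse_by_whitespace(row))
# ['compound-x', '5.5', '0.88']   <- is 0.88 the boiling point or the density?
print(parse_fixed_width(row))
# {'name': 'compound-x', 'mp_c': '5.5', 'bp_c': None, 'density': '0.88'}
```

This only works while entries stay inside their columns; names that overflow or wrap onto a second line would need extra handling.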
If I had a more effective way to parse PDF available to me, a searchable personal CRC database would already be in my possession. Simple full-text
searches like you can do in Acrobat Reader are much less powerful than searches on a database. Right now, I find it about as easy to flip through my
paper CRC handbook as it is to connect to the library (or even open my locally downloaded copy) and search in Acrobat. I think that would change if I
had a dedicated database/frontend for searching out substance properties.
If I find a better way to parse PDF (I know commercial libraries exist, but I don't have them) I will take this project off the
back burner and see how far my programming can take my imagination.
[Edited on 10-2-2004 by Polverone]
|
|
JohnWW
International Hazard
Posts: 2849
Registered: 27-7-2004
Location: New Zealand
Member Is Offline
Mood: No Mood
|
|
A good idea - if you are independently wealthy, living off investments, and can afford to do nothing else for a year or more. I would start with
getting the International Critical Tables (only recently released in electronic form) into an easily searchable form, and add the other information
databases like CRC Handbook of Chemistry & Physics, other CRC handbooks, Beilstein, Perry, etc. to it.
And you would have to come up with something sufficiently original to avoid copyright limitations. BTW, because the International Critical Tables were
originally published in the 1920s, the copyright in them should by now have expired in nearly all countries.
John W.
|
|
vulture
Forum Gatekeeper
Posts: 3330
Registered: 25-5-2002
Location: France
Member Is Offline
Mood: No Mood
|
|
I'd say limit yourself.
A good one to start with would be to combine all formation enthalpies and Gibbs energies of as many compounds as possible, because it's usually
troublesome to find that data for non trivial chemicals.
One shouldn't accept or resort to the mutilation of science to appease the mentally impaired.
|
|
Oxydro
Hazard to Others
Posts: 152
Registered: 24-5-2004
Location: NS, Canada
Member Is Offline
Mood: distracted
|
|
I have, much as I hate to admit it, never looked at any of those handbooks... up until about a month ago I had been living far (150km) from any
university or indeed any significant library. Can those who have, tell me, how much of the volume is taken up by real chemical data, and how much is
taken up by formatting, other information, etc?
If you think about it, there are only so many pages in each of the references (The Merck Index, for example, is 2564 pages and the CRC Handbook is
2608). A fair typist should be able to enter at least several pages of data per hour, one would think. Assuming a rate of 5 pages an hour, that works
out to about 520 hours of data entry for the CRC (assuming all of every page is relevant!). That is quite a bit less than everyone where I lived
worked to make their hours (to collect unemployment), and they managed to do that in 2 months. I may be being optimistic with my pages/hour, but if
OCRing it could be possible, the rate of data entry would be much greater.
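The estimate above is easy to redo for other typing speeds. A small sketch, using only the page counts quoted in this post; the function name and the alternative rate are assumptions for illustration.

```python
def hours_to_type(pages: int, pages_per_hour: float) -> float:
    """Data-entry time, assuming every page is relevant and the rate is constant."""
    return pages / pages_per_hour

crc_pages, merck_pages = 2608, 2564

print(hours_to_type(crc_pages, 5))    # 521.6 hours, the "about 520" figure
print(hours_to_type(merck_pages, 5))  # 512.8 hours
print(hours_to_type(crc_pages, 2))    # 1304.0 hours if 5 pages/hour is too optimistic
```

The total scales inversely with the rate, so the pages-per-hour guess matters far more than the exact page count.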
I'm trying to think of a useful, innovative way to represent chemical data online, and I'm going through ideas at a great rate.
One thought is that a refinement of this system is to create a database of syntheses alone. Each "node" in the database would be for either
a single chemical, or for a class of chemicals. The node would also contain the precursors necessary, links to references if available (locally
hosted copies if possible) and maybe a rating on suitability for industry, the amateur lab, professionals. This would then be "very
searchable", so you could look up "product X which uses Z and Y and a maximum of 2 other chemicals, not counting organic solvents," for
instance.
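A minimal sketch of what such a node and query might look like. The field names, the `search` function and the sample entries are assumptions for illustration, with placeholder letters instead of real compounds. Organic solvents are stored in their own field so they do not count towards the "other chemicals" limit.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

@dataclass
class SynthesisNode:
    product: str                                          # a chemical or a class of chemicals
    precursors: Set[str]                                  # reagents, excluding solvents
    solvents: Set[str] = field(default_factory=set)       # organic solvents, not counted
    references: List[str] = field(default_factory=list)   # links or local copies
    suitability: Set[str] = field(default_factory=set)    # e.g. {"amateur", "industry"}

def search(nodes: Iterable[SynthesisNode], must_use: Set[str],
           max_other: int = 2) -> List[SynthesisNode]:
    """Syntheses that use every chemical in `must_use` and at most
    `max_other` further precursors (solvents are ignored)."""
    must = set(must_use)
    return [n for n in nodes
            if must <= n.precursors and len(n.precursors - must) <= max_other]

nodes = [
    SynthesisNode("X", {"Y", "Z", "Q"}, solvents={"ether"}),
    SynthesisNode("W", {"Y", "Z", "P", "Q", "R"}),
    SynthesisNode("V", {"Y", "Q"}),
]
print([n.product for n in search(nodes, {"Y", "Z"})])
# ['X']
```

Here "W" is rejected for needing three other chemicals and "V" for not using Z. A suitability filter would be one more condition in the same list comprehension.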
The idea here isn't to make it complete, it's to be just like other references which detail a large number of syntheses, except larger and
easier to access.
This is irrelevant trivia, but Hermes mentioned Webster, the person.... I found out yesterday, that the whole Webster family (of lexicographic fame)
own summer houses around the town where I live... they were having a wedding recently at a resort within sight of my house, and they booked up the
entire (5 star, they claim) resort for a week.
I think that I'll (for now) go with the suggestions of something smaller, and try to make something at least a bit useful, as fast as possible.
|
|
Organikum
resurrected
Posts: 2339
Registered: 12-10-2002
Location: Europe
Member Is Offline
Mood: frustrated
|
|
Hmmm
That's a secret project of mine... LOL
No time to go into details now, but I believe I worked out an interesting way to solve the acquisition and copyright problems.
Based on a system of mutual help and storing the results.
Later more.
ORG
|
|
Eliteforum
National Hazard
Posts: 571
Registered: 18-11-2002
Location: United Kingdom
Member Is Offline
Mood: Enjoying the journey
|
|
Oh! You tease!
All that glitters isn't gold.
|
|
sarcosuchus
Harmless
Posts: 25
Registered: 16-9-2004
Location: in server hell
Member Is Offline
Mood: sleepy
|
|
I for one think that the idea has some merit, but the point about it taking years to complete is, if anything, understated. The rapid growth in science
would require near-daily updating of the data; just to keep up would take a lifetime, and a dozen T-1 lines. But the idea of a program to digest, say, PDF files is I think much more practical in the long term. What would be nice
is if, say, we had 4-5 code geeks, caffeine IVs and a couple of real hot babes that would give it their all in the name of science; in two or three months
we would be set.
One thing that we all can do till there is this superchem9000 is collect all the ebooks/programs and any useful web-related stuff we can find to feed
into this database... I for one can't wait till there are 50 gig DVD burners, if nothing else to clean up the clutter.
famous last words: "hold my beer and watch this"
|
|
solo
International Hazard
Posts: 3975
Registered: 9-12-2002
Location: Estados Unidos de La Republica Mexicana
Member Is Offline
Mood: ....getting old and drowning in a sea of knowledge
|
|
Oxydro ...........it's four years later ....what have you done or accomplished .........solo
It's better to die on your feet, than live on your knees....Emiliano Zapata.
|
|
|