Sciencemadness Discussion Board
Author: Subject: Chemical Price Comparison Software in the making..
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 11-2-2025 at 12:46


Quote: Originally posted by bnull  
Sites may change from time to time, the code changing along with them. I recommend you inspect them occasionally to make sure your code is not outputting something else. Color codes, for example, or gibberish.


I was going to suggest the same thing. Since each site would have its own module, there could be a verify_continuity() method that makes sure the elements that are usually there are still present. Could also work for REST endpoints but that's probably less of an issue.
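To make that concrete, here's a rough sketch of what I mean. The module layout, selector list, and function name are all made up for illustration; I'm just using BeautifulSoup since that's what we're already working with:

```python
from bs4 import BeautifulSoup

# Selectors this (imaginary) supplier module relies on.
EXPECTED_SELECTORS = [".product-title", ".price", "#search-results"]

def verify_continuity(html: str, selectors=EXPECTED_SELECTORS) -> list[str]:
    """Return the selectors that no longer match anything in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]

# A stripped-down page that's missing its price element:
sample = '<div id="search-results"><span class="product-title">Zinc</span></div>'
print(verify_continuity(sample))  # ['.price']
```

Each supplier module would declare its own selector list, and a scheduled run of this check would flag when a site redesign breaks the scraper before it starts outputting garbage.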
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 11-2-2025 at 20:35


Something else to think about: I like your idea of using the sitemap.xml, but it's worth noting that the sitemap file can be named pretty much anything. Usually, though, the sitemap's filename is listed in robots.txt (which does need to reference it for the sitemap to be found by search engines).

Using Carolina Chemical as an example again: if you look at their robots.txt, at the very bottom you'll see:
Quote:
Sitemap: https://www.carolina.com/cbs-site-index.xml

You can visit this in your browser and see that there are several sitemap files, one of which, cbs-product-sitemap.xml, looks pretty useful. You can visit that in your browser as well (viewing it in your developer tools' Elements pane makes it a lot easier to read, btw). But it only seems to contain the link, update frequency, and priority for each URL. It would be more helpful if it included the last time each page was updated, but I don't see that anywhere.
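Pulling the sitemap URL(s) out of robots.txt is simple enough to sketch. The robots.txt body below is a trimmed stand-in, not Carolina's full file:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line (case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# Trimmed stand-in for Carolina's robots.txt:
robots = """User-agent: *
Disallow: /checkout/
Sitemap: https://www.carolina.com/cbs-site-index.xml"""

print(sitemap_urls(robots))  # ['https://www.carolina.com/cbs-site-index.xml']
```

Note the partition on the first colon only, since the URL itself contains "://".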

But the tricky part is that some websites don't want you to access them through anything but an actual browser. If you curl the sitemap file from the command line, you'll get back a default response saying you need to enable JavaScript to view the page. This is usually checked on the server side by looking at your User-Agent value and making sure it's a valid browser. It can typically be circumvented by viewing the request in your browser's Network tab, right-clicking on it, and selecting "Copy as cURL". That copies every header, cookie, and parameter that was used to make the request from your browser...
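The User-Agent part of that can be sketched with the requests library. The header values below are illustrative, not an exact browser fingerprint, and the network call is left commented out:

```python
import requests  # third-party

def browser_headers() -> dict:
    """Browser-like headers; values are illustrative, not an exact fingerprint."""
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

def fetch(url: str) -> str:
    """GET a page while pretending to be a normal browser."""
    resp = requests.get(url, headers=browser_headers(), timeout=30)
    resp.raise_for_status()
    return resp.text

# Usage (hits the network, so not run here):
# xml_text = fetch("https://www.carolina.com/cbs-site-index.xml")
```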

Only, it doesn't seem to work for Carolina:

If you look at the responses, you'll see it sets some data in a dd variable, then loads a JavaScript file which seems to be doing the cloaking. I see it's doing something with an iframe... but I would think that if there were an iframe, I'd be able to see it.

Point is: some websites are going to have roadblocks like this that will slow things down quite a bit. So expect some unexpected fun along the way. lol

P.S. Carolina left developer keys in the open (just look for "key"), lol. Hopefully those aren't actual access keys.

P.P.S. I'm noticing for Carolina.com that if I open a new incognito window and go to their website, it first prompts for a captcha. I'm not sure how that could be circumvented from the command line. I hope not too many sites are like this.

[Edited on 12-2-2025 by SuperOxide]
Maui3
Hazard to Others
***




Posts: 139
Registered: 9-9-2024
Member Is Offline


[*] posted on 13-2-2025 at 02:51


Thank you all, this sounds good!

SuperOxide, it looks very nice with S3 Chemicals and Laboratoriumdiscounter, but for Onyxmet and the BeautifulSoup part, it says this?

I presume this isn't great, lol.
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 03:56


Quote: Originally posted by Maui3  
Thank you all, this sounds good!

SuperOxide, it looks very nice with S3 Chemicals and Laboratoriumdiscounter, but for Onyxmet and the BeautifulSoup part, it says this?

I presume this isn't great, lol.

I don't get any of those. Can you still visit the website in your browser? If not, they've blocked you for sending too many suspicious requests. You'll probably be unblocked in a day or so. I'm guessing they're using some logic to detect that you're crawling the website/APIs (which you are) and blocking you.

Try doing what I did earlier (in this video):

  1. Open up developer tools in your browser (in Chrome/Brave, it's at View > Developer > Developer Tools)
  2. Go to the Network tab and click Preserve Log
  3. In the same window, try to perform an action that triggers the call you want to make in your app. (This is easier if you filter it to only show Fetch/XHR and Doc requests.)
  4. When you see the XHR or Doc call in the Network tab, right click on it > Copy > Copy as cURL
  5. Go to curlconverter, paste the cURL request, and select Python. You should end up with something like this.
  6. Copy/paste that into a local script. Remove any session-specific cookies. Run it.
  7. Run BeautifulSoup on the resulting output

Let me know if that works out.
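For reference, steps 5 through 7 usually end up looking something like this. The Onyxmet search URL and the h4/a selector here are guesses on my part; your actual curlconverter output will have the real URL, headers, and cookies:

```python
import requests  # third-party; curlconverter output assumes it
from bs4 import BeautifulSoup

# Headers copied from your browser via "Copy as cURL" + curlconverter
# (trimmed here; yours will have the full set).
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml",
}

def search_titles(html: str) -> list[str]:
    """Step 7: run BeautifulSoup over the response body."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("h4 a")]

# Live call (commented out so nothing hits the network here; the URL is
# a guess, use whatever your copied request actually targets):
# response = requests.get("https://onyxmet.com/index.php?route=product/search&search=zinc",
#                         headers=headers)
# print(search_titles(response.text))

# Offline check against a canned fragment of a results page:
sample = '<div class="caption"><h4><a>Zinc acetate dihydrate 100g</a></h4></div>'
print(search_titles(sample))  # ['Zinc acetate dihydrate 100g']
```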

Once that works, see if you can take a shortcut and just use a Python library that spoofs a browser, such as curl_cffi.

And if you send _too_ many curl requests, they may block you as a spammer or a bot. I'd recommend making the few requests you need and saving the responses to local files. Then, in your code, stub (mock) the responses from the remote server by just importing/including the downloaded file, and continue on as if it were a real response. Once your code is ready to be tested, remove the stub and run it.
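A minimal version of that stub/cache idea (the filename and function name are just placeholders):

```python
from pathlib import Path

import requests  # third-party

def fetch_cached(url: str, cache_file: str) -> str:
    """Return the cached body if present; otherwise fetch once and save it."""
    path = Path(cache_file)
    if path.exists():
        return path.read_text(encoding="utf-8")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    path.write_text(resp.text, encoding="utf-8")
    return resp.text

# First run hits the site and writes the file; every run after that
# replays the saved response, so you can develop without hammering them.
# Delete the file (or bypass this helper) when you want a fresh request.
# html = fetch_cached("https://onyxmet.com/...", "onyxmet_search.html")
```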

Feel free to share the code on Github so I can check it out.

P.S. Maybe we should move this to a private chat, so we don't keep filling up the thread with technical stuff and end up in detritus, lol. Create a Discord account if you don't have one and PM me your username.

[Edited on 13-2-2025 by SuperOxide]
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 04:52


I updated the gist to include a customized Python 3 script that sends a product search query using the CAS number on Onyxmet, then grabs the URL of the first result and curls that page for the actual product details. Try running BeautifulSoup on the product_response variable at the bottom.
Screenshot in the attachment:
Attachment: php0FeCXO (498kB)
This file has been downloaded 34 times

P.S. To the mods - why can I never get attached images to insert as images into the post body? I always have to upload to imgur then include from there >_<

[Edited on 13-2-2025 by SuperOxide]
bnull
National Hazard
****




Posts: 596
Registered: 15-1-2024
Location: Home
Member Is Offline

Mood: Sneezing like there's no tomorrow. Stupid cat allergy.

[*] posted on 13-2-2025 at 05:03


Use HTTrack to save a copy of the site. You don't need the pictures, only the code, and HTTrack will preserve the structure.

I suppose that, as long as the discussion is going somewhere, there's no risk of it going to Detritus. It has been interesting, at least.

One more thing: the main advantage of it being a public discussion is that other programmers may chip in and offer suggestions, something that is not so easy in private. A middle ground would be moving it to a member-only area. In any case, only members can contribute to the discussion.




Quod scripsi, scripsi.

B. N. Ull

We have a lot of fun stuff in the Library.

Read The ScienceMadness Guidelines. They exist for a reason.
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 05:14


I updated the working_requests_example.py script to even get the price from the product page using BeautifulSoup.



Attachment: phpWiDJG0 (430kB)
This file has been downloaded 32 times

SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 05:17


Quote: Originally posted by bnull  
Use HTTrack to save a copy of the site. You don't need the pictures, only the code, and HTTrack will preserve the structure.


HTTrack looks like overkill. He doesn't want to crawl the entire website; he just wants to search for a product and then curl that page. It'll be 2 requests, maybe 3, sometimes even just 1.
bnull
National Hazard
****




Posts: 596
Registered: 15-1-2024
Location: Home
Member Is Offline

Mood: Sneezing like there's no tomorrow. Stupid cat allergy.

[*] posted on 13-2-2025 at 06:48


Never mind.



Quod scripsi, scripsi.

B. N. Ull

We have a lot of fun stuff in the Library.

Read The ScienceMadness Guidelines. They exist for a reason.
Maui3
Hazard to Others
***




Posts: 139
Registered: 9-9-2024
Member Is Offline


[*] posted on 13-2-2025 at 07:03


Great Superoxide!

I added the code you wrote for Onyxmet to mine, and now it works!
Here is a video:
https://i.imgur.com/7ITFNZl.mp4

I will send it to you soon, I just need to add more comments - which I am not very good at..

Also, for the laboratoriumdiscounter and S3 chemicals, I only use beautifulsoup to get the HTML..
We will also need to add a "quantity" for some of them, since it might not be specified in the title. For Laboratoriumdiscounter, sometimes it is, other times it isn't. They really make it easy for us, lol...
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 07:24


Quote: Originally posted by Maui3  
Great Superoxide!
I added the code you wrote for Onyxmet to mine, and now it works!
Here is a video:
https://i.imgur.com/7ITFNZl.mp4

Well done! I'm a big nerd, so I kinda prefer command-line tools like that to web clients or GUIs. But that's just me. It does look great though.

Quote: Originally posted by Maui3  
Great Superoxide!
I will send it to you soon, I just need to add more comments - which I am not very good at..

OK, did you create a GitHub account/repo yet? It should be very quick. It'll be a heck of a lot easier than sending zip files full of code back and forth, lol.

Quote: Originally posted by Maui3  

Also, for the laboratoriumdiscounter and S3 chemicals, I only use beautifulsoup to get the HTML..
We will also need to add a "quantity" for some of them, since they might not be specified in the title. For laboratoriumdiscounter, sometimes they do, other times they dont. They really make it easy for us, lol..

Honestly, even if the quantity is in the title, I would only use it as a last resort. That's likely an unstandardized format that isn't reliably parsable. I would always go with grabbing the real value out of the page/response somehow. It shouldn't be too much more difficult than finding the price, but you need to find the quantity element relative to the price location, if possible.
I realize on sites like Onyxmet you might not be able to, so you'd just have to grab it from the h3.product-title > a element, but let's test that out first.

Search on Onyxmet for something, anything that will bring up some items (I searched for Zinc), and go to the search results page. Then open up your Javascript console (in devtools), and paste this:
Quote:

console.log(Array.from(document.querySelectorAll('.product-details > .caption > h4 > a')).map(elem => elem.innerHTML).join('\n'))

If you look through that list, you'll see what I mean by non-standardized values. A few of the odd ones include:

  1. Zinc 99,995% - macro etched
  2. Zinc 99,999% - 129g SOLD!!!!
  3. Zinc 99,9999% - 17g SOLD!!!!!!!!!!!!
  4. Zinc acetate dihydrate 100g

You can see not all of these are in the same format. So what you would do is copy that list out of the console output, go to regex101.com, and try to come up with a pattern that matches as many as you can. Here's an example I just came up with. You can see it matches the product name, quantity, and even units into separate groups. It accounts for some of the odd characters and the unreliable format. Hope it helps.
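For example, here's a pattern along those lines worked into a quick parser. It's a first pass and will need tuning as more odd titles show up:

```python
import re

# Quantity + unit, tolerating "," as a decimal mark; longer units first.
QTY_RE = re.compile(r"(?P<qty>\d+(?:[.,]\d+)?)\s*(?P<unit>kg|mg|ml|g|l)\b", re.I)

def parse_title(title: str) -> dict:
    """Split a product title into name / quantity / unit where possible."""
    m = QTY_RE.search(title)
    if m is None:
        # No recognizable quantity, e.g. "Zinc 99,995% - macro etched".
        return {"name": title.strip(), "qty": None, "unit": None}
    name = title[:m.start()].rstrip(" -").strip()
    qty = float(m.group("qty").replace(",", "."))
    return {"name": name, "qty": qty, "unit": m.group("unit").lower()}

for t in ["Zinc 99,995% - macro etched",
          "Zinc 99,999% - 129g SOLD!!!!",
          "Zinc acetate dihydrate 100g"]:
    print(parse_title(t))
```

The percent sign after the purity figure is what keeps "99,999" from being mistaken for a quantity: the regex requires a unit letter right after the number.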

[Edited on 13-2-2025 by SuperOxide]
Maui3
Hazard to Others
***




Posts: 139
Registered: 9-9-2024
Member Is Offline


[*] posted on 13-2-2025 at 10:01


I have made a github repo for it now!

I don't know if it is possible, but can I add you as a co-owner? If that is a thing, lol.

That would mean we both could edit the code, right?

Also, I have not made the comments better, or structured the code a lot better yet, I'll do that, I just wanted to upload the code now.
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 13-2-2025 at 11:30


Quote: Originally posted by Maui3  

I don't know if it is possible, but can I add you as a co-owner? If that is a thing, lol.

I think if you open the repo's settings, click on Collaborators, and search for me (jhyland87), you should be able to add me. I haven't done that before either, though.
Maui3
Hazard to Others
***




Posts: 139
Registered: 9-9-2024
Member Is Offline


[*] posted on 14-2-2025 at 04:38


I have added you now as a collaborator.

Also, it was difficult to upload the code - the folders didn't upload.. I need to fix that..
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 14-2-2025 at 06:03


Quote: Originally posted by Maui3  
I have added you now as a collaborator.

Also, it was difficult to upload the code - the folders didn't upload.. I need to fix that..

That's very simple. Did you read/watch what I linked you to earlier? It should be like 5 minutes to read/watch, and you'll see what I mean.

Quote: Originally posted by SuperOxide  

But, first things first. Do you use Git (revision control)? If not, start using it. Sign up on Github.com (free), create a repository for this project, then install git on your local machine and commit some code to it. Start checking in your changes so you don't lose any of your files or anything.
- Install git
- Create a repo and add some code to it


In the 2nd link (starting at 0:29), you can see how he adds his local code to it.

Alternatively, you can download Visual Studio Code, which is what he's using in that video, and instead of doing it via the CLI, you can just select "Clone Git Repository", give it the repo URL, let it check it out, then add your code and commit/push.
Attachment: phplY2wEm (1.3MB)
This file has been downloaded 35 times

I have to work now, but maybe we can set up some time to hop on a video conference call or something and I can walk you through it.



[Edited on 14-2-2025 by SuperOxide]
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 24-2-2025 at 07:26


If anyone was interested in an update, Maui3 and I are still working on this project, and it's actually coming along very well. I'm helping organize the backend and add the basic supplier modules; he's doing the UI, fixing issues I missed in the supplier modules, and adding other functionality/logic.

Here's the repo for it: YourHeatingMantle/ChemPare.

For the supplier modules, so far we have:

  1. 3schem
  2. chemsavers
  3. esdrei (aka: S3 chem)
  4. laballey
  5. labchem
  6. laboratoriumdiscounter
  7. loudwolf
  8. onyxmet
  9. synthetika
  10. tcichemicals

How it works is that each supplier module extends the abstract base_module, which defines a common interface. So regardless of how messy the logic to get the data from a supplier is, it's all encapsulated behind that interface.

This allows us to use a factory pattern (search_factory.py) that handles executing the search for each module. It just looks at what supplier modules are in the folder, includes them, executes the search on each, and combines the results.
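A compressed sketch of that layout. The class and method names here are illustrative, not ChemPare's actual API, and the real factory discovers modules from the folder rather than hardcoding them:

```python
from abc import ABC, abstractmethod

class SupplierBase(ABC):
    """Common interface every supplier module implements."""
    @abstractmethod
    def search(self, query: str) -> list[dict]: ...

class SupplierOnyxmet(SupplierBase):
    def search(self, query: str) -> list[dict]:
        # A real module would scrape the supplier's site here.
        return [{"supplier": "onyxmet", "name": query, "price": 12.0}]

class SupplierLabchem(SupplierBase):
    def search(self, query: str) -> list[dict]:
        return [{"supplier": "labchem", "name": query, "price": 9.5}]

def search_all(query: str) -> list[dict]:
    """Factory: run every registered supplier's search and merge the hits."""
    results = []
    for cls in SupplierBase.__subclasses__():
        results.extend(cls().search(query))
    return results

print(search_all("zinc"))
```

The caller never cares how messy an individual supplier's scraping is; it only sees the combined list of result dicts.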

Still a bit to go before it's fully ready for people to try out, but it's definitely coming along great.
imidazole
Harmless
*




Posts: 28
Registered: 18-10-2014
Location: Massachusetts, USA
Member Is Offline

Mood: Radical

[*] posted on 6-3-2025 at 10:34
Room for another?


I'm an okay programmer, and I sent this post from Linux. Are you looking for more hands?
SuperOxide
National Hazard
****




Posts: 539
Registered: 24-7-2019
Location: Devils Anus
Member Is Offline


[*] posted on 9-3-2025 at 08:24


Quote: Originally posted by imidazole  
I'm an okay programmer, and I sent this post from Linux, are you looking for more hands?


Well, this is meant to be an open source project, so why not :-)
Just follow the How To Contribute doc on Github.

But basically, it'll be:

  1. Login to your Github account and go to the ChemPare repo
  2. Fork the repository (Github doc on how to do that)
  3. Go to your repo and select "Contribute" -> "Open Pull Request" (Doc on that)

I too am originally a Linux engineer, but I mostly do software development for work now. Hopefully you have some Python experience, but if not, this is a great project to learn on.