Sciencemadness Discussion Board
Author: Subject: Post #2000 and an apology for accidentally deleting a thread
Melgar
Anti-Spam Agent
*****




Posts: 2004
Registered: 23-2-2010
Location: Connecticut
Member Is Offline

Mood: Estrified

posted on 13-12-2018 at 17:26
Post #2000 and an apology for accidentally deleting a thread


I was trying to think of something clever to do for my 2000th post, but then BotKilla accidentally deleted the "detecting Hg in street drugs" thread. So I figured I owed everyone an explanation of how that happened and what I've done to make sure that it doesn't happen again.

When new threads are created, they're assigned thread ids in sequential order, so a new thread will always have the largest thread id of any thread ever created. To make it impossible for BotKilla to accidentally delete an old thread, there was a hard-coded cutoff constant: BotKilla would totally ignore any thread with a thread id below this number. This worked well enough for preserving old threads that are part of the SM legacy, but because the number was a constant, it didn't increase over time. When I started running BotKilla, thread ids were typically around 96000. I raised the thread cutoff once, manually, to 99000. But raising it over time to account for increasing thread ids wasn't a very high priority, and so threads created after 99000 could be deleted under certain unlikely circumstances. That's basically what happened to the "detecting Hg in street drugs" thread: a bunch of unrecognized links were in it, and it got flagged as spam.
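As a rough illustration of the cutoff described above (this is not BotKilla's actual code; `THREAD_ID_CUTOFF`, `is_eligible_for_deletion`, and the sample thread ids are made up for this sketch), a fixed constant protects old threads but leaves every thread above it exposed to a false positive:

```python
# Hypothetical sketch of a hard-coded cutoff; all names are illustrative only.
THREAD_ID_CUTOFF = 99000  # the constant after it was raised once by hand


def is_eligible_for_deletion(thread_id: int, flagged_as_spam: bool) -> bool:
    """Return True only for flagged threads at or above the fixed cutoff."""
    if thread_id < THREAD_ID_CUTOFF:
        return False  # legacy thread: always ignored, even if flagged
    return flagged_as_spam


# An old thread is safe even if the spam check misfires on it.
print(is_eligible_for_deletion(45000, flagged_as_spam=True))   # False
# A legitimate thread created after the cutoff is not protected.
print(is_eligible_for_deletion(105000, flagged_as_spam=True))  # True
```

The second call shows the gap: as new thread ids climb past the constant, more and more legitimate threads fall into the unprotected range.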

Thread ids for new threads are now around 112000. This means that BotKilla has killed over 15,000 spambot-created threads. About 99.8% of new threads are spam threads, with about 12 created per hour, or one every 5 minutes. So clearly there is plenty of spam to automatically sort through, and the script does seem to pick up almost all of it. However, when I noticed that this legitimate thread got deleted, I made two changes: a) removed the code that penalized a post for each unrecognized link beyond the first one, and b) changed the hard-coded constant to a number that's automatically incremented and can only ever increase over time.
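The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes every thread id between the two quoted values corresponds to one created thread, and the variable names are invented here.

```python
# Figures quoted in the post; the variable names are assumptions.
threads_per_hour = 12
ids_at_start = 96000   # typical thread id when BotKilla started running
ids_now = 112000       # typical thread id at the time of the post
spam_fraction = 0.998  # share of new threads said to be spam

minutes_between_threads = 60 / threads_per_hour
ids_issued = ids_now - ids_at_start
estimated_spam = round(ids_issued * spam_fraction)

print(minutes_between_threads)  # 5.0
print(ids_issued)               # 16000
print(estimated_spam)           # 15968
```

Under that assumption the estimate is consistent with "over 15,000" spam threads removed.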

So a few key points:

  • Old threads (with thread ids below 99000) were never in any danger of getting deleted.
  • A more recent thread was accidentally deleted, because it had been started after the code was developed, and had a thread id above the cutoff constant.
  • The cutoff number is not a constant anymore, but it's initialized at 112000 and is increased over time in response to increasing thread ids of new threads. This means that only very new threads can be automatically deleted now, once the script is up and running.
  • Posting a lot of unrecognized links at once isn't penalized as much as before, which might lead to a slight increase in uncaught spam.
  • The script now restarts itself when it crashes, which makes things a lot more convenient for everyone, but also makes it less likely for me to notice bugs.
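One way to read the self-raising cutoff in the list above is as a ratchet that follows the newest thread id and never moves down. The sketch below is an assumption about how that could look, not the script itself: `RatchetCutoff`, the `window` parameter, and the sample ids are all hypothetical.

```python
class RatchetCutoff:
    """Cutoff that trails the newest thread id and can only increase."""

    def __init__(self, initial: int = 112000, window: int = 50):
        self.value = initial  # starting value quoted in the post
        self.window = window  # hypothetical: how many newest ids stay deletable

    def observe(self, newest_thread_id: int) -> int:
        """Raise the cutoff if the newest id has moved far enough ahead."""
        candidate = newest_thread_id - self.window
        if candidate > self.value:
            self.value = candidate
        return self.value

    def protects(self, thread_id: int) -> bool:
        """Threads below the cutoff are never considered for deletion."""
        return thread_id < self.value


cutoff = RatchetCutoff()
cutoff.observe(112500)            # cutoff rises to 112450
cutoff.observe(112100)            # a lower id never pulls it back down
print(cutoff.value)               # 112450
print(cutoff.protects(112000))    # True
print(cutoff.protects(112460))    # False
```

Because the value is reset to its initial number on startup, protection for threads just above 112000 only applies once the script has seen newer ids, which matches the "once the script is up and running" caveat.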

At this point, I don't think that it's an option to stop the script, and I think most people would agree. So I've restarted it with the above code changes. Sorry about accidentally deleting that thread. I can probably recover parts of the thread with some effort, but it wasn't very long, and the consensus seemed to be that you can buy mercury test strips for testing groundwater and paint and such, and that using those test strips is probably the way to go.

Anyway, I'm open to any feedback, and hope the 15,000 automatically-deleted spam threads were worth losing a few recently-started threads. There have only been three that I'm aware of, and measures are in place such that none of those three would be deleted if they were posted again.




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 13-12-2018 at 18:07


Why don't you use support vector machines or naive Bayes or some other machine-learning method for flagging spam?



clearly_not_atara
International Hazard
*****




Posts: 2786
Registered: 3-11-2013
Member Is Offline

Mood: Big

[*] posted on 13-12-2018 at 18:16


Machine learning is a pain in the ass



Quote: Originally posted by bnull  
you can always buy new equipment but can't buy new fingers.
fusso
International Hazard
*****




Posts: 1922
Registered: 23-6-2017
Location: 4 ∥ universes ahead of you
Member Is Offline


[*] posted on 13-12-2018 at 18:17


Why not make a post flaggable only once, so that it becomes undeletable once it has been scanned and determined to be genuine, like res judicata in law (you can't sue someone for the same thing again if they've been proven innocent)?

[Edited on 181214 by fusso]




JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 13-12-2018 at 18:25


Quote: Originally posted by clearly_not_atara  
Machine learning is a pain in the ass


Not really... there are dozens of open-source packages for doing it... that's just how professional antispam software works.




BromicAcid
International Hazard
*****




Posts: 3244
Registered: 13-7-2003
Location: Wisconsin
Member Is Offline

Mood: Rock n' Roll

[*] posted on 13-12-2018 at 18:32


Quote: Originally posted by Melgar  
This means that BotKilla has killed over 15,000 spambot-created threads. About 99.8% of new threads are spam threads, with a new one being created about 12 times per hour, or every 5 minutes.


Wow, thank you....

Did I mention thank you?




Shamelessly plugging my attempts at writing fiction: http://www.robvincent.org
Phosphor-ing
Hazard to Others
***




Posts: 246
Registered: 31-5-2006
Location: Deep South, USA
Member Is Offline

Mood: Inquisitive

[*] posted on 13-12-2018 at 19:29


Wow, I didn't realize we received that many spam posts every day. Thank you for all you do.



"The nine most terrifying words in the English language are: 'I'm from the government and I'm here to help.'" -Ronald Reagan
Melgar
Anti-Spam Agent
*****




Posts: 2004
Registered: 23-2-2010
Location: Connecticut
Member Is Offline

Mood: Estrified

[*] posted on 13-12-2018 at 19:57


Quote: Originally posted by fusso  
Why not make a post only flaggable once, and will become undeletable once it has been scanned once and determined to be genuine, like res judicata in laws (you can't sue someone for the same thing again, if proven innocent)?

[Edited on 181214 by fusso]

I actually had such a system, but it started with an empty array and added threads to it over time. The thing was, that array would have its contents reset whenever the script was restarted. The problem arose from threads that had been started after the cutoff constant but before the current iteration of the script started running.
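For illustration only (this is not the code BotKilla runs), the reset problem goes away if the list of already-cleared threads is written to disk and reloaded on startup. Everything named here, including `ClearedThreads`, the JSON file, and the sample thread id, is a hypothetical choice for the sketch.

```python
import json
import os
import tempfile


class ClearedThreads:
    """Set of thread ids already judged genuine, persisted across restarts."""

    def __init__(self, path: str):
        self.path = path
        self.ids = set()
        if os.path.exists(path):
            with open(path) as fh:
                self.ids = set(json.load(fh))  # reload what earlier runs saved

    def mark_genuine(self, thread_id: int) -> None:
        self.ids.add(thread_id)
        # Write to a temporary file, then rename it into place, so a crash
        # mid-write cannot leave a truncated file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(sorted(self.ids), fh)
        os.replace(tmp, self.path)

    def __contains__(self, thread_id: int) -> bool:
        return thread_id in self.ids


path = os.path.join(tempfile.mkdtemp(), "cleared.json")
store = ClearedThreads(path)
store.mark_genuine(105000)

restarted = ClearedThreads(path)  # simulates the script starting again
print(105000 in restarted)        # True
```

An in-memory array behaves like `store` alone; the second object shows what a restart would see once the set is saved.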

As far as why I didn't use a third-party Bayesian filter library or whatever, that's easy: it wouldn't have been anywhere close to as accurate. The spam posts were designed to thwart common Bayesian filter algorithms, which only look at contents, and nothing else. By including user registration data, user post count, and checking link domains against a whitelist, accuracy got to about 1% false negatives and 0.03% false positives. That's better than my GMail's spam filter, for both figures.
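A minimal sketch of how signals like these might be combined is shown below. It is not the actual scoring: the weights, the threshold, the whitelist contents, and the `Post` fields are assumptions, and only the first unrecognized link adds to the score, following the change described earlier in the thread.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical whitelist, weights, and threshold, for illustration only.
LINK_WHITELIST = {"sciencemadness.org", "wikipedia.org"}
SPAM_THRESHOLD = 4


@dataclass
class Post:
    author_post_count: int
    author_account_age_days: int
    links: list = field(default_factory=list)


def domain(url: str) -> str:
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def spam_score(post: Post) -> int:
    score = 0
    if post.author_post_count == 0:
        score += 2  # first-ever post by this account
    if post.author_account_age_days < 1:
        score += 1  # account registered within the last day
    # Any number of unrecognized links counts once, not once per link.
    if any(domain(url) not in LINK_WHITELIST for url in post.links):
        score += 2
    return score


new_account = Post(0, 0, ["http://cheap-pills.example/buy"])
regular = Post(2000, 3000, [
    "https://www.sciencemadness.org/whisper/",
    "http://a.example/paper",
    "http://b.example/data",
])

print(spam_score(new_account), spam_score(new_account) >= SPAM_THRESHOLD)  # 5 True
print(spam_score(regular), spam_score(regular) >= SPAM_THRESHOLD)          # 2 False
```

The point of the example is that account data can separate the two posts even when both contain unfamiliar links, which a content-only filter would not see.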




The first step in the process of learning something is admitting that you don't know it already.

I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 13-12-2018 at 22:26


Quote: Originally posted by Melgar  
That's better than my GMail's spam filter, for both figures.


I call bullshit.

https://techcrunch.com/2017/05/31/google-says-its-machine-le...





Tsjerk
International Hazard
*****




Posts: 3032
Registered: 20-4-2005
Location: Netherlands
Member Is Offline

Mood: Mood

[*] posted on 13-12-2018 at 23:46


You are great Melgar!
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 14-12-2018 at 00:23


Quote: Originally posted by Tsjerk  
You are great Melgar!


I have very sound and logical reasons for calling bullshit, but whether or not what Melgar is saying is true is immaterial to the question of whether machine learning could improve an existing spam detection algorithm, and Melgar knows it, or he should know it. Please, let's set the blatant self-promotion aside.

Melgar, do you understand how machine learning could improve your algorithm, or not?




Tsjerk
International Hazard
*****




Posts: 3032
Registered: 20-4-2005
Location: Netherlands
Member Is Offline

Mood: Mood

[*] posted on 14-12-2018 at 03:43


You are one angry bugger, aren't you?
mayko
International Hazard
*****




Posts: 1218
Registered: 17-1-2013
Location: Carrboro, NC
Member Is Offline

Mood: anomalous (Euclid class)

[*] posted on 14-12-2018 at 06:35


Thanks for keeping the front page clean Melgar!



al-khemie is not a terrorist organization
"Chemicals, chemicals... I need chemicals!" - George Hayduke
"Wubbalubba dub-dub!" - Rick Sanchez
woelen
Super Administrator
*********




Posts: 8012
Registered: 20-8-2005
Location: Netherlands
Member Is Offline

Mood: interested

[*] posted on 14-12-2018 at 12:54


The loss of that thread is a small accident and I appreciate very much that Melgar has been honest about this and let us know about it. I fully accept Melgar's apology.
Melgar's work is of great value for Sciencemadness. Without it the system would be next to useless and we would hardly be able to work with the forums anymore.

Such an error can happen. Good that the scripts are improved and that the chance of accidental deletion of a legit thread is further reduced.

@JJay: Being a little more constructive in your communication would be nice. Maybe with machine learning one could do even better, but the big difference between your words and Melgar's words is that he is showing a working system (in which he put a lot of effort) and lets us enjoy it for free, while you just have angry words and have done no real work.




The art of wondering makes life worth living...
Want to wonder? Look at https://woelen.homescience.net
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 14-12-2018 at 16:54


woelen: In response to my questions, Melgar gave a false and self-promoting excuse that wouldn't even logically excuse him if it were true. I will not assist in this, but are you really incapable of seeing how trivial it would be to use machine learning to improve Melgar's system?

Without Melgar's scripts, someone else would have put scripts into place. His scripts are a sum zero improvement to the board. A half dozen people here could have done better. I'm not saying that absolutely nothing would have been deleted, but we would be better off without Melgar's scripts.

I understand that you'd probably prefer that I work with Melgar on this, and I've considered it, but I consider the legal risks unacceptable. Who knows... maybe Melgar could say something to set my mind at ease. But I doubt it.




fusso
International Hazard
*****




Posts: 1922
Registered: 23-6-2017
Location: 4 ∥ universes ahead of you
Member Is Offline


[*] posted on 14-12-2018 at 17:01


@JJ maybe he hates AI stuff? I don't think it's a problem to hate AI.



JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 14-12-2018 at 17:52


So let's say I come up with a working system, and it's better than Melgar's. Will it replace his? Seriously.



phlogiston
International Hazard
*****




Posts: 1379
Registered: 26-4-2008
Location: Neon Thorium Erbium Lanthanum Neodymium Sulphur
Member Is Offline

Mood: pyrophoric

[*] posted on 14-12-2018 at 19:03


Quote: Originally posted by JJay  
Without Melgar's scripts, someone else would have put scripts into place. His scripts are a sum zero improvement to the board. A half dozen people here could have done better. I'm not saying that absolutely nothing would have been deleted, but we would be better off without Melgar's scripts.

I understand that you'd probably prefer that I work with Melgar on this, and I've considered it, but I consider the legal risks unacceptable. Who knows... maybe Melgar could say something to set my mind at ease. But I doubt it.


So, Melgar actually did the work, voluntarily. Moreover, he did a very good job. You, on the other hand, are only complaining about how worried you are about 'legal risks'.
Frankly, the forum was becoming unusable due to spam, and Melgar's script pretty much saved it.

Perhaps indeed 'half a dozen people could have done better'. But they didn't. Melgar did.

Also, Melgar was transparent and frank about accidentally deleting that thread, and I strongly feel that some collateral damage is acceptable in the battle against spam. I fully trust that with time, he'll be able to improve his algorithms (using AI or not) so that they delete almost nothing but spam.




-----
"If a rocket goes up, who cares where it comes down, that's not my concern said Wernher von Braun" - Tom Lehrer
Vomaturge
Hazard to Others
***




Posts: 286
Registered: 21-1-2018
Member Is Offline

Mood: thermodynamic

[*] posted on 14-12-2018 at 19:54


Can't speak for the mods, but if you made a system which was very obviously better than Melgar's, I think it would be accepted.

For now, BotKilla has spoiled us as far as providing a low-spam forum. Thank you, Melgar!

It is also worth noting that at least one thread (the old "everyday chemistry" thread) was deleted for some reason prior to BotKilla's startup.

Would machine learning make a better spam filter? Can't say, since I don't have the tech knowledge to know the practical limits of such a program. Is Melgar somehow wrong for not using it? No, he chose a different approach, and applied it competently with good results. Did he show poor judgement in not using machine learning? No, there wasn't a highly obvious case for it based on the trials he did do.

Should we keep our minds open to new solutions to the barrage of spam which was (still is) flowing towards the forum? Absolutely! Everyone should feel free to share their own ideas for improvements, so long as they aren't upset if their proposals get turned down.
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 14-12-2018 at 22:48


So, I reiterate, and please, unless you are a mod, hold your peace:

If I write a superior spam-fighting program, will you use it instead of Melgar's?




j_sum1
Administrator
********




Posts: 6320
Registered: 4-10-2014
Location: At home
Member Is Offline

Mood: Most of the ducks are in a row

[*] posted on 14-12-2018 at 23:40


JJay, we are not about to install competing systems. What Melgar has done is working remarkably well within considerable constraints. He does not have access to the back end of the system and he has devised something that works. Could something similar be done via a machine learning system? Probably. Would it be an improvement? Maybe a marginal one. It is tough to improve on something that is catching most spam within minutes and has made so few errors. The ultimate solution awaits a new platform.

In summary, you may well be right on technical details. But your approach to this discussion has been counterproductive. I don't think it wise to take down what Melgar has done just so that you can have your play in the sandbox.
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 15-12-2018 at 00:16


My approach to this discussion was not counterproductive. Melgar's approach to this discussion was dishonest and counterproductive. I am extremely disappointed with the leadership of this forum, and I will be leaving.




Metacelsus
International Hazard
*****




Posts: 2539
Registered: 26-12-2012
Location: Boston, MA
Member Is Offline

Mood: Double, double, toil and trouble

[*] posted on 15-12-2018 at 00:20


@JJay: would you consider collaborating with Melgar to improve the spam filter using machine learning? Spam filters don't have to be solo projects.



As below, so above.

My blog: https://denovo.substack.com
JJay
International Hazard
*****




Posts: 3440
Registered: 15-10-2015
Member Is Offline


[*] posted on 15-12-2018 at 00:25


No, not a chance, absolutely not. As a U.S. citizen, I will not be a part of a forum where Melgar is running things. This is nothing personal against Melgar; seriously, I don't dislike you, Melgar. There are certain individuals I won't work with, though, and Melgar is one of those individuals. Sorry, that's just how it is.



Loptr
International Hazard
*****




Posts: 1348
Registered: 20-5-2014
Location: USA
Member Is Offline

Mood: Grateful

[*] posted on 15-12-2018 at 08:33


I think this has to do with the posts that Melgar made about his family's accusations of drug synthesis, and with not wanting someone who put all of that out there so publicly to be even remotely linked to the direction of the forum.



"Question everything generally thought to be obvious." - Dieter Rams