Melgar
Anti-Spam Agent
Posts: 2004
Registered: 23-2-2010
Location: Connecticut
Member Is Offline
Mood: Estrified
|
|
Post #2000 and an apology for accidentally deleting a thread
I was trying to think of something clever to do for my 2000th post, but then BotKilla accidentally deleted the "detecting Hg in street drugs" thread.
So I figured I owed everyone an explanation of how that happened and what I've done to make sure that it doesn't happen again.
When new threads are created, they're assigned thread ids in sequential order, so a new thread always has the largest thread id of any thread
ever created. To make it impossible for BotKilla to accidentally delete an old thread, there was a hard-coded cutoff
constant: BotKilla would totally ignore any thread with a thread id below this number. This worked well enough for preserving old threads that
are part of the SM legacy, but because the number was a constant, it didn't increase over time. When I started running BotKilla, thread ids were
typically around 96000. I raised the cutoff once, manually, to 99000. But raising it further to keep up with increasing thread ids wasn't a
very high priority, which meant that threads with ids above 99000 could be deleted under certain unlikely circumstances. That's basically what
happened to the "detecting Hg in street drugs" thread: it contained a bunch of unrecognized links and got flagged as spam.
Thread ids for new threads are now around 112000. This means that BotKilla has killed over 15,000 spambot-created threads. About 99.8% of new
threads are spam threads, with about 12 created per hour, or one every 5 minutes. So clearly there is plenty of spam to
automatically sort through, and the script does seem to pick up almost all of it. However, when I noticed that this legitimate thread got deleted, I
made two changes: a) removed the code that penalized a post for each unrecognized link beyond the first one, and b)
changed the hard-coded constant to a number that's automatically incremented and can only ever increase over time.
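A minimal sketch of the self-raising cutoff described above, for anyone curious how such a thing can be written. This is not BotKilla's actual code: the class name, the method names, and the `window` parameter (how many of the newest ids stay eligible for deletion) are assumptions made for illustration. Only the starting value of 112000 and the "can only go up" rule come from this post.

```python
class ThreadCutoff:
    """Tracks the lowest thread id that the script is allowed to delete."""

    def __init__(self, initial: int, window: int = 50) -> None:
        self.value = initial
        # Hypothetical: how far behind the newest thread id the cutoff trails.
        self.window = window

    def observe(self, newest_thread_id: int) -> int:
        """Raise the cutoff behind the newest id; never lower it."""
        self.value = max(self.value, newest_thread_id - self.window)
        return self.value

    def may_delete(self, thread_id: int) -> bool:
        """Threads with ids below the cutoff are ignored entirely."""
        return thread_id >= self.value


cutoff = ThreadCutoff(initial=112000)
cutoff.observe(112400)   # cutoff rises to 112350
cutoff.observe(112100)   # an older id never pulls the cutoff back down
```

Because `observe` takes the maximum of the old and new values, a restart or an out-of-order scan cannot expose older threads again.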
So a few key points:
- Old threads (with thread ids below 99000) were never in any danger of getting deleted.
- A more recent thread was accidentally deleted, because it had been started after the code was developed, and had a thread id above the
cutoff constant.
- The cutoff number is not a constant anymore, but it's initialized at 112000 and is increased over time in response to increasing thread ids of
new threads. This means that only very new threads can be automatically deleted now, once the script is up and running.
- Posting a lot of unrecognized links at once isn't penalized as much as before, which might lead to a slight increase in uncaught spam.
- The script now restarts itself when it crashes, which makes things a lot more convenient for everyone, but also makes it less likely for me to
notice bugs.
At this point, I don't think that it's an option to stop the script, and I think most people would agree. So I've restarted it with the above code
changes. Sorry about accidentally deleting that thread. I can probably recover parts of the thread with some effort, but it wasn't very long, and
the consensus seemed to be that you can buy mercury test strips for testing groundwater and paint and such, and that using those test strips is
probably the way to go.
Anyway, I'm open to any feedback, and hope the 15,000 automatically-deleted spam threads were worth losing a few recently-started threads. There have
only been three that I'm aware of, and measures are in place such that none of those three would be deleted if they were posted again.
The first step in the process of learning something is admitting that you don't know it already.
I'm givin' the spam shields max power at full warp, but they just dinna have the power! We're gonna have to evacuate to new forum software!
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
Why don't you use support vector machines or naive Bayes or some other machine-learning method for flagging spam?
|
|
clearly_not_atara
International Hazard
Posts: 2786
Registered: 3-11-2013
Member Is Offline
Mood: Big
|
|
Machine learning is a pain in the ass
|
|
fusso
International Hazard
Posts: 1922
Registered: 23-6-2017
Location: 4 ∥ universes ahead of you
Member Is Offline
|
|
Why not make a post flaggable only once, so that it becomes undeletable once it has been scanned and determined to be genuine, like res judicata in
law (you can't sue someone for the same thing again, if proven innocent)?
[Edited on 181214 by fusso]
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
Not really... there are dozens of open-source packages for doing it... that's just how professional antispam software works.
|
|
BromicAcid
International Hazard
Posts: 3244
Registered: 13-7-2003
Location: Wisconsin
Member Is Offline
Mood: Rock n' Roll
|
|
Quote: Originally posted by Melgar | This means that BotKilla has killed over 15,000 spambot-created threads. About 99.8% of new threads are spam threads, with about 12 created
per hour, or one every 5 minutes. |
Wow, thank you....
Did I mention thank you?
|
|
Phosphor-ing
Hazard to Others
Posts: 246
Registered: 31-5-2006
Location: Deep South, USA
Member Is Offline
Mood: Inquisitive
|
|
Wow, I didn't realize we received that many spam posts every day. Thank you for all you do.
"The nine most terrifying words in the English language are: 'I'm from the government and I'm here to help.'" -Ronald Reagan
|
|
Melgar
Anti-Spam Agent
Posts: 2004
Registered: 23-2-2010
Location: Connecticut
Member Is Offline
Mood: Estrified
|
|
Quote: Originally posted by fusso | Why not make a post flaggable only once, so that it becomes undeletable once it has been scanned and determined to be genuine, like res judicata in
law (you can't sue someone for the same thing again, if proven innocent)?
[Edited on 181214 by fusso] |
I actually had such a system, but it started with an empty array and added threads to it over time. The thing was, that array's contents
were reset whenever the script was restarted. The problem arose from threads that had been started after the cutoff constant but before the current
iteration of the script had started running.
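One way around the reset problem is to keep the list of already-cleared threads on disk instead of only in memory. The sketch below is an illustration under assumptions, not the actual script: the `ClearedThreads` class, its method names, and the JSON file format are all hypothetical.

```python
import json
from pathlib import Path


class ClearedThreads:
    """Set of thread ids judged genuine, persisted so restarts don't lose it."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.ids = set()
        # Reload whatever an earlier run of the script saved.
        if path.exists():
            self.ids = set(json.loads(path.read_text()))

    def mark_genuine(self, thread_id: int) -> None:
        """Record a thread as cleared and write the set back to disk."""
        self.ids.add(thread_id)
        self.path.write_text(json.dumps(sorted(self.ids)))

    def is_cleared(self, thread_id: int) -> bool:
        return thread_id in self.ids
```

With this, a thread that was scanned and cleared before a crash is still protected after the script restarts itself.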
As for why I didn't use a third-party Bayesian filter library or whatever, that's easy: it wouldn't have been anywhere close to as accurate. The
spam posts were designed to thwart common Bayesian filter algorithms, which only look at contents and nothing else. By including user registration
data and user post count, and by checking link domains against a whitelist, I got accuracy to about 1% false negatives and 0.03% false positives. That's
better than my Gmail spam filter, for both figures.
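To give a rough idea of what combining those signals can look like, here is a small sketch. Everything specific in it is an assumption for illustration: the `Post` fields, the whitelist entries, the weights, and the threshold are made up and are not the values BotKilla uses. The flat penalty for unrecognized links mirrors the change described earlier, where links beyond the first no longer add to the score.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical whitelist; the real list of trusted domains is not given here.
WHITELIST = {"sciencemadness.org", "wikipedia.org"}
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Post:
    body: str
    author_post_count: int
    author_age_days: int  # days since the author registered


def unrecognized_domains(body: str) -> set:
    """Return link hosts in the body that are not covered by the whitelist."""
    found = set()
    for url in URL_RE.findall(body):
        host = (urlparse(url.rstrip(".,;)")).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        trusted = any(host == d or host.endswith("." + d) for d in WHITELIST)
        if host and not trusted:
            found.add(host)
    return found


def spam_score(post: Post) -> int:
    """Add up simple signals; the weights are illustrative guesses."""
    score = 0
    if unrecognized_domains(post.body):
        score += 2  # flat penalty: extra unknown links add nothing more
    if post.author_post_count < 5:
        score += 1
    if post.author_age_days < 2:
        score += 1
    return score


def is_spam(post: Post, threshold: int = 3) -> bool:
    return spam_score(post) >= threshold
```

Under these made-up weights, an established member posting several unfamiliar links scores 2 and stays below the threshold, while a brand-new account doing the same scores 4 and is flagged.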
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
I call bullshit.
https://techcrunch.com/2017/05/31/google-says-its-machine-le...
|
|
Tsjerk
International Hazard
Posts: 3032
Registered: 20-4-2005
Location: Netherlands
Member Is Offline
Mood: Mood
|
|
You are great Melgar!
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
I have very sound and logical reasons for calling bullshit, but whether or not what Melgar is saying is true is immaterial to the question of whether
machine learning could improve an existing spam detection algorithm, and Melgar knows it, or he should know it. Please, let's set the blatant
self-promotion aside.
Melgar, do you understand how machine learning could improve your algorithm, or not?
|
|
Tsjerk
International Hazard
Posts: 3032
Registered: 20-4-2005
Location: Netherlands
Member Is Offline
Mood: Mood
|
|
You are one angry bugger, aren't you?
|
|
mayko
International Hazard
Posts: 1218
Registered: 17-1-2013
Location: Carrboro, NC
Member Is Offline
Mood: anomalous (Euclid class)
|
|
Thanks for keeping the front page clean Melgar!
al-khemie is not a terrorist organization
"Chemicals, chemicals... I need chemicals!" - George Hayduke
"Wubbalubba dub-dub!" - Rick Sanchez
|
|
woelen
Super Administrator
Posts: 8012
Registered: 20-8-2005
Location: Netherlands
Member Is Offline
Mood: interested
|
|
The loss of that thread is a small accident and I appreciate very much that Melgar has been honest about this and let us know about it. I fully accept
Melgar's apology.
Melgar's work is of great value for Sciencemadness. Without it the system would be next to useless and we would hardly be able to work with the forums
anymore.
Such an error can happen. Good that the scripts are improved and that the chance of accidental deletion of a legit thread is further reduced.
@JJay: Being a little more constructive in your communication would be nice. Maybe with machine learning one could do even better, but the big
difference between your words and Melgar's words is that he is showing a working system (in which he put a lot of effort) and lets us enjoy it for
free, while you just have angry words and have done no real work.
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
woelen: In response to my questions, Melgar gave a false and self-promoting excuse that wouldn't even logically excuse him if it were true. I will not
assist in this, but are you really incapable of seeing how trivial it would be to use machine learning to improve Melgar's system?
Without Melgar's scripts, someone else would have put scripts into place. His scripts are a sum zero improvement to the board. A half dozen people
here could have done better. I'm not saying that absolutely nothing would have been deleted, but we would be better off without Melgar's scripts.
I understand that you'd probably prefer that I work with Melgar on this, and I've considered it, but I consider the legal risks unacceptable. Who
knows... maybe Melgar could say something to set my mind at ease. But I doubt it.
|
|
fusso
International Hazard
Posts: 1922
Registered: 23-6-2017
Location: 4 ∥ universes ahead of you
Member Is Offline
|
|
@JJ maybe he hates AI stuff? I don't think it's a problem to hate AI.
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
So let's say I come up with a working system, and it's better than Melgar's. Will it replace his? Seriously.
|
|
phlogiston
International Hazard
Posts: 1379
Registered: 26-4-2008
Location: Neon Thorium Erbium Lanthanum Neodymium Sulphur
Member Is Offline
Mood: pyrophoric
|
|
Quote: Originally posted by JJay | Without Melgar's scripts, someone else would have put scripts into place. His scripts are a sum zero improvement to the board. A half dozen people
here could have done better. I'm not saying that absolutely nothing would have been deleted, but we would be better off without Melgar's scripts.
I understand that you'd probably prefer that I work with Melgar on this, and I've considered it, but I consider the legal risks unacceptable. Who
knows... maybe Melgar could say something to set my mind at ease. But I doubt it.
|
So, Melgar actually did the work, voluntarily. Moreover, he did a very good job. You, on the other hand, are only complaining about how worried you
are about 'legal risks'.
Frankly, the forum was becoming unusable due to spam, and Melgar's script pretty much saved it.
Perhaps indeed 'half a dozen people could have done better'. But they didn't. Melgar did.
Also, Melgar was transparent and frank about accidentally deleting that thread, and I strongly feel that some collateral damage is acceptable in the
battle against spam. I fully trust that with time, he'll be able to improve his algorithms (using AI or not) to delete almost nothing but spam.
-----
"If a rocket goes up, who cares where it comes down, that's not my concern said Wernher von Braun" - Tom Lehrer
|
|
Vomaturge
Hazard to Others
Posts: 286
Registered: 21-1-2018
Member Is Offline
Mood: thermodynamic
|
|
Can't speak for the mods, but if you made a system which was very obviously better than Melgar's, I think it would be accepted.
For now, BotKilla has spoiled us as far as providing a low-spam forum. Thank you, Melgar!
It is also worth noting that at least one thread (the old "everyday chemistry" thread) was deleted for some reason prior to BotKilla's startup.
Would machine learning make a better spam filter? Can't say, since I don't have the tech knowledge to know the practical limits of such a program. Is
Melgar somehow wrong for not using it? No, he chose a different approach, and applied it competently with good results. Did he show poor judgement in
not using machine learning? No, there wasn't a highly obvious case for it based on the trials he did do.
Should we keep our minds open to new solutions to the barrage of spam which was (still is) flowing towards the forum? Absolutely! Everyone should feel
free to share their own ideas for improvements, so long as they aren't upset if their proposals get turned down.
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
So, I reiterate, and please, unless you are a mod, hold your peace:
If I write a superior spam-fighting program, will you use it instead of Melgar's?
|
|
j_sum1
Administrator
Posts: 6320
Registered: 4-10-2014
Location: At home
Member Is Offline
Mood: Most of the ducks are in a row
|
|
JJay. We are not about to install competing systems. What Melgar has done is working remarkably well within considerable constraints. He does not have
access to the back end of the system and he has devised something that works. Could something similar be done via a machine learning system? Probably.
Would it be an improvement? Maybe marginal. It is tough to improve on something that is catching most spam within minutes and has made so few errors.
The ultimate solution awaits a new platform.
In summary, you may well be right on technical details. But your approach to this discussion has been counterproductive. I don't think it wise to take
down what Melgar has done just so that you can have your play in the sandbox.
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
My approach to this discussion was not counterproductive. Melgar's approach to this discussion was dishonest and counterproductive. I am extremely
disappointed with the leadership of this forum, and I will be leaving.
|
|
Metacelsus
International Hazard
Posts: 2539
Registered: 26-12-2012
Location: Boston, MA
Member Is Offline
Mood: Double, double, toil and trouble
|
|
@JJay: would you consider collaborating with Melgar to improve the spam filter using machine learning? Spam filters don't have to be solo projects.
|
|
JJay
International Hazard
Posts: 3440
Registered: 15-10-2015
Member Is Offline
|
|
No, not a chance, absolutely not. As a U.S. citizen, I will not be a part of a forum where Melgar is running things. This is nothing personal against
Melgar; seriously, I don't dislike you, Melgar. There are certain individuals I won't work with, though, and Melgar is one of those individuals.
Sorry, that's just how it is.
|
|
Loptr
International Hazard
Posts: 1348
Registered: 20-5-2014
Location: USA
Member Is Offline
Mood: Grateful
|
|
I think this has to do with the posts that Melgar made about his family's accusations of drug synthesis, and with not wanting someone who put all of
that out there so publicly to be even remotely linked to the direction of the forum.
"Question everything generally thought to be obvious." - Dieter Rams
|
|