• 0

[Python] Filtering Spammy strings


Question

I am currently working on an analytics project where I have hundreds of thousands of social media interactions in JSON format. The most essential field I need is the "post" field, which contains what a user has posted (Facebook post, tweet, etc.). So suppose I had a bunch of posts:

{"post":"gBuy Exclusive hMedz Todayl"}

{"post":"We offer free pdf ebooks of In Town: Contemporary Design for Urban Living.pdf download free"}

{"post":"ONE58PRINCEOFWALESROAD - ONE58 PRINCE OF WALES"}

{"post":"The Ingrid Pitt legacy | trinkelbonker, First title is 

5 answers to this question

  • 0

Spam is going to be very subjective: how should the aggregator know whether you are interested in cheap meds or not?

First, I would think about how you define legitimate interactions and see if there is an easy way to filter based on that. For example, Facebook messages sent by a person who is not a friend of the recipient may not be interactions you care about, so they can be filtered out.
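
As a rough sketch of that kind of rule-based filter, assuming each interaction carries sender and friend-list metadata (the `sender_id` and `recipient_friend_ids` field names and the sample records below are hypothetical, not taken from any real API):

```python
import json

def is_from_friend(interaction):
    """Return True if the sender is in the recipient's friend list.

    The 'sender_id' and 'recipient_friend_ids' keys are hypothetical;
    substitute whatever metadata your data actually provides.
    """
    friends = set(interaction.get("recipient_friend_ids", []))
    return interaction.get("sender_id") in friends

# Made-up sample records, one JSON object per line.
raw_lines = [
    '{"post": "See you at lunch?", "sender_id": 7, "recipient_friend_ids": [7, 9]}',
    '{"post": "Buy Exclusive Medz Today", "sender_id": 42, "recipient_friend_ids": [7, 9]}',
]

interactions = [json.loads(line) for line in raw_lines]
legitimate = [item for item in interactions if is_from_friend(item)]
print([item["post"] for item in legitimate])
```

Records without the expected metadata are treated as not coming from a friend here; whether that default is right depends on your data.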

 

With the machine learning approach you are going to need a lot of messages that you have manually classified as spam. Some of these papers might be useful.
For example, this paper used "Around 25K users, 500K tweets, and 49M follower/friend relationships" from Twitter to classify spam. So you really do need a lot of data!
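
To give a feel for what training on labeled messages involves, here is a minimal naive Bayes sketch in plain Python. This is not the method from the paper mentioned above; the technique, the four training posts, and the "spam"/"ham" label names are all assumptions chosen for illustration, and a real classifier would need far more labeled data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def train(labeled_posts):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in labeled_posts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the higher add-one-smoothed log-probability.

    Assumes the training data contained at least one post per label.
    """
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny made-up training set, for illustration only.
training = [
    ("buy cheap meds today", "spam"),
    ("special offer free download", "spam"),
    ("had a great lunch with friends", "ham"),
    ("watching the game tonight", "ham"),
]
word_counts, doc_counts = train(training)
print(classify("free meds special offer", word_counts, doc_counts))
```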

  • 0

Thanks for all the suggestions. Since this is just a proof-of-concept project, I was told that simply searching each string for typical spam terms ("special offer", "viagra", etc.) is fine for now. In the future I will look into more complex libraries.
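
A minimal sketch of that keyword check, assuming one JSON object per line with a "post" field as in the question. The term list and the `looks_spammy` helper name are illustrative only:

```python
import json

# Hypothetical term list; extend it with whatever shows up in your data.
SPAM_TERMS = ("special offer", "viagra", "download free")

def looks_spammy(post, terms=SPAM_TERMS):
    """Return True if any spam term occurs in the post, ignoring case."""
    text = post.lower()
    return any(term in text for term in terms)

# Sample input: one JSON object per line.
raw_lines = [
    '{"post": "We offer free pdf ebooks, download free"}',
    '{"post": "ONE58PRINCEOFWALESROAD - ONE58 PRINCE OF WALES"}',
]
posts = [json.loads(line)["post"] for line in raw_lines]
clean = [p for p in posts if not looks_spammy(p)]
print(clean)
```

Plain substring matching like this will miss obfuscated spellings such as "hMedz", so it is only a first pass.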
