This is a lightweight, agent-based framework for scheduling workflows of scraping tasks, mainly focused on Twitter.
-
Fill in config.txt with your own values.
-
Next, run the agent:
python agent.py
This agent runs in the background and carries out the tasks that are
assigned to it. You can schedule tasks using another command-line tool, monitor.py.
To sample from the Twitter stream:
python monitor.py --tasks add StartStream
To pull down tweets for @edwardbenson:
python monitor.py --tasks add ScrapeUser 0 0 @edwardbenson
To pull down tweets for @edwardbenson, refreshing every day (86400 seconds):
python monitor.py --tasks add ScrapeUser 1 86400 @edwardbenson
To sample from the stream for tweets that match the query terms "Red Sox" and Redsox, and print them out every 10 seconds:
python monitor.py --tasks add StartFilterStream 0 0 "Red Sox" Redsox
python monitor.py --tasks add DumpTweets 1 10
To sample from the stream and, every hour, pull down the last 20 tweets from 10 random users:
python monitor.py --tasks add StartStream
python monitor.py --tasks add PullRandomUsers 1 3600 10 20
To shut down the agent:
python monitor.py --tasks add Die
To add a task:
python monitor.py --tasks add <TaskName> <Repeat> <Delta>
When adding a task, <Repeat> is either 0 (run once) or 1 (repeat), and <Delta> is the time
in seconds between repetitions. Set <Delta> to 0 if <Repeat> is also 0.
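
If you schedule many tasks from a script, the <Repeat> and <Delta> arguments can be derived from a single interval. The following is a minimal sketch, not part of this framework: build_task_command is a hypothetical helper, and it assumes <Delta> is measured in seconds, as the 86400 (one day) example suggests.

```python
import shlex
from datetime import timedelta


def build_task_command(task_name, *task_args, every=None):
    """Return the argv list for `python monitor.py --tasks add ...`.

    `every` is an optional timedelta. None means run once (Repeat=0, Delta=0).
    Hypothetical helper; Delta is assumed to be in seconds.
    """
    if every is None:
        repeat, delta = 0, 0
    else:
        delta = int(every.total_seconds())
        if delta <= 0:
            raise ValueError("a repeating task needs a positive interval")
        repeat = 1
    return ["python", "monitor.py", "--tasks", "add", task_name,
            str(repeat), str(delta), *map(str, task_args)]


if __name__ == "__main__":
    # Refresh @edwardbenson's tweets once a day.
    daily = build_task_command("ScrapeUser", "@edwardbenson",
                               every=timedelta(days=1))
    print(shlex.join(daily))
    # One-off filtered stream; shlex.join quotes the argument with a space.
    print(shlex.join(build_task_command("StartFilterStream", "Red Sox", "Redsox")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with arguments such as "Red Sox".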
To list tasks:
python monitor.py --tasks list
Pulls tweets from the provided user, such as @edwardbenson
python monitor.py --tasks add ScrapeUser <repeat> <delta> <username>
Starts pulling from the Twitter stream
python monitor.py --tasks add StartStream 0 0
Stops pulling from the Twitter stream
python monitor.py --tasks add StopStream 0 0
Starts pulling from the filtered Twitter stream with the provided args
python monitor.py --tasks add StartFilterStream 0 0 <arg1> .. <argN>
Stops pulling from the filtered Twitter stream
python monitor.py --tasks add StopFilterStream 0 0
Selects N random users who have only 1 tweet recorded in the database and pulls each user's latest M tweets.
python monitor.py --tasks add PullRandomUsers 0 0 <N> <M>
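
The user selection described for PullRandomUsers can be illustrated with a small query. This is only a sketch under assumptions: the framework's actual database and schema are not described here, so the sqlite3 backend, the tweets table, and its user column are all hypothetical.

```python
import sqlite3


def pick_single_tweet_users(conn, n):
    """Return up to n random users that have exactly one stored tweet.

    Assumes a hypothetical table tweets(user, text); the real schema may differ.
    """
    rows = conn.execute(
        "SELECT user FROM tweets "
        "GROUP BY user HAVING COUNT(*) = 1 "
        "ORDER BY RANDOM() LIMIT ?",
        (n,),
    ).fetchall()
    return [row[0] for row in rows]


def make_demo_db():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (user TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?)",
        [("alice", "hi"), ("bob", "one"), ("bob", "two"), ("carol", "hey")],
    )
    return conn


if __name__ == "__main__":
    # bob has two tweets, so only alice and carol are candidates.
    print(pick_single_tweet_users(make_demo_db(), 10))
```

Each selected user would then be scraped for their latest M tweets, in the same way ScrapeUser handles a single user.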
Implementing your own tasks is easy. Just dig around in the tasks/tasks.py file to see examples.
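
To give a feel for the <Repeat> and <Delta> semantics before you read the real code, here is a self-contained sketch. It does not reproduce the classes in tasks/tasks.py: Task, Heartbeat, and run_until are hypothetical names, and the loop runs on a simulated clock instead of sleeping.

```python
import heapq
import itertools


class Task:
    """Hypothetical base class; the real one in tasks/tasks.py may differ."""

    def __init__(self, repeat=0, delta=0, args=()):
        if repeat and delta <= 0:
            raise ValueError("a repeating task needs a positive delta")
        self.repeat = repeat
        self.delta = delta
        self.args = list(args)

    def run(self):
        raise NotImplementedError


class Heartbeat(Task):
    """Example custom task: counts how many times it has been run."""

    def __init__(self, repeat=0, delta=0, args=()):
        super().__init__(repeat, delta, args)
        self.runs = 0

    def run(self):
        self.runs += 1


def run_until(tasks, end_time):
    """Run tasks on a simulated clock until end_time (seconds, inclusive)."""
    counter = itertools.count()  # tie-breaker so Task objects are never compared
    queue = [(0, next(counter), task) for task in tasks]
    heapq.heapify(queue)
    log = []
    while queue and queue[0][0] <= end_time:
        now, _, task = heapq.heappop(queue)
        task.run()
        log.append((now, type(task).__name__))
        if task.repeat:
            # Repeating tasks are re-queued delta seconds later.
            heapq.heappush(queue, (now + task.delta, next(counter), task))
    return log


if __name__ == "__main__":
    beat = Heartbeat(repeat=1, delta=10)
    print(run_until([beat], 25), beat.runs)
```

A one-off task (repeat 0) runs once and is dropped, while a repeating task keeps returning to the queue until the agent stops.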