Poultry¶

Poultry is a tweet collection manager.

User guide¶

Installation¶

Use pip` to install poultry:

$ pip install poultry

Alternatively, you can use virtualenv and pip to install poultry:

$ virtualenv .env
$ .env/bin/pip install poultry

Tip

If you get a strange behavior, run poulry in the verbose mode by adding the -v flag to see the possible problems.

$ .env/bin/poultry -s twitter://sample group -c poultry.cfg -v
2013-12-14 12:40:54,922: poultry.stream - WARNING - An http error occurred. Reconnecting...
Traceback (most recent call last):
  File "/Users/dimazest/Documents/qmul/tools/src/poultry/poultry/stream.py", line 86, in run
    self._run()
  File "/Users/dimazest/Documents/qmul/tools/src/poultry/poultry/stream.py", line 73, in _run
    response.raise_for_status()
  File "/Users/dimazest/Documents/qmul/tools/src/poultry/.env/lib/python2.7/site-packages/requests/models.py", line 765, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 401 Client Error: Unauthorized

The error above suggests that the Twitter credentials are incorrect.

Local tweet collection management¶

It is common to store a collection of tweets in compressed files grouped by hour. For example:

$ tree ./tweets
./tweets
├── 2012-04-19-00.gz
├── 2012-04-20-00.gz
├── 2012-04-21-00.gz
├── 2012-04-21-08.gz
├── 2012-04-21-09.gz
└── 2012-04-21-10.gz

Where each line in the files is the json representation of a tweet. The first two tweets in my collection look like:

$ zcat ./tweets/2012-04-19-00.gz | head -n 2
{"text":"100 days until summer Olympics","id_str":"192764446173708291","coordinates":null,"created_at":"Thu Apr 19 00:00:00 +0000 2012","in_reply_to_status_id_str":null,"favorited":false,"source":"web","in_reply_to_user_id_str":null,"entities":{"urls":[],"user_mentions":[],"hashtags":[]},"contributors":null,"place":null,"in_reply_to_screen_name":null,"in_reply_to_status_id":null,"geo":null,"user":{"is_translator":false,"statuses_count":861,"time_zone":"Quito","profile_background_color":"db4c39","id_str":"395132292","follow_request_sent":null,"verified":false,"profile_background_tile":true,"created_at":"Fri Oct 21 05:40:09 +0000 2011","profile_sidebar_fill_color":"48dbaa","default_profile_image":false,"notifications":null,"friends_count":128,"url":null,"description":"","favourites_count":0,"profile_sidebar_border_color":"e2e83f","followers_count":114,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1807429969\/Spring_2012_009_WarmingFilter_1_normal.jpg","screen_name":"MEL0L407","profile_use_background_image":true,"profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/500309685\/056.JPG","location":"Floridaa","contributors_enabled":false,"lang":"en","geo_enabled":false,"profile_text_color":"0a090a","protected":false,"profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1807429969\/Spring_2012_009_WarmingFilter_1_normal.jpg","listed_count":0,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/500309685\/056.JPG","name":"Melissa Townsend","profile_link_color":"7a0c41","id":395132292,"default_profile":false,"show_all_inline_media":false,"following":null,"utc_offset":-18000},"retweeted":false,"id":192764446173708291,"retweet_count":0,"in_reply_to_user_id":null,"truncated":false}
{"text":"Maeva et...? #ForeverAlone","id_str":"192764447666864129","coordinates":null,"created_at":"Thu Apr 19 00:00:00 +0000 2012","in_reply_to_status_id_str":null,"favorited":false,"source":"web","in_reply_to_user_id_str":null,"entities":{"urls":[],"user_mentions":[],"hashtags":[{"text":"ForeverAlone","indices":[13,26]}]},"contributors":null,"place":{"bounding_box":{"type":"Polygon","coordinates":[[[2.3894531,48.8832118],[2.4279991,48.8832118],[2.4279991,48.9180446],[2.3894531,48.9180446]]]},"place_type":"city","country":"France","url":"http:\/\/api.twitter.com\/1\/geo\/id\/35d2c646704fa4a1.json","country_code":"FR","attributes":{},"full_name":"Pantin, Seine-Saint-Denis","name":"Pantin","id":"35d2c646704fa4a1"},"in_reply_to_screen_name":null,"in_reply_to_status_id":null,"geo":null,"user":{"is_translator":false,"statuses_count":25433,"time_zone":"Paris","profile_background_color":"C0DEED","id_str":"379912464","follow_request_sent":null,"verified":false,"profile_background_tile":true,"created_at":"Sun Sep 25 19:26:25 +0000 2011","profile_sidebar_fill_color":"DDEEF6","default_profile_image":false,"notifications":null,"friends_count":179,"url":null,"description":"Tu m'as pas encore follow ? #RickRossSurToi !  \r\nMake people laugh, nigga that's my motto\r\n#TeamCuisseDodue #TeamSkinnyNigga","favourites_count":22,"profile_sidebar_border_color":"C0DEED","followers_count":236,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1839059455\/IMG-20120218-00089_normal.jpg","screen_name":"JulianSKEETER","profile_use_background_image":true,"profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/528094149\/Women-Ruined-My-life-shirt.jpg","location":"Rack city","contributors_enabled":false,"lang":"fr","geo_enabled":true,"profile_text_color":"333333","protected":false,"profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1839059455\/IMG-20120218-00089_normal.jpg","listed_count":1,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/528094149\/Women-Ruined-My-life-shirt.jpg","name":"Julian Freemann","profile_link_color":"0084B4","id":379912464,"default_profile":false,"show_all_inline_media":false,"following":null,"utc_offset":3600},"retweeted":false,"id":192764447666864129,"retweet_count":0,"in_reply_to_user_id":null,"truncated":false}

Showing the collection to humans¶

poultry show is a command which represents the tweets in the human readable form:

$ zcat ./tweets/2012-04-19-00.gz | head -n 2 | .env/bin/poultry show
MEL0L407: 100 days until summer Olympics
https://twitter.com/#!/MEL0L407/status/192764446173708291
2012-04-19 00:00:00

JulianSKEETER: Maeva et...? #ForeverAlone
https://twitter.com/#!/JulianSKEETER/status/192764447666864129
2012-04-19 00:00:00

Note

poultry can get data from several sources.

It cat read the input from the standard input. It also can read tweets from files in a directory. -s option specifies which directory has to be processed by poultry. If the source starts with twitter:// then the Twitter Streaming API is used, see Twitter streaming API Stream capturing for more details.

Another way to get the tweets from the ./twwets is to specify it via the -s option:

$ .env/bin/poultry show -s ./tweets
MEL0L407: 100 days until summer Olympics
https://twitter.com/#!/MEL0L407/status/192764446173708291
2012-04-19 00:00:00

JulianSKEETER: Maeva et...? #ForeverAlone
https://twitter.com/#!/JulianSKEETER/status/192764447666864129
2012-04-19 00:00:00

... many other tweets from the files in ./tweets ...

Grouping the collection by time¶

Sometimes it is necessary to group a tweet collection to files by tweet’s creation time. poultry group groups the tweets (either from the standard input or from the input directory to chunks which are written to files.

$ .env/bin/poultry group -t 'by_day/%Y-%m-%d.gz' -s ./tweets
by_day/2012-04-19.gz
by_day/2012-04-20.gz
by_day/2012-04-21.gz

-t defines the template. The default template is %Y-%m-%d-%H.gz which groups the tweets by hour and stores them in files in the current directory.

Filter the collection¶

It is possible to filter the tweets of interest from the collection. The tweets can be filtered by three main predicates. Either of them needs to be true in order for a tweet to be filtered:

follow IDs of the users of interest. In the configuration file one ID per line is expected.

track a list of phrases that a tweet should contain to be filtered. In the configuration file one phrase per line is expected.

locations a list of longitude, latitude pairs specifying a set of bounding boxes to filter tweets by. The South-West point of tweet’s place is used if its coordinate is not provided.

It is possible to provide a desired language of the tweets using the language predicate. Note that language is an additional predicate. A main predicate has to be provided if the POST statuses/filter endpoint is used, however it can be the only predicate if filtering is done locally. Also, the language predicate has to be true in order for a tweet to belong to a filter.

split_template defines the destination path of the filtered stream. Use -- to make it output to the console. This is useful in conjunction with the --filters parameter of poultry filter. Then you can have a general definition of a filter that, for example, selects the Dutch tweets and apply it to several collections by redirecting the standard output to different destinations.

The --mode parameter for the filter subcommand that sets the file opening mode. Use w to rewrite the files and a (the default) to append.

An example configuration file ./poultry.cfg:

# A trick to be able to specify hashtags as individual tracking words.
[DEFAULT]
hash = #

# Filter only by one word `work`.
[filter:work]
split_template = ./work-%Y-%m-%d.gz
track = work
follow =
locations =
language =

# Filter tweets with the phrase `visit London`, or
# which are created by or mention the user with ID `47319664`
[filter:london]
split_template = ./london-%Y-%m-%d.gz
track = visit London
follow = 47319664
locations =
language =

# It is possible to mention several phrases
[filter:love-like-hate]
split_template = ./love-like-hate-%Y-%m-%d.gz
track = love
        like
        hate
follow =
locations =
language =

# The Netherlands are defined as two rectangles.
[filter:netherlands]
split_template = ./netherlands-%Y-%m-%d.gz
track =
follow =
locations = 3.734090,51.560411,5.667684,52.493220
            3.821980,51.934515,7.040975,53.687342
# Request only the Dutch tweets from the location above.
language = nl

[filter:hashtags]
split_template = ./hashtags/%Y-%m-%d.gz
track =
        # this line is ignored
        # in the next line %(hash)s is substituted with #
        # see the top of the file for the hash definition.
        %(hash)slisten
        %(hash)smusic
        %(hash)slol
follow =
locations =
language =

[filter:debug]
# Print all English and Russian tweets to the console.
split_template = --
track =
follow =
locations =
language =
           en
           ru

Note

The main predicates (track, follow, locations) in the filter are ORed, meaning that a tweet to be filtered has to satisfy at least one predicate. language is ANDed afterwards.

Note

The directories defined in the split_template have to exist.

To filter the collection run:

$ .env/bin/poultry filter -c ./poultry.cfg  -s ./tweets

Twitter streaming API Stream capturing¶

To get the access to the Twitter Streaming API, you need to create an application at https://apps.twitter.com/ and obtain access_token, access_token_secret, consumer_key and consumer_secret. You can get them from the app dashboard:

and copy to poultry.cfg:

[twitter]
access_token = ...
access_token_secret = ...
consumer_key = ...
consumer_secret = ...

Accessing the public streams¶

Twitter provides several public streams. The most interesting are POST statuses/filter and GET statuses/sample.

POST statuses/filter¶

Returns public statuses that match one or more filter predicates. The filtering predicates are defined in the configuration file:

.env/bin/poultry -s twitter://filter show
GermaineBling: SJ's manager is like the 16th member of SJ 😃✨
https://twitter.com/#!/GermaineBling/status/411832441003704321
2013-12-14 12:18:00

JASMEENAJ: It's like I am seeing myself in the mirror
https://twitter.com/#!/JASMEENAJ/status/411832441045655553
2013-12-14 12:18:00

The best way to collect several streams of tweets is to use the filter command:

$ .env/bin/poultry -s twitter://filter filter -c poultry.cfg -v
./love-like-hate-2013-12-14.gz
./work-2013-12-14.gz
./netherlands-2013-12-14.gz

GET statuses/sample¶

Returns a small random sample of all public statuses:

.env/bin/poultry -s twitter://sample show
Ferry_Chai: @graciel_11 wkwkwkw sama aja boong --"
https://twitter.com/#!/Ferry_Chai/status/411833391395266560
2013-12-14 12:21:46

Fofoll110: RT @itzGhadh: اللهم إشف مرضى السرطان ، و إرحم من رحلوا عن الدُنيا بسببه ♥
https://twitter.com/#!/Fofoll110/status/411833391383052288
2013-12-14 12:21:46

The best way to capture a sample of tweets is to use the group command:

$ .env/bin/poultry -s twitter://sample group -c poultry.cfg
2013-12-14-11.gz
2013-12-14-12.gz
2013-12-14-13.gz

Integration with other tools¶

If you want just to collect tweets and pass their text to your application, you can use the text command, which replaces the new line symbol \n with a space, so it should be safe assume that you will get one tweet per line:

$ .env/bin/poultry -s twitter://sample text -c poultry.cfg
@m1a2n0a1e テキーラはどうですか？
@ru_tiroru 対戦ありがとうございましたー
Boleh kok boleh ka"@ekhaasyari: @deidrays dih gaboleh tah?"
siempre te llevo en mi mente pero ni idea donde estarás

and pipe it’s output to your app:

$ .env/bin/poultry -s twitter://sample text -c poultry.cfg | java TheUltimateTwitterSentimentor

Iterating over a collection of tweets¶

It is possible to iterate over a colletion of tweets:

>>> from poultry import readline_dir

>>> for tweet in readline_dir('./tweets/'):
...     print(tweet.id, tweet.parsed['user']['screen_name'])
535436910794387460 dimazest
535432030939791360 dimazest

New in version 1.3.0

Table of contents¶

Developer guide
- API
  - Producers
  - Consumers
  - Tweet
  - Stream

Poultry¶

User guide¶

Installation¶

Local tweet collection management¶

Showing the collection to humans¶

Grouping the collection by time¶

Filter the collection¶

Twitter streaming API Stream capturing¶

Accessing the public streams¶

POST statuses/filter¶

GET statuses/sample¶

Integration with other tools¶

Iterating over a collection of tweets¶

Table of contents¶

Indices and tables¶