processing_code

jbanda

Version 11 and dailies from 5/23, 5/22 and 5/21

83a76a1 · May 25, 2020

Name	Last commit message	Last commit date
README.md	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
combine1grams.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
combineNgrams.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
fields.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
getDataset.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
getDataset_clean.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
getStats.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
get_1grams.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
get_ngrams.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
parse_json_extreme.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
parse_json_extreme_cleantweets.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020
parse_json_lite.py	Version 11 and dailies from 5/23, 5/22 and 5/21	May 25, 2020

README.md

Processing code

Sorry for the lack of documentation here; the processing code will be updated shortly.

The order of processing goes:

1. Parse the hydrated tweet JSON into TSV files with parse_json_extreme.py. If you want clean tweets (no RTs), use parse_json_extreme_cleantweets.py instead.
2. Apply get_1grams.py to the parsed TSV files to get the term frequencies.
3. Apply get_ngrams.py to the parsed TSV files to get the bigrams and trigrams.
4. Combine all 1grams generated from each TSV file with combine1grams.py.
5. Combine all ngrams generated from each TSV file with combineNgrams.py.
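
To give a rough idea of what the first step involves, here is a minimal sketch of turning hydrated tweets into TSV rows, optionally skipping retweets. It is not the code in parse_json_extreme.py. It assumes the hydrated file holds one JSON object per line and that tweets use the Twitter API v1.1 field names `id_str`, `created_at`, `full_text` (or `text`) and `retweeted_status`. The function name `tweets_to_tsv` and the three-column layout are hypothetical.

```python
import csv
import io
import json


def tweets_to_tsv(json_lines, out, skip_retweets=False):
    """Write one TSV row (id, created_at, text) per tweet; return rows written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        tweet = json.loads(line)
        # Assumption: retweets carry a "retweeted_status" object.
        if skip_retweets and "retweeted_status" in tweet:
            continue
        # Assumption: extended tweets use "full_text", older ones "text".
        text = tweet.get("full_text") or tweet.get("text", "")
        # Collapse tabs and newlines so each tweet stays on one TSV line.
        text = " ".join(text.split())
        writer.writerow([tweet.get("id_str", ""), tweet.get("created_at", ""), text])
        written += 1
    return written


# Tiny in-memory demo: one original tweet and one retweet.
sample = [
    json.dumps({"id_str": "1", "created_at": "Sat May 23 00:00:00 +0000 2020",
                "full_text": "stay home\nstay safe"}),
    json.dumps({"id_str": "2", "created_at": "Sat May 23 00:01:00 +0000 2020",
                "full_text": "RT @a: stay home", "retweeted_status": {"id_str": "1"}}),
]
buffer = io.StringIO()
rows = tweets_to_tsv(sample, buffer, skip_retweets=True)
print(rows)
print(buffer.getvalue(), end="")
```

With `skip_retweets=True` only the first tweet is written, which mirrors the "clean" variant described above. With the default, both rows are kept.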

These steps should be enough to parse the hydrated JSON tweet files and calculate the ngrams.
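
The counting and combining steps can be sketched as follows. This is an illustration under stated assumptions, not the logic of get_1grams.py, get_ngrams.py or the combine scripts. It assumes whitespace tokenization on lowercased text and treats each list of strings as the tweet texts from one parsed TSV file. The helper names `tokenize`, `count_ngrams` and `combine` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; the real scripts may
    # strip punctuation, URLs or stop words.
    return text.lower().split()


def count_ngrams(texts, n):
    """Count n-grams (n=1 gives term frequencies) over an iterable of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def combine(counters):
    """Sum per-file counters into one overall counter."""
    total = Counter()
    for counter in counters:
        total.update(counter)  # Counter.update adds counts
    return total


# Two stand-ins for the text column of two parsed TSV files.
file_a = ["stay home stay safe"]
file_b = ["Stay home"]

unigrams = combine([count_ngrams(file_a, 1), count_ngrams(file_b, 1)])
bigrams = combine([count_ngrams(file_a, 2), count_ngrams(file_b, 2)])

print(unigrams.most_common(1))  # [('stay', 3)]
print(bigrams["stay home"])     # 2
```

Counting per file and summing afterwards keeps memory use bounded by one file at a time, which is the reason for separating the counting steps from the combining steps.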