Description
To play: `python main.py --csv-file raw_dataframe.csv`
Then add flags to explore the parameter space (a combined example invocation appears after the list):
- `--source-thresh SOURCE_THRESH`: Min % of events a news source must cover to be included.
  Default 0.5; lowering this would include a broader set of news sources.
- `--min-article-length MIN_ARTICLE_LENGTH`: Min # of words in an article (pre-parsing).
  Set to 250. Are longer articles more biased?
- `--min-vocab-length MIN_VOCAB_LENGTH`: Min # of words in an article (post-lemmatizing and vectorizing).
  Set to 100. Are longer articles more biased?
- `--lda-min-appearances LDA_MIN_APPEARANCES`: Min # of appearances of a word for it to be included in the vocabulary.
  Set to 2. Could be raised to focus on the most common words.
- `--lda-vectorization-type {count,tfidf}`: Type of vectorization from article text to word counts.
  Set to `count`. Not 100% sure `tfidf` is working, but if it is, we should use it.
- `--lda-groupby {source,article}`: Run LDA on text separated by article, or by news source?
  Set to `article` right now. This just determines what the "documents" (sets of words) sent into LDA are: individual articles, or all of a source's articles aggregated together (see the sketch after this list).
- `--lda-topics LDA_TOPICS`: # of LDA topics.
  Set to 10. Clusters indicate that a higher number could be helpful.
- `--lda-iters LDA_ITERS`: # of LDA iterations.
  Set to 1500. Could probably be lowered for larger datasets.
- `--truth-frequency-thresh TRUTH_FREQUENCY_THRESH`: % of articles in a news event that must mention a word for it to be treated as "truth" and removed.
  Set to 0.5. Could be higher (e.g. 1.1, forcing no words to be removed) or lower (e.g. 0.1, removing most words and leaving only infrequent words as bias candidates). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage. A sketch of this filter appears after the list.
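For example, a run that keeps the settings described above but raises the topic count would look like this (values are illustrative, not recommendations):

`python main.py --csv-file raw_dataframe.csv --source-thresh 0.5 --min-article-length 250 --min-vocab-length 100 --lda-min-appearances 2 --lda-vectorization-type count --lda-groupby article --lda-topics 20 --lda-iters 1500 --truth-frequency-thresh 0.5`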
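To make the `--lda-groupby` choice concrete, here is a minimal sketch of the two ways of assembling LDA documents. The function name and the `(source, tokens)` input shape are invented for illustration; the repository's actual code may differ.

```python
from collections import defaultdict

def build_lda_documents(articles, groupby="article"):
    """Assemble the "documents" (bags of words) handed to LDA.

    articles: list of (source, tokens) pairs, where tokens is a list of words.
    groupby="article" keeps one document per article; groupby="source"
    concatenates every article from the same news source into one document.
    """
    if groupby == "article":
        return [tokens for _, tokens in articles]
    by_source = defaultdict(list)
    for source, tokens in articles:
        by_source[source].extend(tokens)
    return list(by_source.values())

# One document per source: both "ap" articles are merged.
docs = build_lda_documents(
    [("ap", ["fire", "downtown"]), ("ap", ["protest"]), ("nyt", ["fire"])],
    groupby="source",
)
# docs == [["fire", "downtown", "protest"], ["fire"]]
```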
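And a minimal sketch of the `--truth-frequency-thresh` idea, again as an illustration rather than the repository's implementation: within one news event, words mentioned by at least the threshold fraction of articles are treated as shared "truth" and removed, leaving the less frequent words as candidate bias vocabulary.

```python
from collections import Counter

def filter_truth_words(event_articles, thresh=0.5):
    """Remove "truth" words: words mentioned by at least `thresh` of the
    articles covering a single news event. Whatever survives is kept as
    candidate bias vocabulary.

    event_articles: list of token lists, all covering the same news event.
    """
    n = len(event_articles)
    doc_freq = Counter()  # how many of the event's articles mention each word
    for tokens in event_articles:
        doc_freq.update(set(tokens))
    return [
        [w for w in tokens if doc_freq[w] / n < thresh]
        for tokens in event_articles
    ]

event = [
    ["fire", "downtown", "arson"],
    ["fire", "downtown", "accident"],
    ["fire", "protest"],
]
print(filter_truth_words(event, thresh=0.5))
# [['arson'], ['accident'], ['protest']] -- "fire" and "downtown" are shared "truth"
```

With `thresh=1.1` nothing clears the bar, so no words are removed, matching the note on that flag above.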