Description
To play: `python main.py --csv-file raw_dataframe.csv`
Then add flags to explore the parameter space (a combined example invocation appears after the list):
- `--source-thresh SOURCE_THRESH`: Min % of events a news source must cover to be included.
  Default 0.5; lowering this would include a broader set of news sources.
- `--min-article-length MIN_ARTICLE_LENGTH`: Min # of words in an article (pre-parsing).
  Set to 250. Are longer articles more biased?
- `--min-vocab-length MIN_VOCAB_LENGTH`: Min # of words in an article (post-lemmatizing and vectorizing).
  Set to 100. Are longer articles more biased?
- `--lda-min-appearances LDA_MIN_APPEARANCES`: Min # of appearances of a word for it to be included in the vocabulary.
  Set to 2. Could be raised to focus on the most common words.
- `--lda-vectorization-type {count,tfidf}`: Type of vectorization from article text to word counts.
  Set to `count`. Not 100% sure `tfidf` is working, but if it is, we should use it.
- `--lda-groupby {source,article}`: Run LDA on text separated by article, or by news source?
  Set to `article` right now. This just determines what the "documents" (sets of words) sent into LDA are: individual articles, or all of a source's articles aggregated together (see the sketch after this list).
- `--lda-topics LDA_TOPICS`: # of LDA topics.
  Set to 10. Clusters indicate that a higher number could be helpful.
- `--lda-iters LDA_ITERS`: # of LDA iterations.
  Set to 1500. Could probably be lowered for larger datasets.
- `--truth-frequency-thresh TRUTH_FREQUENCY_THRESH`: % of articles in a news event that must mention a word for it to be treated as "truth" and removed.
  Set to 0.5. Could be higher (e.g. 1.1, forcing no words to be removed) or lower (e.g. 0.1, removing most words and leaving only infrequent words as bias candidates). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage. A sketch of this filter appears after the list.
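For example, a run that keeps the settings described above but raises the topic count would look like this (values are illustrative, not recommendations):

`python main.py --csv-file raw_dataframe.csv --source-thresh 0.5 --min-article-length 250 --min-vocab-length 100 --lda-min-appearances 2 --lda-vectorization-type count --lda-groupby article --lda-topics 20 --lda-iters 1500 --truth-frequency-thresh 0.5`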
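To make the `--lda-groupby` choice concrete, here is a minimal sketch of the two ways of assembling LDA documents. The function name and the `(source, tokens)` input shape are invented for illustration; the repository's actual code may differ.

```python
from collections import defaultdict

def build_lda_documents(articles, groupby="article"):
    """Assemble the "documents" (bags of words) handed to LDA.

    articles: list of (source, tokens) pairs, where tokens is a list of words.
    groupby="article" keeps one document per article; groupby="source"
    concatenates every article from the same news source into one document.
    """
    if groupby == "article":
        return [tokens for _, tokens in articles]
    by_source = defaultdict(list)
    for source, tokens in articles:
        by_source[source].extend(tokens)
    return list(by_source.values())

# One document per source: both "ap" articles are merged.
docs = build_lda_documents(
    [("ap", ["fire", "downtown"]), ("ap", ["protest"]), ("nyt", ["fire"])],
    groupby="source",
)
# docs == [["fire", "downtown", "protest"], ["fire"]]
```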
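And a minimal sketch of the `--truth-frequency-thresh` idea, again as an illustration rather than the repository's implementation: within one news event, words mentioned by at least the threshold fraction of articles are treated as shared "truth" and removed, leaving the less frequent words as candidate bias vocabulary.

```python
from collections import Counter

def filter_truth_words(event_articles, thresh=0.5):
    """Remove "truth" words: words mentioned by at least `thresh` of the
    articles covering a single news event. Whatever survives is kept as
    candidate bias vocabulary.

    event_articles: list of token lists, all covering the same news event.
    """
    n = len(event_articles)
    doc_freq = Counter()  # how many of the event's articles mention each word
    for tokens in event_articles:
        doc_freq.update(set(tokens))
    return [
        [w for w in tokens if doc_freq[w] / n < thresh]
        for tokens in event_articles
    ]

event = [
    ["fire", "downtown", "arson"],
    ["fire", "downtown", "accident"],
    ["fire", "protest"],
]
print(filter_truth_words(event, thresh=0.5))
# [['arson'], ['accident'], ['protest']] -- "fire" and "downtown" are shared "truth"
```

With `thresh=1.1` nothing clears the bar, so no words are removed, matching the note on that flag above.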