Merged the two counters into a single counter module named russia_counters under Counters. The main module has been renamed to russia_main.
Added EdgarParser, which wraps all 3 parsers to make the procedure simpler.
This is a project to:
- ✔️ crawl firm statements & periodic reports from EDGAR, a system built by the SEC for firms to upload their statements & reports and for investors to check on the operations of the firms they care about;
- ✔️ extract info about risks from the reports; specifically, we extract and analyse a set of selected items;
- ✔️ train a phrase2vec model to find words similar to those in our dict at hand. The dict concerns the conflict between Ukraine and Russia, and we want to expand it based on the contents of the reports filed by firms that may be affected by the conflict, and
- ✔️ calculate some indicators using the expanded dict.
It would not be easy to crawl EDGAR without the help of an R package called EDGAR. For the docs, check EDGAR docs; for a paper introducing an example of how to use it, check this paper by the authors of the package. In the crawling procedure we simply use the package's ready-made functions to capture overall filing info and download the reports we need from EDGAR.
After the crawling is done, we define a func (check edgarCrawler for details) to record the filing info and export it to an Excel file, including firm CIK codes, filing date, file name, etc.
In this part, we develop several parser classes to extract specific items from different forms, namely:
- Item1A & Item7 under form 10-K;
- PartI Item2 & PartII Item1A under form 10-Q, and
- Exhibit 99.1, Item2.02, Item7.01 & Item8.01 under form 8-K.
All 3 parsers utilise a class called matching_strategies to match the start and end of each item. The matching_strategies class is incorporated into each parser (parsing8K, parsing10K, and parsing10Q). The forms saved in local dirs are parsed, and the items needed are extracted and exported to txt files. We also add methods to the parsers that record whether the parsing and extraction succeeded, as well as dummies telling whether certain words are mentioned in a given item.
Update: A new class EdgarParser that aggregates all three individual parsers is available. Use that instead. An example of how to use the new comprehensive class has been uploaded, too. Check it here.
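As a rough illustration of the start/end matching described above, here is a minimal sketch for Item1A of a 10-K. The regex patterns and the names `extract_item` and `keyword_dummies` are assumptions made for this example; they are not the rules or methods of matching_strategies or EdgarParser, which may be more robust.

```python
import re

# Hypothetical patterns for illustration only; the project's
# matching_strategies class may use different rules.
START_PATTERN = re.compile(
    r"^\s*item\s+1a\.?\s*risk\s+factors", re.IGNORECASE | re.MULTILINE
)
END_PATTERN = re.compile(
    r"^\s*item\s+1b\.?\s*unresolved\s+staff\s+comments",
    re.IGNORECASE | re.MULTILINE,
)


def extract_item(text, start_pattern, end_pattern):
    """Return the text between the last start match and the next end match.

    Taking the last start match is a simple way to skip the table of
    contents, where the item heading usually appears first.
    Returns None when either boundary cannot be found.
    """
    starts = list(start_pattern.finditer(text))
    if not starts:
        return None
    start = starts[-1]
    end = end_pattern.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()


def keyword_dummies(item_text, keywords):
    """Map each keyword to 1 if it occurs in the item (case-insensitive), else 0."""
    lowered = item_text.lower()
    return {word: int(word.lower() in lowered) for word in keywords}


sample = """Table of contents
Item 1A. Risk Factors ........ 12
Item 1B. Unresolved Staff Comments ........ 30

Item 1A. Risk Factors
Sanctions on Russia may disrupt our supply chain.
Item 1B. Unresolved Staff Comments
None.
"""

item_1a = extract_item(sample, START_PATTERN, END_PATTERN)
parsed_ok = item_1a is not None  # success flag, recorded per filing
dummies = keyword_dummies(item_1a or "", ["Russia", "Ukraine"])
print(parsed_ok, dummies)
```

Real filings vary a lot in how headings are written, so a single pair of patterns like this will miss many cases; that is presumably why the matching logic lives in its own class.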
In trian_phrase2vec we train a phrase2vec model on a corpus built by merging all texts under 10-K Item1A and Item7, where a bigram transformer is used to recognise bigram or trigram phrases in the texts. After the training, we use get_similar to expand our dict at hand, trying to find words and/or phrases that relate to the conflict between Ukraine and Russia.
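To make the phrase-recognition step more concrete, the sketch below detects bigrams with a simple count-based score, `(count(a, b) - delta) / (count(a) * count(b))`, and merges them into single tokens. This is a standard-library illustration, not the project's code: the function names, the toy corpus, and the `delta` and `threshold` values are all assumptions, and the embedding training itself is not shown.

```python
from collections import Counter


def find_bigrams(sentences, delta=1, threshold=0.1):
    """Return the set of adjacent word pairs whose score exceeds threshold.

    score = (count(a, b) - delta) / (count(a) * count(b))
    delta discounts pairs that are seen only rarely.
    """
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(
        (a, b) for sent in sentences for a, b in zip(sent, sent[1:])
    )
    return {
        pair
        for pair, n in bigrams.items()
        if (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]]) > threshold
    }


def merge_bigrams(sentence, phrases):
    """Join each detected pair into one token such as 'supply_chain'."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus, invented for this example.
corpus = [
    ["sanctions", "on", "russia", "hurt", "supply", "chain"],
    ["our", "supply", "chain", "depends", "on", "ukraine"],
    ["supply", "chain", "risk", "rose", "on", "sanctions"],
]

phrases = find_bigrams(corpus)
merged_corpus = [merge_bigrams(sent, phrases) for sent in corpus]
print(phrases)
print(merged_corpus[0])
```

Running the same two functions again on `merged_corpus` is one way to pick up trigrams, since an already merged token can pair with a neighbouring word. The merged corpus would then be fed to the embedding model.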
In this part, we develop 2 counters to calculate word freqs in the forms we collect. The main logic is isolated in 2 classes called russia_counter and russia_counter_lemma. Both counters are incorporated into russia_counters, which makes them easier to call. Going over all forms and calculating the word/phrase freqs is handled by russia_main. An example has also been uploaded; check here for more info.
Note: russia_counter calculates the word freq by exact words, i.e. it only counts a word when it appears in exactly the same form as in the dict. The russia_counter_lemma, however, counts a word when it has the same lemma as some word in the dict. The logic is almost the same, so you can go over only russia_counter_lemma in detail, as it has more comments to help you understand.
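The difference between the two counting modes can be sketched as follows. This is only an illustration: `TOY_LEMMAS` is a hypothetical hand-written lemma table standing in for a real lemmatizer, the function names are invented for the example, and only single words are handled (not phrases).

```python
import re
from collections import Counter

# Hypothetical toy lemma table; a real pipeline would use a lemmatizer.
TOY_LEMMAS = {
    "sanctions": "sanction",
    "sanctioned": "sanction",
    "invaded": "invade",
    "invades": "invade",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemma(token):
    """Look up the lemma of a token, falling back to the token itself."""
    return TOY_LEMMAS.get(token, token)


def count_exact(text, dictionary):
    """Count a token only when it matches a dict word exactly."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in dictionary}


def count_by_lemma(text, dictionary):
    """Count a token when it shares a lemma with a dict word."""
    counts = Counter(lemma(token) for token in tokenize(text))
    return {word: counts[lemma(word)] for word in dictionary}


dictionary = ["sanction", "invade", "russia"]
text = (
    "Sanctions on Russia followed after Russia invaded Ukraine; "
    "one sanction targets banks."
)

exact = count_exact(text, dictionary)
by_lemma = count_by_lemma(text, dictionary)
print(exact)     # {'sanction': 1, 'invade': 0, 'russia': 2}
print(by_lemma)  # {'sanction': 2, 'invade': 1, 'russia': 2}
```

Here "Sanctions" and "invaded" are missed by the exact counter but picked up by the lemma-based one, which is the trade-off between the two counters.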