EDGAR (SEC EDGAR report crawling and report item extraction)


(R4/09/06) Updates

Aggregated the two counters into a single counter module named russia_counters under Counters. The main module is renamed to russia_main.

(R4/09/05) Updates

Added EdgarParser, which wraps all three parsers to simplify the procedure.

Introduction

This is a project to:

  • ✔️ crawl firm statements & periodic reports from EDGAR, the system built by the SEC for firms to upload their statements & reports and for investors to check the operations of the firms they care about;
  • ✔️ extract information about risks from the reports; specifically, the items that we want to extract and analyse;
  • ✔️ train a phrase2vec model to find words similar to those in our dictionary at hand. The dictionary concerns the conflict between Ukraine and Russia, and we want to expand it based on the contents of reports filed by firms that may be affected by the conflict; and
  • ✔️ calculate some indicators using the expanded dictionary.

Structure of the project

1. EDGAR CRAWLER

It would not be easy to crawl EDGAR without the help of an R package called edgar. For the docs, check the edgar docs; for a paper introducing an example of how to use it, check this paper by the authors of the package. In the crawling procedure we simply use the well-packaged functions to capture overall filing info and download the reports we need from EDGAR.
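The R package does the heavy lifting, but the layout of EDGAR archive URLs is worth knowing when checking a downloaded filing against its source. A minimal sketch (the helper name and the example accession number are illustrative, not from the repo):

```python
def edgar_filing_index_url(cik: str, accession: str) -> str:
    """Build the EDGAR archive URL for a filing's index page.

    cik       -- firm CIK code, with or without leading zeros
    accession -- accession number, e.g. '0000320193-22-000108'
    """
    cik_clean = str(int(cik))                 # EDGAR paths drop leading zeros
    acc_nodash = accession.replace("-", "")   # directory name has no dashes
    return (f"https://www.sec.gov/Archives/edgar/data/"
            f"{cik_clean}/{acc_nodash}/{accession}-index.htm")

print(edgar_filing_index_url("0000320193", "0000320193-22-000108"))
```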

After the crawling is done, we define a function (check edgarCrawler for details) to help us record and export the filing info to an Excel file, including firm CIK codes, filing dates, file names, etc.
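The repo exports this info to an Excel file; the same record-and-export idea can be sketched with the standard library, writing CSV instead of Excel (the field names and records here are hypothetical):

```python
import csv

# Hypothetical filing records; the real crawler collects these from EDGAR.
filings = [
    {"cik": "320193", "form": "10-K", "date_filed": "2022-10-28",
     "file_name": "0000320193-22-000108.txt"},
    {"cik": "789019", "form": "10-Q", "date_filed": "2022-10-25",
     "file_name": "0000789019-22-000090.txt"},
]

# Write one row per filing, with a header row of column names.
with open("filing_info.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["cik", "form", "date_filed", "file_name"])
    writer.writeheader()
    writer.writerows(filings)
```

Swapping the csv calls for pandas' DataFrame.to_excel would give the Excel output the repo uses.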

2. FORM PARSERS & ITEM EXTRACTION

In this part, we develop several parser classes to extract specific items from different forms, namely:

  • Item 1A & Item 7 under form 10-K;
  • Part I Item 2 & Part II Item 1A under form 10-Q, and
  • Exhibit 99.1, Item 2.02, Item 7.01 & Item 8.01 under form 8-K.

All three parsers utilise a class called matching_strategies to match the start and end of each item. matching_strategies is incorporated into each parser (parsing8K, parsing10K, and parsing10Q); the forms saved in local dirs are parsed, and the items needed are extracted and exported to txt files. Meanwhile, we add some methods to the parsers to record whether parsing and extraction succeeded, as well as some dummies telling whether certain words are mentioned in a given item or not.
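The real matching_strategies class is more elaborate, but the core start/end matching idea can be sketched with regular expressions (the function name and patterns below are illustrative, not the repo's):

```python
import re

def extract_item(text: str, start_pat: str, end_pat: str) -> str:
    """Return the slice of `text` between the last match of `start_pat`
    and the first following match of `end_pat` (case-insensitive).
    Taking the LAST start match skips table-of-contents mentions."""
    starts = list(re.finditer(start_pat, text, flags=re.IGNORECASE))
    if not starts:
        return ""
    begin = starts[-1].end()
    end = re.search(end_pat, text[begin:], flags=re.IGNORECASE)
    return text[begin:begin + end.start()] if end else text[begin:]

sample = ("Item 1A. Risk Factors .... 5\n"   # table-of-contents line
          "Item 1A. Risk Factors\nWar in Europe may hurt sales.\n"
          "Item 1B. Unresolved Staff Comments\n")
risks = extract_item(sample, r"Item\s*1A\.?\s*Risk\s*Factors", r"Item\s*1B")
```

The same pattern pairs, with form-specific regexes, cover the 10-K, 10-Q, and 8-K items listed above.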

Update: A new class, EdgarParser, that aggregates all three individual parsers is available; use it instead. An example of how to use the new comprehensive class has been uploaded, too. Check it here.

3. PHRASE2VEC

In trian_phrase2vec we train a phrase2vec model on a corpus constructed by merging all texts under 10-K Item 1A and Item 7, where a bigram transformer is adopted to recognise bigram and trigram phrases in the texts. After training, we use get_similar to expand our dictionary at hand, looking for words and/or phrases related to the conflict between Ukraine and Russia.
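The bigram transformer (e.g. gensim's Phrases) scores adjacent word pairs with the formula from Mikolov et al.'s word2vec paper and keeps high-scoring pairs as phrases. A dependency-free sketch of that scoring, on a toy corpus with illustrative min_count and threshold values:

```python
from collections import Counter

def find_bigrams(sentences, min_count=2, threshold=0.3):
    """Score adjacent word pairs with the word2vec phrase formula:
    (count(ab) - min_count) * vocab_size / (count(a) * count(b)),
    keeping pairs that are frequent enough and score above `threshold`."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab = len(word_counts)
    bigrams = {}
    for (a, b), n_ab in pair_counts.items():
        score = (n_ab - min_count) * vocab / (word_counts[a] * word_counts[b])
        if n_ab >= min_count and score > threshold:
            bigrams[(a, b)] = score
    return bigrams

corpus = [["supply", "chain", "disruption"],
          ["supply", "chain", "risk"],
          ["global", "supply", "chain"],
          ["risk", "factors"]]
print(find_bigrams(corpus))  # only ("supply", "chain") survives
```

Detected bigrams are then rewritten as single tokens (e.g. supply_chain) before word2vec training, which is what lets get_similar return multi-word phrases.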

4. COUNTERS

In this part, we develop two counters to calculate word frequencies in the forms we collect. The main logic is isolated in two classes, russia_counter and russia_counter_lemma. Both counters are incorporated into russia_counters, which makes them easier to call. The procedure of going over all forms and calculating the word/phrase frequencies is performed by russia_main. An example has also been uploaded; check here for more info.

Note: russia_counter calculates word frequency by exact match, i.e. a word is counted only when it appears in exactly the same form as in the dict. russia_counter_lemma, however, counts a word whenever it shares a lemma with some word in the dict. The logic is otherwise almost the same, so you only need to go over russia_counter_lemma in detail, as it has more comments to help you understand.
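The difference between the two counting strategies can be sketched as follows, with a toy lemma map standing in for a real lemmatiser such as spaCy or NLTK (all names, the sample text, and the dictionary here are illustrative):

```python
import re
from collections import Counter

# Toy lemma map; the repo's counter would use a real lemmatiser instead.
LEMMAS = {"sanctions": "sanction", "sanctioned": "sanction",
          "invasions": "invasion"}

def count_exact(text, dictionary):
    """russia_counter-style: count only exact surface forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {w: counts[w] for w in dictionary}

def count_lemma(text, dictionary):
    """russia_counter_lemma-style: count tokens sharing a dict lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    dict_lemmas = {LEMMAS.get(w, w) for w in dictionary}
    return sum(1 for t in tokens if LEMMAS.get(t, t) in dict_lemmas)

text = "New sanctions were imposed; being sanctioned hurt exports."
print(count_exact(text, ["sanctions"]))   # {'sanctions': 1}
print(count_lemma(text, ["sanctions"]))   # 2 ('sanctions' + 'sanctioned')
```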

About

Repo to save code for EDGAR projects
