Contributor

@heaven00 heaven00 commented Dec 21, 2017

Labeling

  • Added a Random Forest model to give binary labels for groupings and non-groupings
  • Added string matching to make header and title labeling more robust.

Execution Script

  • Added resume capabilities
  • Extracted the default numeric headers as script parameters
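
A minimal sketch of what passing the default headers as a script parameter could look like, assuming the script uses `argparse`. The option name `--default-headers`, the positional argument names, and the empty default are hypothetical; the PR does not show the actual interface.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a hypothetical execution script."""
    parser = argparse.ArgumentParser(
        description='Process a folder of PDFs and store the output.')
    parser.add_argument('input_folder', help='Folder containing the input PDFs.')
    parser.add_argument('output_folder', help='Folder for generated output.')
    # Hypothetical option: the real parameter name and defaults are not shown in the PR.
    parser.add_argument('--default-headers', nargs='+', default=[],
                        help='Default numeric headers to use while labeling.')
    return parser.parse_args(argv)
```

Keeping the headers out of the code means a different document layout only needs a different command line, not a code change.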

@heaven00 heaven00 changed the title from "ENH: Improving labelling and the execution script" to "ENH: Improving labeling and the execution script" Dec 21, 2017
Member

@gggodhwani gggodhwani left a comment

Please review the comments and make the required changes.

'''
import re
import pandas as pd
import joblib
Member

Add joblib to requirements.txt.

if ('Actuals' in row['text'] or
        'Budget' in row['text'] or
        'Revised' in row['text'] or
        'Estimate' in row['text']):
Member

Make a regex wherever you can, and add it at the top as a constant!
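
One way this suggestion could be applied is sketched here. The constant name `NUMERIC_HEADER_PATTERN` and the helper `is_numeric_header` are hypothetical and do not come from the PR; only the four keywords are taken from the reviewed snippet.

```python
import re

# Hypothetical module-level constant; the keywords come from the reviewed snippet.
NUMERIC_HEADER_PATTERN = re.compile(r'Actuals|Budget|Revised|Estimate')


def is_numeric_header(text):
    """Return True if the text contains any of the header keywords."""
    return NUMERIC_HEADER_PATTERN.search(text) is not None
```

The check then becomes `if is_numeric_header(row['text']):`, and adding a keyword only touches the constant.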

# check capitalization of letters
if row.is_text and row.text.isupper() and pd.isnull(row.label):
    if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in
            row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
Member

Make a regex wherever you can, and add it at the top as a constant!

if row.is_text and row.text.isupper() and pd.isnull(row.label):
    if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in
            row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
            in row.text or 'LOAN EXPENDITURE' in row.text):
Member

Make a regex wherever you can, and add it at the top as a constant!
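
For the title keywords, a sketch along the same lines might look like this. `TITLE_PATTERN` and `has_title_keyword` are hypothetical names; the five phrases are the ones from the condition under review, grouped by their shared second word.

```python
import re

# Hypothetical constant; matches the five title phrases from the reviewed condition.
TITLE_PATTERN = re.compile(
    r'(?:REVENUE|CAPITAL|LOAN) EXPENDITURE|(?:DETAILED|ABSTRACT) ACCOUNT')


def has_title_keyword(text):
    """Return True if the text contains one of the known title phrases."""
    return TITLE_PATTERN.search(text) is not None
```

The surrounding checks (`row.is_text`, `row.text.isupper()`, `pd.isnull(row.label)`) would stay as they are; only the chain of `in` tests is replaced.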

row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
in row.text or 'LOAN EXPENDITURE' in row.text):
row['label'] = 'title'
if 'demand no' in row.text.lower():
Member

Make a regex wherever you can, and add it at the top as a constant!
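
For the `'demand no'` check, a compiled case-insensitive pattern could replace the `.lower()` call. This is only a sketch; `DEMAND_NO_PATTERN` and `mentions_demand_number` are hypothetical names.

```python
import re

# Hypothetical constant; re.IGNORECASE stands in for the .lower() call in the snippet.
DEMAND_NO_PATTERN = re.compile(r'demand no', re.IGNORECASE)


def mentions_demand_number(text):
    """Return True if the text contains 'demand no' in any letter case."""
    return DEMAND_NO_PATTERN.search(text) is not None
```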

-    def __init__(self, img, block_features, page_num, target_folder):
+    def __init__(self, img, block_features, page_num, target_folder,
+                 default_headers):
         self.img = img
Member

Please explain all the init arguments
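
A sketch of the kind of docstring being asked for. The class name `PageLabeler` is hypothetical, and every argument description is an assumption inferred from the argument names; the PR does not document them.

```python
class PageLabeler(object):
    """Hypothetical class name; only the argument names come from the PR."""

    def __init__(self, img, block_features, page_num, target_folder,
                 default_headers):
        """Store the inputs needed to label one page.

        Args:
            img: Image of the page (assumed).
            block_features: Table of per-block features for the page (assumed
                to be a pandas DataFrame).
            page_num (int): Number of the page being processed (assumed).
            target_folder (str): Folder where output is written (assumed).
            default_headers: Default numeric headers passed in from the
                execution script (assumed).
        """
        self.img = img
        self.block_features = block_features
        self.page_num = page_num
        self.target_folder = target_folder
        self.default_headers = default_headers
```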

'''
Check which pdfs are already generated and remove them from the complete
list of pdfs.
'''
Member

Args and return type documentation is missing?
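
A sketch of the resume check with Args and Returns documented. The function name, its parameters, and the rule that an output file shares the base name of its input PDF are all assumptions; the PR only shows the one-line summary.

```python
import os


def get_pending_pdfs(all_pdfs, output_folder):
    """Remove already processed PDFs from the complete list of PDFs.

    Args:
        all_pdfs (list of str): Paths of all input PDFs.
        output_folder (str): Folder holding previously generated output.

    Returns:
        list of str: Paths from all_pdfs that have no output yet, in their
        original order.
    """
    # Assumption: an output file has the same base name as its input PDF.
    done = {os.path.splitext(name)[0] for name in os.listdir(output_folder)}
    return [pdf for pdf in all_pdfs
            if os.path.splitext(os.path.basename(pdf))[0] not in done]
```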

default_headers):
'''Process a folder of demand draft pdfs and store the output in the output
folder.
'''
Member

Args and return type documentation is missing?

vertical_ratio,
page_num,
pdf_file_path,
(25, 20),
Member

Declare (25, 20) as class variables and explain their rationale.

block_features.to_csv('{0}/{1}.csv'.format(features_log_folder,
                                           page_num), index=False)
# Blank page check
if len(block_features.index) > 3:
Member

Declare the constant 3 as a class variable?
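
A sketch of how the threshold could be named. `PageProcessor`, `BLANK_PAGE_MAX_BLOCKS`, and `is_blank` are hypothetical names, and the stated rationale is an assumption based only on the `# Blank page check` comment in the snippet.

```python
class PageProcessor(object):
    """Hypothetical class name used only for illustration."""

    # Assumed rationale: a page with this many blocks or fewer is treated as blank.
    BLANK_PAGE_MAX_BLOCKS = 3

    @classmethod
    def is_blank(cls, block_features):
        """Return True if the page has too few blocks to be worth processing."""
        return len(block_features.index) <= cls.BLANK_PAGE_MAX_BLOCKS
```

The original condition `len(block_features.index) > 3` then reads as `not PageProcessor.is_blank(block_features)`.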
