Commit c41c6f8

feat: improve sentence splitting (#72)
1 parent 4a5439e commit c41c6f8

2 files changed: +52 -9 lines changed

src/raglite/_split_sentences.py

Lines changed: 13 additions & 9 deletions
@@ -25,21 +25,25 @@ def get_markdown_heading_indexes(doc: str) -> list[tuple[int, int]]:
                 start_line, end_line = token.map  # type: ignore[misc]
                 heading_start = char_idx[start_line]
                 heading_end = char_idx[end_line]
-                headings.append((heading_start, heading_end))
+                headings.append((heading_start, heading_end + 1))
         return headings

     headings = get_markdown_heading_indexes(doc.text)
     for heading_start, heading_end in headings:
-        # Mark the start of a heading as a new sentence.
+        # Extract this heading's tokens.
+        heading_tokens = []
         for token in doc:
-            if heading_start <= token.idx:
-                token.is_sent_start = True
-                break
-        # Mark the end of a heading as a new sentence.
-        for token in doc:
-            if heading_end <= token.idx:
-                token.is_sent_start = True
+            if heading_start <= token.idx < heading_end:
+                heading_tokens.append(token)  # Include the tokens strictly part of the heading.
+            elif heading_tokens:
+                heading_tokens.append(token)  # Include the first token after the heading.
                 break
+        # Mark the start of the heading as a new sentence, the heading body as not containing
+        # sentence boundaries, and the first token after the heading as a new sentence.
+        heading_tokens[0].is_sent_start = True
+        heading_tokens[-1].is_sent_start = True
+        for token in heading_tokens[1:-1]:
+            token.is_sent_start = False
     return doc



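The hunk above can be tried without spaCy. The following is only a sketch under stated assumptions: `Token` is a hypothetical stand-in for a spaCy token, `line_start_offsets` is a guess at how `char_idx` is built (the hunk uses `char_idx` but does not show its construction), and the early return for an empty match is an addition that is not part of the commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token (not the real class)."""

    text: str
    idx: int  # Character offset of the token in the document.
    is_sent_start: Optional[bool] = None  # None means "undecided".


def line_start_offsets(text: str) -> list[int]:
    """Assumed shape of `char_idx`: the offset at which each line starts, plus the end offset."""
    offsets = [0]
    for line in text.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))
    return offsets


def mark_heading(tokens: list[Token], heading_start: int, heading_end: int) -> None:
    """Mark the span [heading_start, heading_end) as one sentence, mirroring the new loop."""
    heading_tokens: list[Token] = []
    for token in tokens:
        if heading_start <= token.idx < heading_end:
            heading_tokens.append(token)  # Token that is part of the heading.
        elif heading_tokens:
            heading_tokens.append(token)  # First token after the heading.
            break
    if not heading_tokens:
        return  # Guard added in this sketch only; the commit indexes unconditionally.
    heading_tokens[0].is_sent_start = True
    heading_tokens[-1].is_sent_start = True
    for token in heading_tokens[1:-1]:
        token.is_sent_start = False


text = "# Title\n\nBody text."
char_idx = line_start_offsets(text)  # [0, 8, 9, 19]
start_line, end_line = 0, 1  # Line range of the heading, as a Markdown parser might report it.
heading_start, heading_end = char_idx[start_line], char_idx[end_line] + 1

# Hand-written tokens with their character offsets in `text`.
tokens = [
    Token("#", 0),
    Token("Title", 2),
    Token("\n\n", 7),
    Token("Body", 9),
    Token("text", 14),
    Token(".", 18),
]
mark_heading(tokens, heading_start, heading_end)
flags = [token.is_sent_start for token in tokens]
print(flags)  # [True, False, False, True, None, None]
```

In this example the heading opens a sentence, its remaining tokens are explicitly marked as not starting one, and `Body` opens the next sentence. Tokens outside the heading keep `None`, so a later sentence segmenter would still be free to decide them.
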
tests/test_split_sentences.py

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+"""Test RAGLite's sentence splitting functionality."""
+
+from pathlib import Path
+
+from raglite._markdown import document_to_markdown
+from raglite._split_sentences import split_sentences
+
+
+def test_split_sentences() -> None:
+    """Test splitting a document into sentences."""
+    doc_path = Path(__file__).parent / "specrel.pdf"  # Einstein's special relativity paper.
+    doc = document_to_markdown(doc_path)
+    sentences = split_sentences(doc)
+    expected_sentences = [
+        "# ON THE ELECTRODYNAMICS OF MOVING BODIES\n\n",
+        "## By A. EINSTEIN June 30, 1905\n\n",
+        "It is known that Maxwell’s electrodynamics—as usually understood at the\npresent time—when applied to moving bodies, leads to asymmetries which do\nnot appear to be inherent in the phenomena. ",  # noqa: RUF001
+        "Take, for example, the recipro-\ncal electrodynamic action of a magnet and a conductor. ",
+        "The observable phe-\nnomenon here depends only on the relative motion of the conductor and the\nmagnet, whereas the customary view draws a sharp distinction between the two\ncases in which either the one or the other of these bodies is in motion. ",
+        "For if the\nmagnet is in motion and the conductor at rest, there arises in the neighbour-\nhood of the magnet an electric field with a certain definite energy, producing\na current at the places where parts of the conductor are situated. ",
+        "But if the\nmagnet is stationary and the conductor in motion, no electric field arises in the\nneighbourhood of the magnet. ",
+        "In the conductor, however, we find an electro-\nmotive force, to which in itself there is no corresponding energy, but which gives\nrise—assuming equality of relative motion in the two cases discussed—to elec-\ntric currents of the same path and intensity as those produced by the electric\nforces in the former case.\n",
+        "Examples of this sort, together with the unsuccessful attempts to discover\nany motion of the earth relatively to the “light medium,” suggest that the\nphenomena of electrodynamics as well as of mechanics possess no properties\ncorresponding to the idea of absolute rest. ",
+        "They suggest rather that, as has\nalready been shown to the first order of small quantities, the same laws of\nelectrodynamics and optics will be valid for all frames of reference for which the\nequations of mechanics hold good.1 ",
+        "We will raise this conjecture (the purport\nof which will hereafter be called the “Principle of Relativity”) to the status\nof a postulate, and also introduce another postulate, which is only apparently\nirreconcilable with the former, namely, that light is always propagated in empty\nspace with a definite velocity c which is independent of the state of motion of the\nemitting body. ",
+        "These two postulates suffice for the attainment of a simple and\nconsistent theory of the electrodynamics of moving bodies based on Maxwell’s\ntheory for stationary bodies. ",  # noqa: RUF001
+        "The introduction of a “luminiferous ether” will\nprove to be superfluous inasmuch as the view here to be developed will not\nrequire an “absolutely stationary space” provided with special properties, nor\n1The preceding memoir by Lorentz was not at this time known to the author.\n\nassign a velocity-vector to a point of the empty space in which electromagnetic\nprocesses take place.\n",
+        "The theory to be developed is based—like all electrodynamics—on the kine-\nmatics of the rigid body, since the assertions of any such theory have to do\nwith the relationships between rigid bodies (systems of co-ordinates), clocks,\nand electromagnetic processes. ",
+        "Insufficient consideration of this circumstance\nlies at the root of the difficulties which the electrodynamics of moving bodies\nat present encounters.\n\n",
+        "## I. KINEMATICAL PART § **1. Definition of Simultaneity**\n\n",
+        "Let us take a system of co-ordinates in which the equations of Newtonian\nmechanics hold good.2 ",
+    ]
+    assert isinstance(sentences, list)
+    assert all(
+        sentence == expected_sentence
+        for sentence, expected_sentence in zip(
+            sentences[: len(expected_sentences)], expected_sentences, strict=True
+        )
+    )
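
The test compares only the first `len(expected_sentences)` sentences. If fewer sentences come back, `zip(..., strict=True)` raises `ValueError`. If a sentence differs, `assert all(...)` will usually report only that the result was false. One possible alternative, sketched here with a hypothetical helper `assert_prefix_equal` that is not part of the commit, reports the index of the first mismatch:

```python
def assert_prefix_equal(actual: list[str], expected: list[str]) -> None:
    """Hypothetical helper: check that `actual` starts with `expected`."""
    # Fail early, with a readable message, if there are too few sentences.
    assert len(actual) >= len(expected), (
        f"got {len(actual)} sentences, expected at least {len(expected)}"
    )
    # zip stops at the shorter list, so only the expected prefix is compared.
    for i, (got, want) in enumerate(zip(actual, expected)):
        assert got == want, f"sentence {i} differs: {got!r} != {want!r}"


# Passes: the first two items match and the extra third item is ignored.
assert_prefix_equal(["# Title\n\n", "Body. ", "More."], ["# Title\n\n", "Body. "])
```

Whether the extra helper is worth it depends on how often the expected output needs updating; the single `assert` in the commit is shorter and checks the same prefix.
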
