How to partition text with Python
The advertools partition function was introduced in version 0.17.0. This tutorial shows how to use it and when you might use it, with a few examples that might be helpful.
What is text partitioning?
And how does it differ from splitting?
The Python standard library already has a str.partition method. It partitions the given text on the first occurrence of the given separator (a plain string, not a regular expression) and returns a 3-tuple as a result:
('What ',
'time',
" is it now? The time is five o'clock. Time is of the essence.")
Note that there are multiple instances of the required pattern “time”, and that the text was partitioned only once, at the first instance.
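As a quick check of the behaviour described above, here is a minimal sketch that uses only the standard library. The sample sentence is reassembled from the tuple shown above, and the variable names are illustrative.

```python
text = "What time is it now? The time is five o'clock. Time is of the essence."

# str.partition splits on the first occurrence only and always returns
# a 3-tuple: (text before, separator, text after).
before, sep, after = text.partition("time")
print((before, sep, after))

# When the separator is absent, the whole text comes back first,
# followed by two empty strings.
missing = text.partition("date")
print(missing)
```

Because the result always has three items, it can be unpacked safely even when the separator is not found.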
Splitting text in Python
With str.split, the text is split at every occurrence of the given pattern, but the pattern itself is removed from the result. In other words, the text is split by those character sequences.
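The difference from partitioning can be seen in a small sketch, again using the same sample sentence and only built-in string methods. The variable names are illustrative.

```python
text = "What time is it now? The time is five o'clock. Time is of the essence."

# str.split cuts at every occurrence and discards the separator itself.
pieces = text.split("time")
print(pieces)

# The separator only comes back if we re-insert it ourselves.
restored = "time".join(pieces)
print(restored == text)
```

The capitalized "Time" stays inside the last piece, because str.split is case-sensitive too.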
Using advertools to partition text
['What',
'time',
'is it now? The',
'time',
"is five o'clock. Time is of the essence."]
We now have the text split by the sequence “time”, and those instances of “time” are preserved. However, there is another “Time” in the string that wasn’t picked up, because it starts with an uppercase “T”.
A very important feature of this function is that it allows you to partition using a regular expression.
['What',
'time',
'is it now? The',
'time',
"is five o'clock.",
'Time',
'is of the essence.']
In many cases you might have special requirements for your regex, and you can use the flags parameter to handle them. For example, the previous result can also be obtained by keeping the plain pattern “time” and passing the re.IGNORECASE flag, so that “Time” is matched as well.
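To make the role of the regex and the flags concrete, here is a standard-library sketch that mimics the outputs shown above. The helper `partition_sketch` is a hypothetical stand-in written for this illustration; it is not the advertools implementation, and it assumes that parts are stripped of surrounding whitespace and that empty parts are dropped, as the outputs above suggest.

```python
import re


def partition_sketch(text, regex, flags=0):
    """Split text on regex matches while keeping the matches (illustrative only)."""
    parts = []
    last_end = 0
    for match in re.finditer(regex, text, flags):
        # Text between the previous match and this one.
        before = text[last_end:match.start()].strip()
        if before:
            parts.append(before)
        matched = match.group().strip()
        if matched:
            parts.append(matched)
        last_end = match.end()
    # Whatever remains after the last match.
    tail = text[last_end:].strip()
    if tail:
        parts.append(tail)
    return parts


text = "What time is it now? The time is five o'clock. Time is of the essence."

case_sensitive = partition_sketch(text, "time")
case_insensitive = partition_sketch(text, "time", flags=re.IGNORECASE)
print(case_sensitive)
print(case_insensitive)
```

With the flag, the match keeps its original capitalization, so “Time” appears in the result exactly as it was written in the text.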
Partitioning examples
Partitioning text into questions and answers
We need to create a regex pattern to match questions. This one uses two of the ready-made regular expressions from the advertools.regex module: one that matches a character that could serve as the end of a sentence, and another that matches a question mark character.
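The idea can be sketched with the standard library alone. The pattern below is a simplified, hand-written assumption (a run of characters that are not sentence endings, followed by a question mark); it is not one of the ready-made expressions from advertools.regex, and the sample text is made up for illustration.

```python
import re

qa_text = (
    "What is advertools? A Python package for marketing. "
    "Is it free? Yes, it is open source."
)

# Assumed pattern: anything up to and including a question mark,
# without crossing ".", "!" or another "?".
question_regex = r"[^.!?]+\?"

# A capturing group makes re.split keep the matched questions.
parts = [p.strip() for p in re.split(f"({question_regex})", qa_text) if p.strip()]

# Pair each question with the text that follows it.
qa_pairs = list(zip(parts[::2], parts[1::2]))
for question, answer in qa_pairs:
    print(f"Q: {question}\nA: {answer}\n")
```

The pairing step only works if the text starts with a question and every question is followed by an answer; real text would likely need extra checks.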
Partitioning a Python tutorial into code and narrative sections
import re
import advertools as adv
py_code_regex = "```python.*?```"  # non-greedy, so each code block is matched separately
print(*adv.partition(tutorial, py_code_regex, re.DOTALL), sep="\n--------\n")
This is how you import advertools:
--------
```python
import advertools as adv
```
--------
This is how you crawl a website with advertools:
--------
```python
adv.crawl(url_list="https://example.com", output_file="output.jsonl", follow_links=True)
```
--------
I hope you enjoyed the tutorial.
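Two details of the pattern above are worth a closer look: the re.DOTALL flag and the non-greedy `.*?`. The following standard-library sketch shows what happens without each of them. The shortened tutorial string is an assumption reconstructed from the output above, and the fence marker is built programmatically only to avoid nesting literal fences inside this example.

```python
import re

FENCE = "`" * 3  # three backticks

tutorial = (
    "This is how you import advertools:\n"
    f"{FENCE}python\nimport advertools as adv\n{FENCE}\n"
    "This is how you print a greeting:\n"
    f"{FENCE}python\nprint('hello')\n{FENCE}\n"
    "I hope you enjoyed the tutorial.\n"
)

lazy_regex = f"{FENCE}python.*?{FENCE}"
greedy_regex = f"{FENCE}python.*{FENCE}"

# Without re.DOTALL, "." does not match newlines, so multi-line blocks are missed.
no_dotall = re.findall(lazy_regex, tutorial)

# With re.DOTALL and a non-greedy quantifier, each block is matched on its own.
lazy = re.findall(lazy_regex, tutorial, flags=re.DOTALL)

# A greedy quantifier runs from the first opening fence to the last closing one.
greedy = re.findall(greedy_regex, tutorial, flags=re.DOTALL)

print(len(no_dotall), len(lazy), len(greedy))
```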
Partitioning a Markdown string into headings and regular text
markdown = """
# The title of the article
Here is some intro text.
## First heading
Some info about the first heading.
## Second heading
Now that you read the first, here's the second heading info.
This paragraph is slightly longer, just showing that paragraphs can work as well.
Let's takle the final conclusion.
## Conclusion
I hope you enjoyed reading
"""
['# The title of the article',
'Here is some intro text.',
'## First heading',
'Some info about the first heading.',
'## Second heading',
"Now that you read the first, here's the second heading info.\nThis paragraph is slightly longer, just showing that paragraphs can work as well.\nLet's takle the final conclusion.",
'## Conclusion',
'I hope you enjoyed reading']
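A result of this shape can be reproduced with the standard library, as in the sketch below. The heading pattern and the re.MULTILINE flag are assumptions chosen for this illustration, since the exact regex used for the output above is not shown; the Markdown sample is shortened.

```python
import re

markdown = """
# Title
Intro text.
## First heading
Line one.
Line two.
"""

# Assumed pattern: one or more "#" at the start of a line, a space,
# then the rest of that line. re.MULTILINE lets ^ and $ match per line.
heading_regex = r"^#+ .*$"

md_parts = [
    part.strip()
    for part in re.split(f"({heading_regex})", markdown, flags=re.MULTILINE)
    if part.strip()
]
print(md_parts)
```

This simple pattern would also match a line starting with "# " inside a code block, such as a Python comment, so real documents may need a stricter approach.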
Generating chunks from a split markdown list
Using the batched function from itertools (available in Python 3.12 and later), we can go through a succession of n items at a time in the given iterable.
In this case we want to capture every two successive items: [(heading, text), (heading, text), ...]
[('# The title of the article', 'Here is some intro text.'),
('## First heading', 'Some info about the first heading.'),
('## Second heading',
"Now that you read the first, here's the second heading info.\nThis paragraph is slightly longer, just showing that paragraphs can work as well.\nLet's takle the final conclusion."),
('## Conclusion', 'I hope you enjoyed reading')]
More readable:
for i, (heading, text) in enumerate(batched(md_parts, 2), 1):
    print(f"Chunk: {i}")
    print(heading)
    print(text)
    print("-----")
Chunk: 1
# The title of the article
Here is some intro text.
-----
Chunk: 2
## First heading
Some info about the first heading.
-----
Chunk: 3
## Second heading
Now that you read the first, here's the second heading info.
This paragraph is slightly longer, just showing that paragraphs can work as well.
Let's takle the final conclusion.
-----
Chunk: 4
## Conclusion
I hope you enjoyed reading
-----
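For Python versions older than 3.12, where itertools.batched is not available, a small fallback can stand in for it. The sketch below is one possible approach rather than a definitive one: the fallback function and the dictionary keys "heading" and "text" are illustrative choices, and the list of parts is a shortened sample.

```python
from itertools import islice

try:
    from itertools import batched  # available in Python 3.12+
except ImportError:
    def batched(iterable, n):
        """Fallback: yield tuples of up to n items from iterable."""
        iterator = iter(iterable)
        while batch := tuple(islice(iterator, n)):
            yield batch

md_parts = ["# Title", "Intro text.", "## First heading", "Some info."]

# Turn each (heading, text) pair into a small record.
chunks = [
    {"heading": heading, "text": text}
    for heading, text in batched(md_parts, 2)
]
print(chunks)
```

The unpacking into heading and text assumes an even number of parts. If the document ends with a heading that has no text after it, the last batch holds a single item and would need separate handling.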