How to partition text with Python

text
markdown
advertools
python
A tutorial on how to use the advertools partition function to flexibly partition text using regular expressions.
Published

July 10, 2025

The advertools partition function was introduced in version 0.17.0. This tutorial covers how to use it, when you might want to, and a few examples that might be helpful.

What is text partitioning?

And how does it differ from splitting?

text = "What time is it now? The time is five o'clock. Time is of the essence."

The Python standard library already has a str.partition method. It splits the text at the first occurrence of the given separator and returns a 3-tuple: the part before the separator, the separator itself, and the part after it:

text.partition("time")
('What ',
 'time',
 " is it now? The time is five o'clock. Time is of the essence.")

Note that there are multiple instances of the separator “time”, but the text was partitioned only once, at the first occurrence.
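Two related standard-library behaviors are worth knowing when relying on str.partition. The short sketch below uses the same text, with "date" as an arbitrary separator that does not occur in it:

```python
text = "What time is it now? The time is five o'clock. Time is of the essence."

# Separator absent: the whole string comes back first, followed by two empty strings.
missing = text.partition("date")
print(missing)

# str.rpartition splits on the last occurrence instead of the first.
last = text.rpartition("time")
print(last)
```

In both cases the result is still a 3-tuple, so code that unpacks it into three names keeps working.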

Splitting text in Python

text.split("time")
['What ', ' is it now? The ', " is five o'clock. Time is of the essence."]

Now the text is split at every occurrence of the separator, but the separator itself is removed from the result.
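The standard library can also keep the separators: when the pattern passed to re.split is wrapped in a capturing group, the matched text is returned between the pieces. A minimal sketch using only the re module:

```python
import re

text = "What time is it now? The time is five o'clock. Time is of the essence."

# A capturing group around the pattern makes re.split keep the separators.
parts = re.split(r"(time)", text)
print(parts)
```

The pieces keep their surrounding whitespace, so call str.strip on them if that matters.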

Using advertools to partition text

import advertools as adv

adv.partition(text, "time")
['What',
 'time',
 'is it now? The',
 'time',
 "is five o'clock. Time is of the essence."]

The text is now split at every occurrence of “time”, and those occurrences are preserved in the result. The last “Time” was not picked up, however, because it starts with an uppercase “T”.

A very important feature of this function is that it allows you to partition using a regular expression.

adv.partition(text, "[Tt]ime")
['What',
 'time',
 'is it now? The',
 'time',
 "is five o'clock.",
 'Time',
 'is of the essence.']

When your regex needs special behavior, you can pass regex flags through the flags parameter. The previous code can be rewritten using the re.IGNORECASE flag as follows:

import re

adv.partition(text, "time", re.IGNORECASE)
['What',
 'time',
 'is it now? The',
 'time',
 "is five o'clock.",
 'Time',
 'is of the essence.']
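To make the behavior concrete, here is a rough standard-library sketch of the same idea. The name partition_sketch is hypothetical and this is not the advertools implementation. Stripping whitespace and dropping empty parts are assumptions inferred from the outputs shown above.

```python
import re


def partition_sketch(text, regex, flags=0):
    """Split text on regex matches, keeping the matches as separate items.

    Hypothetical helper, not the advertools implementation. Stripping
    whitespace and dropping empty parts are assumptions based on the
    outputs shown in this tutorial.
    """
    parts = []
    position = 0
    for match in re.finditer(regex, text, flags):
        parts.append(text[position:match.start()])  # text before the match
        parts.append(match.group(0))                # the match itself
        position = match.end()
    parts.append(text[position:])                   # text after the last match
    stripped = (part.strip() for part in parts)
    return [part for part in stripped if part]


text = "What time is it now? The time is five o'clock. Time is of the essence."
result = partition_sketch(text, "time", re.IGNORECASE)
print(result)
```

Walking the matches with re.finditer keeps track of where each match starts and ends, which is what allows the text between matches to be collected alongside the matches themselves.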

Partitioning examples

Partitioning text into questions and answers

qa = """
How are you?
I'm good.

What are you doing?
I'm having a coffee.

What is your plan for today?
I'm going to see some friends.

"""

We need to create a regex pattern to match questions. This one uses two of the ready-made regular expressions from the advertools.regex module: one that matches a character that could serve as the end of a sentence, and another that matches a question mark character.

adv.regex.SENTENCE_END
'[!¡՜߹᥄‼⁈⁉︕﹗!𖺚𞥞.։۔܁܂።᙮᠃᠉⳹⳾⸼。꓿꘎꛳︒﹒.。𖫵𖺘𛲟𝪈?¿;՞؟፧᥅⁇⁈⁉⳺⳻⸮꘏꛷︖﹖?𑅃𞥟ʔ‽]'
adv.regex.QUESTION_MARK_RAW
'[?¿;՞؟፧᥅⁇⁈⁉⳺⳻⸮꘏꛷︖﹖?𑅃𞥟ʔ‽]'
question_regex = rf"(?=){adv.regex.SENTENCE_END}?.*?{adv.regex.QUESTION_MARK_RAW}"
qa_parts = adv.partition(qa, question_regex, re.MULTILINE)
print(*qa_parts, sep="\n-----\n")
How are you?
-----
I'm good.
-----
What are you doing?
-----
I'm having a coffee.
-----
What is your plan for today?
-----
I'm going to see some friends.
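Once the parts alternate between question and answer, they can be paired up with plain slicing. The sketch below writes the parts out literally so it runs on its own, and it assumes strict alternation (every question followed by exactly one answer):

```python
# The parts as printed above, written out literally so the example is self-contained.
qa_parts = [
    "How are you?",
    "I'm good.",
    "What are you doing?",
    "I'm having a coffee.",
    "What is your plan for today?",
    "I'm going to see some friends.",
]

# Even positions hold questions, odd positions hold answers.
qa_pairs = dict(zip(qa_parts[::2], qa_parts[1::2]))

for question, answer in qa_pairs.items():
    print(f"Q: {question}\nA: {answer}\n")
```

If a question might go unanswered, or an answer might span several parts, this simple pairing would misalign and a more careful grouping would be needed.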

Partitioning a Python tutorial into code and narrative sections

tutorial = """

This is how you import advertools:

```python
import advertools as adv
```

This is how you crawl a website with advertools:

```python
adv.crawl(url_list="https://example.com", output_file="output.jsonl", follow_links=True)
```

I hope you enjoyed the tutorial.

"""
py_code_regex = "```python.*?```"

print(*adv.partition(tutorial, py_code_regex, re.DOTALL), sep="\n--------\n")
This is how you import advertools:
--------
```python
import advertools as adv
```
--------
This is how you crawl a website with advertools:
--------
```python
adv.crawl(url_list="https://example.com", output_file="output.jsonl", follow_links=True)
```
--------
I hope you enjoyed the tutorial.
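A natural next step is to label each part as code or narrative, for example to run or lint only the code. The helper label_part below is hypothetical, and the parts are a shortened, hand-written version of the output above. The fence string is built from single backticks only to avoid nesting fences inside this example.

```python
fence = "`" * 3  # three backticks, built this way to avoid nesting fences here

# A shortened version of the parts printed above, so the example runs on its own.
parts = [
    "This is how you import advertools:",
    f"{fence}python\nimport advertools as adv\n{fence}",
    "I hope you enjoyed the tutorial.",
]


def label_part(part):
    """Return ("code", source) for fenced Python blocks, else ("narrative", text)."""
    opening = fence + "python"
    if part.startswith(opening) and part.endswith(fence):
        # Drop the opening and closing fences, keep only the source code.
        source = part[len(opening):-len(fence)].strip()
        return ("code", source)
    return ("narrative", part)


labeled = [label_part(part) for part in parts]
for kind, content in labeled:
    print(f"{kind}: {content}")
```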

Partitioning a markdown string into headings and regular text

markdown = """

# The title of the article

Here is some intro text.

## First heading

Some info about the first heading.

## Second heading

Now that you read the first, here's the second heading info.
This paragraph is slightly longer, just showing that paragraphs can work as well.
Let's takle the final conclusion.

## Conclusion

I hope you enjoyed reading

"""
md_regex = "^#+ .*?$"
md_parts = adv.partition(markdown, md_regex, re.MULTILINE)
md_parts
['# The title of the article',
 'Here is some intro text.',
 '## First heading',
 'Some info about the first heading.',
 '## Second heading',
 "Now that you read the first, here's the second heading info.\nThis paragraph is slightly longer, just showing that paragraphs can work as well.\nLet's takle the final conclusion.",
 '## Conclusion',
 'I hope you enjoyed reading']
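With headings isolated as their own items, the heading level can be read from the number of leading # characters. The function heading_level is a hypothetical helper, and the list is a shortened copy of the output above:

```python
import re

# A shortened copy of the parts shown above, so the example runs on its own.
md_parts = [
    "# The title of the article",
    "Here is some intro text.",
    "## First heading",
    "Some info about the first heading.",
]

heading_pattern = re.compile(r"^(#+) (.*)$")


def heading_level(part):
    """Return (level, title) for a heading line, or None for regular text."""
    match = heading_pattern.match(part)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


levels = [heading_level(part) for part in md_parts]
print(levels)
```

Keeping the level alongside the title makes it possible to rebuild an outline or to filter for top-level sections only.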

Generating chunks from a split markdown list

Using the batched function from itertools (available since Python 3.12), we can iterate over the given iterable in successive batches of n items.

In this case we want to capture every two successive items: [(heading, text), (heading, text), ...]

from itertools import batched

list(batched(md_parts, 2))
[('# The title of the article', 'Here is some intro text.'),
 ('## First heading', 'Some info about the first heading.'),
 ('## Second heading',
  "Now that you read the first, here's the second heading info.\nThis paragraph is slightly longer, just showing that paragraphs can work as well.\nLet's takle the final conclusion."),
 ('## Conclusion', 'I hope you enjoyed reading')]
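On Python versions that lack itertools.batched, a small stand-in can be written with islice. The name batched_fallback is hypothetical, and the sketch follows the general approach of the recipe in the itertools documentation:

```python
from itertools import islice


def batched_fallback(iterable, n):
    """Yield tuples of up to n items, similar to itertools.batched."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # Keep pulling n items at a time until the iterator is exhausted.
    while batch := tuple(islice(iterator, n)):
        yield batch


md_sample = ["# Title", "Intro text.", "## First heading", "Some info."]
chunks = list(batched_fallback(md_sample, 2))
print(chunks)
```

As with batched, the final tuple is shorter than n when the number of items is not a multiple of n.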

The same chunks, printed in a more readable form:

for i, (heading, text) in enumerate(batched(md_parts, 2), 1):
    print(f"Chunk: {i}")
    print(heading)
    print(text)
    print("-----")
Chunk: 1
# The title of the article
Here is some intro text.
-----
Chunk: 2
## First heading
Some info about the first heading.
-----
Chunk: 3
## Second heading
Now that you read the first, here's the second heading info.
This paragraph is slightly longer, just showing that paragraphs can work as well.
Let's takle the final conclusion.
-----
Chunk: 4
## Conclusion
I hope you enjoyed reading
-----
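Pairing items two at a time assumes the parts strictly alternate between heading and text. That breaks if the document starts with text before any heading, or if a heading has no text under it. The sketch below is one possible way to handle both cases. The helper chunk_by_heading and the sample list are hypothetical.

```python
import re

heading_pattern = re.compile(r"^#+ ")


def chunk_by_heading(parts):
    """Group partitioned parts into [heading, text] chunks.

    Hypothetical helper: text before the first heading gets an empty
    heading, and a heading with no text gets an empty text.
    """
    chunks = []
    for part in parts:
        if heading_pattern.match(part):
            chunks.append([part, ""])      # a heading starts a new chunk
        elif chunks and chunks[-1][1] == "":
            chunks[-1][1] = part           # fill the text of the open chunk
        else:
            chunks.append(["", part])      # text that has no heading
    return chunks


parts = ["Some preamble.", "# Title", "## Empty section", "## Last", "Closing text."]
chunks = chunk_by_heading(parts)
for heading, text in chunks:
    print(repr(heading), "->", repr(text))
```

Checking each part against the heading pattern, instead of relying on its position, keeps headings and text aligned even when the input is irregular.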