The concept of wpautop is simple:
Basically it takes PHP’s nl2br function to the logical next step and converts double line breaks to paragraphs where applicable, does line breaks as before, and best of all it’s aware of block-level HTML tags so it won’t mess up your page.
In practice, this process is not simple, and wpautop has lacked a clear spec detailing what it should and should not do in a variety of cases. I propose that wpautop should return the same markup that the following algorithm creates.
Definitions
Here, as in CSS 2.1, an element is “block-level” if its “display” property in the default stylesheet for HTML is ‘block’, ‘list-item’, ‘run-in’ or ‘table’. A br element is never block-level.
An “inline sequence” is a maximal run of consecutive childNodes, none of which is a block-level element. E.g. consider:
<div>Hello<!-- foo --><br><a href="foo">World</a>
 <p>Bar</p>
</div>
The div contains two inline sequences:
- [text “Hello”][comment “ foo ”][element br][element a][text “\n ”]
- [text “\n”]
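To make the definition concrete, here is a minimal sketch in Python that groups a node's children into inline sequences. It assumes xml.dom.minidom as a stand-in for the DOM, so the input has to be well-formed XML (hence `<br/>`). The `BLOCK_LEVEL` set and the names `is_block_level` and `inline_sequences` are my own, and the set is only a partial list of block-level tags.

```python
from xml.dom import minidom, Node

# Assumption: a partial, hand-picked list of tags whose default display is
# block, list-item or table. A real implementation would need the full list.
BLOCK_LEVEL = {
    "address", "article", "aside", "blockquote", "details", "div", "dl",
    "fieldset", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "hr", "li", "ol", "p", "pre", "section", "table", "ul",
}


def is_block_level(node):
    """True for element nodes in BLOCK_LEVEL; br is not in the set."""
    return node.nodeType == Node.ELEMENT_NODE and node.tagName.lower() in BLOCK_LEVEL


def inline_sequences(parent):
    """Return the maximal runs of consecutive non-block-level child nodes."""
    sequences, current = [], []
    for child in parent.childNodes:
        if is_block_level(child):
            if current:
                sequences.append(current)
                current = []
        else:
            current.append(child)
    if current:
        sequences.append(current)
    return sequences


doc = minidom.parseString(
    '<div>Hello<!-- foo --><br/><a href="foo">World</a>\n <p>Bar</p>\n</div>'
)
for sequence in inline_sequences(doc.documentElement):
    print([node.nodeName for node in sequence])
# ['#text', '#comment', 'br', 'a', '#text']
# ['#text']
```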
Guiding Principle: wpautop should limit alterations to specifically designated elements.
The first argument is a string of markup representing the contents of a document fragment.
1. Replace each instance of “\r\n” and “\r” with “\n”, and create a DOM document object with the fragment contents inside the body element.
2. Normalize the body element (merging adjacent text nodes), then find every element in [body, div, blockquote, section, article, aside, header, footer, details, li, td].
3. In each inline sequence found in each element:
3a. From left to right, trim leading whitespace from text nodes until non-whitespace or a child element is found. From right to left, trim trailing whitespace from text nodes until non-whitespace or a child element is found. Comment nodes are skipped, and text nodes left empty are removed.
E.g. Consider: [text “\n\n ”][comment “ foo ”][text “ Hello”][element b][text “! \n”]
After 3a, the sequence would be: [comment “ foo ”][text “Hello”][element b][text “!”]
3b. Place a new autop element directly before the sequence, and move the sequence into the autop.
4. Replace any autop element that contains only whitespace and/or comment nodes with its contents.
5. Among the childNodes of each autop element, find each text node that contains a run of two or more consecutive “\n” characters. Remove the run and split the autop element into two autop elements at that point. E.g.:
<autop>Foo <b>bar</b>

 baz gaz</autop>
Becomes:
<autop>Foo <b>bar</b></autop><autop> baz gaz</autop>
6. Consider each element in [div, li, td]. If it contains exactly one autop element, and no other block-level elements, replace the autop element with its contents.
7. Replace each autop element with a p element with the same contents.
If the second argument is false, skip to step 9.
8. Find recognized HTML5 elements that may contain text content, and which are not in [pre, script, style].
8a. In each element, split each text node at every “\n”, replacing the “\n” with a new br element between the resulting text nodes. E.g.:
<b>Hello

World!</b>
becomes:
<b>Hello<br /><br />World!</b>
9. Serialize the document object and return the markup contents of body.
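Steps 3a, 3b and 4 are the fiddliest to describe in words, so here is a rough Python sketch of how they might look. It assumes xml.dom.minidom as the DOM (so the input must be well-formed XML), and it assumes “whitespace” means the characters in `WS`. The function names `trim_sequence`, `wrap_in_autop` and `unwrap_if_empty` are hypothetical, not part of wpautop.

```python
from xml.dom import minidom, Node

WS = " \t\n\r\f"  # assumption: what counts as whitespace for trimming


def trim_sequence(nodes):
    """Step 3a: trim both ends of an inline sequence (a list of siblings).

    Comments are skipped. Trimming stops at the first element, or at the
    first text node that still has content after trimming.
    """
    nodes = list(nodes)
    for ordered, strip in ((nodes, str.lstrip), (nodes[::-1], str.rstrip)):
        for node in ordered:
            if node.nodeType == Node.ELEMENT_NODE:
                break
            if node.nodeType == Node.TEXT_NODE:
                node.data = strip(node.data, WS)
                if node.data:
                    break
    kept = []
    for node in nodes:
        if node.nodeType == Node.TEXT_NODE and not node.data:
            node.parentNode.removeChild(node)  # trimmed down to nothing
        else:
            kept.append(node)
    return kept


def wrap_in_autop(nodes):
    """Step 3b: insert an autop before the sequence, then move the sequence in."""
    first = nodes[0]
    autop = first.ownerDocument.createElement("autop")
    first.parentNode.insertBefore(autop, first)
    for node in nodes:
        autop.appendChild(node)  # appendChild detaches the node from its old parent
    return autop


def unwrap_if_empty(autop):
    """Step 4: unwrap an autop that holds only whitespace and/or comments."""
    for child in autop.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            return False
        if child.nodeType == Node.TEXT_NODE and child.data.strip(WS):
            return False
    parent = autop.parentNode
    while autop.firstChild is not None:
        parent.insertBefore(autop.firstChild, autop)
    parent.removeChild(autop)
    return True


doc = minidom.parseString("<div>\n\n <!-- foo --> Hello<b>World</b>! \n</div>")
div = doc.documentElement
sequence = trim_sequence(div.childNodes)
if sequence:  # a sequence trimmed to nothing needs no autop
    wrap_in_autop(sequence)
print(div.toxml())
# <div><autop><!-- foo -->Hello<b>World</b>!</autop></div>
```

Skipping the wrap for a sequence that trims down to nothing is a shortcut on my part. It should give the same result as wrapping it and then unwrapping it again in step 4.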
Notes
One limitation of this algorithm is that, inside inline elements, p elements cannot be split (instead two br elements would be created). Allowing splits inside inlines would require a complex algorithm to close all open inline elements and re-open them in the following p. IMO, wpautop should avoid taking on this requirement.
The current function crudely “splits” inlines by leaving an unmatched closing p tag after the first inline sequence, causing browsers to render an empty p element in the DOM between the two sequences.
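The limitation is easier to see in code. Below is a minimal Python sketch of steps 5 and 8a, again assuming xml.dom.minidom as the DOM and using hypothetical names (`split_autop`, `newlines_to_br`). Because step 5 only looks at text nodes that are direct children of the autop, a double line break inside an inline element falls through to step 8a and becomes two br elements.

```python
import re
from xml.dom import minidom, Node


def split_autop(autop):
    """Step 5: split an autop at each run of 2+ newlines in a direct child text node."""
    doc = autop.ownerDocument
    for child in autop.childNodes:
        if child.nodeType != Node.TEXT_NODE:
            continue  # text inside inline children is deliberately ignored
        match = re.search(r"\n{2,}", child.data)
        if match is None:
            continue
        tail = doc.createElement("autop")
        autop.parentNode.insertBefore(tail, autop.nextSibling)
        after = child.data[match.end():]
        if after:
            tail.appendChild(doc.createTextNode(after))
        while child.nextSibling is not None:
            tail.appendChild(child.nextSibling)  # move trailing siblings over
        child.data = child.data[:match.start()]
        if not child.data:
            autop.removeChild(child)
        return [autop] + split_autop(tail)  # the tail may need further splits
    return [autop]


def newlines_to_br(element):
    """Step 8a: replace each newline in the element's own text nodes with a br."""
    doc = element.ownerDocument
    for child in list(element.childNodes):
        if child.nodeType != Node.TEXT_NODE or "\n" not in child.data:
            continue
        for index, part in enumerate(child.data.split("\n")):
            if index:
                element.insertBefore(doc.createElement("br"), child)
            if part:
                element.insertBefore(doc.createTextNode(part), child)
        element.removeChild(child)


# A double line break directly inside the autop: the autop is split.
doc = minidom.parseString("<body><autop>Foo <b>bar</b>\n\n baz gaz</autop></body>")
split_autop(doc.documentElement.firstChild)
print(doc.documentElement.toxml())
# <body><autop>Foo <b>bar</b></autop><autop> baz gaz</autop></body>

# The same break inside an inline element: no split, two br elements instead.
doc = minidom.parseString("<body><autop><b>Foo\n\nbar</b></autop></body>")
autop = doc.documentElement.firstChild
split_autop(autop)
newlines_to_br(autop.firstChild)
print(doc.documentElement.toxml())
# <body><autop><b>Foo<br/><br/>bar</b></autop></body>
```

minidom serializes empty elements as `<br/>` rather than `<br />`. That is a serializer detail and not something the algorithm depends on.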