Topic: How to strip out these headers?

I have this wiki-markup in a string:

=Places=
=Eating=
==Fast Food==
===McDonald's===
===Burger King===
Burger King is an American restaurant chain that....

Now, when I pass that string to Wikitext::Parser.new.parse, I get back the following HTML:

<h1>Places</h1>
<h1>Eating</h1>
<h2>Fast Food</h2>
<h3>McDonald's</h3>
<h3>Burger King</h3>
<p>Burger King is an American restaurant chain that....</p>

This is fine, but what I want to do is remove the headers that don't have a paragraph immediately following them.  So I should end up with:

<h3>Burger King</h3>
<p>Burger King is an American restaurant chain that....</p>

Also, the headings could appear in any mixture: an H2 could follow an H3, an H4 could follow an H1, and an H1 could follow a lower-level header like an H4.

Any suggestions?

Thanks!

Last edited by cbmeeks (2010-05-03 09:55:13)

Re: How to strip out these headers?

You could use an HTML parser (like Nokogiri or Hpricot) and then write code that removes headers according to your rules. Or you could work directly on your wiki-markup string with regexps.
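
A rough sketch of the regexp route, working on the markup before it is converted. It assumes every heading sits on its own line in the =Title= form shown above, and that blank lines between a heading and its text don't count as a gap. The HEADER pattern and the strip_empty_headers name are made up for illustration; they are not part of Wikitext.

```ruby
# Matches a wiki heading line such as "=Places=" or "===Burger King===".
# Assumption: the same number of "=" characters opens and closes the heading.
HEADER = /\A\s*(=+)[^=].*?\1\s*\z/

# Hypothetical helper: drops every heading line whose next non-blank line
# is another heading (or that has nothing after it at all).
def strip_empty_headers(markup)
  lines = markup.lines.map(&:chomp)
  kept = []
  lines.each_with_index do |line, i|
    if line =~ HEADER
      # Look ahead to the first non-blank line after this heading.
      following = lines[(i + 1)..-1].find { |l| !l.strip.empty? }
      next if following.nil? || following =~ HEADER
    end
    kept << line
  end
  kept.join("\n")
end

markup = <<~WIKI
  =Places=
  =Eating=
  ==Fast Food==
  ===McDonald's===
  ===Burger King===
  Burger King is an American restaurant chain that....
WIKI

puts strip_empty_headers(markup)
```

The result can then be handed to the Wikitext parser as before. Note that this only looks at the very next non-blank line, so a heading followed by a list or a table (rather than plain text) is kept as well.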

Re: How to strip out these headers?

Right, that's actually what I'm doing.  I'm using Nokogiri to parse the text.  Then, under each section, I call Wikitext to convert the markup to HTML.

The problem is that the HTML contains "too much" information.  It contains headers that don't have any paragraph data to follow. 

I really don't care where I strip the empty headers.  It could be before I convert to HTML or after.

So if I have:

=Header1=
=Header2=
Only header 2 contains data...which is this string you are reading.

I currently get back:
<h1>Header1</h1>
<h1>Header2</h1>
<p>Only header 2 contains data...which is this string you are reading.</p>

I only want Header2 and its content in this example.  I should only get:
<h1>Header2</h1>
<p>Only header 2 contains data...which is this string you are reading.</p>

That's because Header2 contains non-header text just afterwards while Header1 contains nothing (other than Header2).

Hope that makes sense.

Thanks.

Re: How to strip out these headers?

OK, sorry for my useless answer :)

I don't know Nokogiri, but is it possible with Nokogiri to check whether the tag following a header is a paragraph or not? If so, it would be easy to mark for deletion any header not followed by a paragraph.