php - How can I get the Plain text AND the HTML of a DOM element created from XML?
We have thousands of closed caption XML files that we have to import into a database as plain text, while preserving the HTML markup for conversion to another CC format. I have been able to extract the plain text quite easily, but I can't seem to find the correct way of extracting the raw HTML as well.
Is there a way to accomplish "->htmlContent" in the same way that ->textContent works below?
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
$xml = @file_get_contents('http://blah-blah-blah/16th.xml', 0, $ctx);
$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName("p");
foreach ($ptags as $p) {
    $text = $p->textContent;
}
A typical <p> being processed:
<p begin="00:00:14.83" end="00:00:18.83" tts:textalign="left"> <metadata ccrow="12" cccol="8"/> (male narrator)<br></br> 16th , 17th centuries<br></br> formative 200 years </p>
Successful ->textContent result:
(male narrator) 16th , 17th centuries formative 200 years
Desired HTML result:
(male narrator)<br></br> 16th , 17th centuries<br></br> formative 200 years
In other words, you want to save only specific nodes: the br elements and the text nodes. You can do that with DOM + XPath:
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXML($html);
$xpath = new DOMXPath($document);
foreach ($xpath->evaluate('//p') as $p) {
    $content = '';
    // Serialize only the br elements and text nodes, in document order.
    foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    var_dump($content);
}
Output:
string(86) " (male narrator)<br> 16th , 17th centuries<br> formative 200 years "
The XPath expression
Any descendant br: .//br
Any descendant text node: .//text()
Combined expression: .//br|.//text()
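Since the question asks for both the plain text and the markup, the two can be collected in the same loop. The following is only a sketch: the sample XML is a shortened stand-in for the <p> element above, and the helper name captionMarkup() and the $rows array are hypothetical, not part of the DOM extension.

```php
<?php
// Hypothetical sample input, modelled on the <p> element from the question.
$xml = '<body><p begin="00:00:14.83" end="00:00:18.83">'
     . '<metadata ccrow="12" cccol="8"/>'
     . '(male narrator)<br></br>16th , 17th centuries<br></br>formative 200 years'
     . '</p></body>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Hypothetical helper: serialize only the br elements and text nodes
// found below $context, in document order.
function captionMarkup(DOMXPath $xpath, DOMNode $context): string
{
    $markup = '';
    foreach ($xpath->evaluate('.//br|.//text()', $context) as $node) {
        $markup .= $node->ownerDocument->saveHTML($node);
    }
    return $markup;
}

$rows = [];
foreach ($xpath->evaluate('//p') as $p) {
    $rows[] = [
        'text' => $p->textContent,          // plain text, markup dropped
        'html' => captionMarkup($xpath, $p), // text plus br elements
    ];
}

var_dump($rows);
```

If the XML form of the tags is needed instead of the HTML form, $document->saveXML($node) can be used in place of saveHTML($node) inside the helper.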
Namespaces
If your XML uses namespaces, you have to register them on the DOMXPath object and use the registered prefix in the expressions.
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXML($html);
$xpath = new DOMXPath($document);
// The prefix is local to the XPath expressions; only the namespace URI has to match the document.
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');
foreach ($xpath->evaluate('//tt:p') as $p) {
    $content = '';
    foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    var_dump($content);
}
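With thousands of files, the namespace URI may not be the same in every one, and some may have none. One possible approach, sketched below, is to read the URI from the document element instead of hard-coding it. It assumes the caption elements share the namespace of the root element; the sample XML and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample input with a default namespace on the root element.
$xml = '<tt xmlns="http://www.w3.org/2006/04/ttaf1"><body><div>'
     . '<p begin="00:00:14.83" end="00:00:18.83">'
     . '(male narrator)<br/>16th , 17th centuries</p>'
     . '</div></body></tt>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Assumption: p and br live in the same namespace as the root element.
$namespaceUri = $document->documentElement->namespaceURI;

if ($namespaceUri !== null && $namespaceUri !== '') {
    // Register whatever URI this file uses under the local prefix "tt".
    $xpath->registerNamespace('tt', $namespaceUri);
    $paragraphs = '//tt:p';
    $parts = './/tt:br|.//text()';
} else {
    // No namespace: fall back to the unprefixed expressions.
    $paragraphs = '//p';
    $parts = './/br|.//text()';
}

$results = [];
foreach ($xpath->evaluate($paragraphs) as $p) {
    $content = '';
    foreach ($xpath->evaluate($parts, $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    $results[] = $content;
}

var_dump($results);
```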