php - Function sometimes skipping first lines when reading Word doc


I've been using a useful tool for reading Word documents, submitted as the accepted answer here: How to extract text from word file .doc, .docx, .xlsx, .pptx in PHP

It works quite well, apart from the fact that it omits the first few lines of text in .doc files.

Here is the function that reads a .doc file:

private function read_doc() {
    $filehandle = fopen($this->filename, "r");
    $line = @fread($filehandle, filesize($this->filename));
    // Split the raw bytes on carriage returns (0x0D).
    $lines = explode(chr(0x0d), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== false) || (strlen($thisline) == 0)) {
            // Skip empty chunks and chunks containing a NUL byte.
        } else {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

It seems that the issue is in this part:

$pos = strpos($thisline, chr(0x00));
if (($pos !== false) || (strlen($thisline) == 0))

While this correctly removes the parts of the document that aren't text content, it also seems to be responsible for removing the first line of text content.
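The suspected cause can be reproduced on a made-up byte string rather than a real .doc file. This is only a sketch under that assumption: the sample data and the helper name `keep_chunks` are hypothetical, and real files will look different. If the chunk before the first 0x0D holds binary data followed by the first line of text, the NUL test rejects the whole chunk, text included.

```php
<?php
// Hypothetical sample: binary header bytes run straight into the first
// paragraph, and each paragraph ends with 0x0D (carriage return).
$raw = "\x00\x01\x00First line" . chr(0x0d) . "Second line" . chr(0x0d);

// Same filter as in read_doc(): drop empty chunks and any chunk that
// contains a NUL byte. (keep_chunks is an illustrative name.)
function keep_chunks(string $raw): array
{
    $kept = [];
    foreach (explode(chr(0x0d), $raw) as $chunk) {
        if (strpos($chunk, chr(0x00)) !== false || strlen($chunk) == 0) {
            continue; // the whole chunk is discarded, trailing text included
        }
        $kept[] = $chunk;
    }
    return $kept;
}

$kept = keep_chunks($raw);
print_r($kept);
```

Only "Second line" survives here, because "First line" shares its chunk with the NUL bytes that precede it.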

How can the function be amended to avoid this problem when reading .doc files?

I came up with the following workaround, which seems to do the trick. I used strrpos instead of strpos to find the last occurrence of the 0x00 character in the line, because any text after that point in the line is text content. If the line is the last bit of document coding before the content starts, the function adds the text part of that line to the output.

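private function read_doc() {
    $filehandle = fopen($this->filename, "r");
    $line = @fread($filehandle, filesize($this->filename));
    $lines = explode(chr(0x0d), $line);
    $outtext = "";
    $content_started = false;
    // Initialised so the first iteration does not read undefined variables.
    $lastline = "";
    $lastpos = 0;
    foreach ($lines as $thisline) {
        // Position of the last NUL byte in this chunk, or false if none.
        $pos = strrpos($thisline, chr(0x00));
        if (($pos !== false) || (strlen($thisline) == 0)) {
            // Skip for now; the tail may be recovered on the next pass.
        } else {
            if (!$content_started) {
                // First clean chunk: recover the text after the last NUL
                // of the previous chunk.
                $outtext .= substr($lastline, (int) $lastpos) . " ";
            }
            $content_started = true;
            $outtext .= $thisline . " ";
        }
        $lastline = $thisline;
        $lastpos = $pos;
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}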
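To see the strrpos step in isolation, here is a minimal sketch on a hypothetical chunk (the byte string is invented for illustration and is not taken from a real .doc file). It compares the offsets that strpos and strrpos report and slices out the text after the last NUL byte.

```php
<?php
// Hypothetical chunk: three binary bytes followed by the first paragraph.
$chunk = "\x00\x01\x00First line";

// strpos reports the first NUL byte, strrpos the last one.
$first = strpos($chunk, chr(0x00));
$last  = strrpos($chunk, chr(0x00));

// Everything after the last NUL byte is treated as text content.
$tail = substr($chunk, $last + 1);

var_dump($first, $last, $tail);
```

The workaround above slices from the NUL byte itself and relies on the final preg_replace to strip it. Slicing from `$last + 1` here only makes the result easier to read.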
