PDF text parser

This library is a parser for the XML text output produced by pdftotext.

You can install it using composer require skuola/pdf-text-parser

Suppose you've just converted a PDF file and obtained text like the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.200000" height="841.800000">
    <word xMin="56.640000" yMin="59.770680" xMax="118.022880" yMax="72.406680">Lorem</word>
    <word xMin="121.209960" yMin="59.770680" xMax="176.485440" yMax="72.406680">ipsum</word>
  </page>
</doc>
</body>
</html>

The above text is the result of a command like pdftotext -htmlmeta -bbox-layout yourfile.pdf -.
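If you want to produce that text from PHP rather than from a shell, a minimal sketch could look like the following. It assumes the pdftotext binary is installed and on the PATH; the helper names buildPdfToTextCommand and runPdfToText are hypothetical and not part of this library.

```php
<?php

/**
 * Hypothetical helper (not part of this library): builds the pdftotext
 * command shown above, with the PDF path safely quoted for the shell.
 */
function buildPdfToTextCommand(string $pdfPath): string
{
    // The trailing "-" tells pdftotext to write to standard output.
    return sprintf('pdftotext -htmlmeta -bbox-layout %s -', escapeshellarg($pdfPath));
}

/**
 * Hypothetical helper: runs the command and returns the captured output.
 * Assumes pdftotext is installed and available on the PATH.
 */
function runPdfToText(string $pdfPath): string
{
    $output = shell_exec(buildPdfToTextCommand($pdfPath));

    // shell_exec() returns null or false when nothing could be captured.
    if (!is_string($output) || $output === '') {
        throw new RuntimeException('pdftotext produced no output for ' . $pdfPath);
    }

    return $output;
}

// Usage: $data = runPdfToText('yourfile.pdf');
```

The returned string can then be used as the $data value passed to the Converter.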

You can use this library as follows:

<?php

require_once 'vendor/autoload.php';

$data = '...';  // the text above

$converter = new \Skuola\PdfTextParser\Converter($data);
// get as plain text...
$txt = $converter->getAsText();
// ...or get as HTML
$html = $converter->getAsHtml();

Alternatively, you can save the HTML to a file and pass its path to the library:

<?php

require_once 'vendor/autoload.php';

$path = '...';  // path to a file containing the same text as in the previous example

$converter = new \Skuola\PdfTextParser\Converter(null, $path);
$html = $converter->getAsHtml();

The generated HTML consists of an <h2> tag or a <p> tag for each document line, depending on whether the line is a title.
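To make the notion of a "document line" more concrete, here is a rough sketch of how the <word> elements could be grouped into lines by their yMin attribute. This is an illustration under simple assumptions (words on the same line share exactly the same yMin), not the library's actual implementation; the function name groupWordsIntoLines is hypothetical. Title detection is left out, since this README does not describe how it works.

```php
<?php

/**
 * Illustrative only, not the library's implementation: groups the <word>
 * elements of each <page> into lines, assuming words on the same line
 * share the same yMin value.
 *
 * @return string[] one string per detected line, in document order
 */
function groupWordsIntoLines(string $xhtml): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xhtml)) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }

    $lines = [];
    foreach ($dom->getElementsByTagName('page') as $page) {
        // Group per page, so equal yMin values on different pages stay separate.
        $pageLines = [];
        foreach ($page->getElementsByTagName('word') as $word) {
            $pageLines[$word->getAttribute('yMin')][] = trim($word->textContent);
        }

        foreach ($pageLines as $words) {
            $lines[] = implode(' ', $words);
        }
    }

    return $lines;
}

// Usage: $lines = groupWordsIntoLines($data);
```

A real implementation would likely need a tolerance when comparing yMin values, because words on the same visual line do not always share identical coordinates.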

More information to come...
