=pod

=head1 NAME

README for dta-tokwrap - programs, scripts, and perl modules for DTA XML corpus tokenization

=cut

##======================================================================
=pod

=head1 DESCRIPTION

This package contains various utilities for tokenization of DTA
"base-format" XML documents.

See L</INSTALLATION> for requirements and installation instructions,
L</USAGE> for a brief introduction to the high-level command-line
interface, and L</TOOLS> for an overview of the individual tools
included in this distribution.

=cut

##======================================================================
=pod

=head1 INSTALLATION

=cut

##--------------------------------------------------------------
=pod

=head2 Requirements

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 C Libraries

=over 4

=item expat

tested version(s): 1.95.8, 2.0.1

=item libxml2

tested version(s): 2.7.3, 2.7.8

=item libxslt

tested version(s): 1.1.24, 1.1.26

=back

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 Perl Modules

See F<DTA-TokWrap/README.txt> for a full list of required perl modules.

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 Development Tools

=over 4

=item C compiler

tested version(s): gcc / linux: v4.3.3, 4.4.6

=item GNU flex (development only)

tested version(s): 2.5.33, 2.5.35

Only needed if you plan on making changes to the lexer sources.

=item GNU autoconf (SVN only)

tested version(s): 2.61, 2.67

Required for building from SVN sources.

=item GNU automake (SVN only)

tested version(s): 1.9.6, 1.11.1

Required for building from SVN sources.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Building from SVN

To build this package from SVN sources, you must first run the shell command:

 bash$ sh ./autoreconf.sh

from the distribution root directory B<BEFORE> running F<./configure>.
Building from SVN sources requires additional development tools to be
present on the build system.

Then, follow the instructions in L</"Building from Source">.

=cut

##--------------------------------------------------------------
=pod

=head2 Building from Source

To build and install the entire package, issue the following commands
to the shell:

 bash$ cd dta-tokwrap-0.01   # (or wherever you unpacked this distribution)
 bash$ sh ./configure        # configure the package
 bash$ make                  # build the package
 bash$ make install          # install the package on your system

More details on the top-level installation process can be found in the
file F<INSTALL> in the distribution root directory.

More details on building and installing the DTA::TokWrap perl module
included in this distribution can be found in the F<perlmodinstall(1)>
manpage.

=cut

##======================================================================
=pod

=head1 USAGE

The perl program L<dta-tokwrap.perl|/dta-tokwrap.perl> installed from
the F<DTA-TokWrap/> distribution subdirectory provides a flexible
high-level command-line interface to the tokenization of DTA XML
documents.
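
The expected input format is described under L</"Input Format"> below;
as a quick orientation, a minimal and purely schematic input fragment
(not a complete or official DTA/TEI document) might look like the
following, with the text to be tokenized appearing under a
C<E<lt>textE<gt>> element and optionally one C<E<lt>cE<gt>> element
per character:

 <!-- schematic fragment only, for illustration -->
 <text>
   <body>
     <p>
       <c xml:id="c1">F</c><c xml:id="c2">u</c><c xml:id="c3">ß</c>
     </p>
   </body>
 </text>
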

=cut

##--------------------------------------------------------------
=pod

=head2 Input Format

The L<dta-tokwrap.perl|dta-tokwrap.perl> script takes as its input DTA
"base-format" XML files, which are simply (TEI-conformant) UTF-8
encoded XML files with (optionally) one C<E<lt>cE<gt>> element per
character:

=over 4

=item *

the document B<MUST> be encoded in UTF-8,

=item *

all text nodes to be tokenized should be descendants of a
C<E<lt>textE<gt>> element, and may optionally be immediate daughters
of a C<E<lt>cE<gt>> element (XPath
C<//text//text()|//text//c/text()>). C<E<lt>cE<gt>> elements may not
be nested. Prior to dta-tokwrap v0.38, C<E<lt>cE<gt>> elements were
required.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Example: Tokenizing a single XML file

Assume we wish to tokenize a single DTA "base-format" XML file
F<doc1.xml>. Issue the following command to the shell:

 bash$ dta-tokwrap.perl doc1.xml
 ...

This will create the following output files:

=over 4

=item F<doc1.t.xml>

"Master" tokenizer output file encoding sentence boundaries, token
boundaries, and tokenizer-provided token analyses. Source for various
stand-off annotation formats.

This format can also be passed directly to and from the
L<DTA::CAB(3pm)|DTA::CAB> analysis suite using the
L<DTA::CAB::Format::XmlNative(3pm)|DTA::CAB::Format::XmlNative>
formatter class.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Example: Tokenizing multiple XML files

Assume we wish to tokenize a corpus of three DTA "base-format" XML
files F<doc1.xml>, F<doc2.xml>, and F<doc3.xml>. This is as easy as:

 bash$ dta-tokwrap.perl doc1.xml doc2.xml doc3.xml

For each input document specified on the command line, master output
files and stand-off annotation files will be created. See
L<"the dta-tokwrap.perl manpage"|dta-tokwrap.perl> for more details.

=head2 Example: Tracing execution progress

Assume we wish to tokenize a large corpus of XML input files
F<doc*.xml>, and would like to have some feedback on the progress of
the tokenization process. Try:

 bash$ dta-tokwrap.perl -verbose=1 doc*.xml

or:

 bash$ dta-tokwrap.perl -verbose=2 doc*.xml

or even:

 bash$ dta-tokwrap.perl -traceAll doc*.xml

=cut

##--------------------------------------------------------------
=pod

=head2 Example: From TEI to TCF and Back

Assume we have a TEI-like document F<doc.tei.xml> which we want to
encode as TCF to the file F<doc.tei.tcf>, using only whitespace
tokenizer "hints", but not actually tokenizing the document yet.
This can be accomplished by:

 $ dta-tokwrap.perl -t=tei2tcf -weak-hints doc.tei.xml

If the output should instead be written to STDOUT, just call:

 $ dta-tokwrap.perl -t=tei2tcf -weak-hints -dO=tcffile=- doc.tei.xml

Assume that the resulting TCF document has undergone further
processing (e.g. via
L<WebLicht|http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page>)
to produce an annotated TCF document F<doc.out.tcf>. Selected TCF
layers (in particular the C<tokens> and C<sentences> layers) can be
spliced back into the TEI document as F<doc.out.xml> by calling:

 $ dta-tokwrap.perl -t=tcf2tei doc.out.tcf -dO=tcffile=doc.out.tcf -dO=tcfcwsfile=doc.out.xml

=cut

##======================================================================
=pod

=head1 TOOLS

This section provides a brief overview of the individual tools
included in the dta-tokwrap distribution.
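
Assuming the default installation prefix (see L</INSTALLATION>), the
installed components can be located with shell commands along the
following lines; adjust the paths if you configured a different
prefix:

 bash$ ls /usr/local/bin/dta*                       # perl scripts & C programs
 bash$ ls /usr/local/share/dta-tokwrap/stylesheets  # XSL stylesheets
 bash$ ls /usr/local/share/dta-tokwrap/make         # GNU make template
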

=cut

##--------------------------------------------------------------
=pod

=head2 Perl Scripts & Programs

The perl scripts and programs included with this distribution are
installed by default in F</usr/local/bin> and/or wherever your perl
installs scripts by default (e.g. in
C<`perl -MConfig -e 'print $Config{installsitescript}'`>).

=over 4

=item dta-tokwrap.perl

Top-level wrapper script for document tokenization using the
L<DTA::TokWrap|DTA::TokWrap> perl API.

=item dtatw-add-c.perl

Script to insert C<E<lt>cE<gt>> elements and/or C<xml:id> attributes
for such elements into an XML document which does not yet contain
them. Guaranteed not to clobber any existing C<//c> IDs.
C<//c/@xml:id> attributes are generated by a simple document-global
counter ("c1", "c2", ..., "c65536"). Not really useful as of
dta-tokwrap v0.38, since C<E<lt>cE<gt>> elements are no longer
required.

See L<"the dtatw-add-c.perl manpage"|dtatw-add-c.perl> for more details.

=item dtatw-cids2local.perl

Script to convert C<//c/@xml:id> attributes to page-local encoding.
Never really used.

See L<"the dtatw-cids2local.perl manpage"|dtatw-cids2local.perl> for more details.

=item dtatw-add-ws.perl

Script to splice C<E<lt>sE<gt>> and C<E<lt>wE<gt>> elements encoded
from a standoff (.t.xml or .u.xml) XML file into the I<original>
"base-format" (.chr.xml) file, producing a .cws.xml file. A tad too
generous with partial word segments, due to strict adjacency and
boundary criteria.

In earlier versions of dta-tokwrap, this functionality was split
between the scripts C<dtatw-add-w.perl> and C<dtatw-add-s.perl>,
which required only an I<id-compatible> base-format (.chr.xml) file
as the splice target. As of dta-tokwrap v0.35, the splice target
base-format file must be the I<original> source file itself, since
the current implementation uses byte offsets to perform the splice.

See L<"the dtatw-add-ws.perl manpage"|dtatw-add-ws.perl> for more details.

=item dtatw-splice.perl

Script to splice generic standoff attributes and/or content into a
base file; useful e.g. for merging flat DTA::CAB standoff analyses
into TEI-structured *.cws.xml files.

See L<"the dtatw-splice.perl manpage"|dtatw-splice.perl> for more details.

=item dtatw-get-ddc-attrs.perl

Script to insert DDC-relevant attributes extracted from a base file
into a *.t.xml file, producing a pre-DDC XML format file (by
convention *.ddc.t.xml, a subset of the *.t.xml format).

See L<"the dtatw-get-ddc-attrs.perl manpage"|dtatw-get-ddc-attrs.perl> for more details.

=item dtatw-get-header.perl

Simple script to extract a single header element from an XML file
(e.g. for later inclusion in a DDC XML format file).

See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.

=item dtatw-pn2p.perl

Script to insert C<E<lt>pE<gt>>...C<E<lt>/pE<gt>> wrappers for
C<//s/@pn> key attributes in "flat" *.t.xml files.

=item dtatw-xml2ddc.perl

Script to convert *.ddc.t.xml files and optional headers to DDC-XML
format.

See L<"the dtatw-xml2ddc.perl manpage"|dtatw-xml2ddc.perl> for more details.

=item dtatw-t-check.perl

Simple script to check consistency of tokenizer output (*.t) offset +
length fields against the input (*.txt) file.

=item dtatw-rm-c.perl

Script to remove C<E<lt>cE<gt>> elements from an XML document.
Regex hack, fast but not exceedingly robust, use with caution.

See also L</"dtatw-rm-c.xsl">.

=item dtatw-rm-w.perl

Fast regex hack to remove C<E<lt>wE<gt>> elements from an XML document.

=item dtatw-rm-s.perl

Fast regex hack to remove C<E<lt>sE<gt>> elements from an XML document.

=item dtatw-rm-lb.perl

Script to remove C<E<lt>lbE<gt>> (line-break) elements from an XML
document, replacing them with newlines. Regex hack, fast but not
robust, use with caution.

See also L</"dtatw-rm-lb.xsl">.

=item dtatw-lb-encode.perl

Encodes newlines under C<//text//text()> in an XML document as
C<E<lt>lbE<gt>> (line-break) elements using high-level file
heuristics only. Regex hack, fast but not robust, use with caution.

See also L</"dtatw-ensure-lb.perl">, L</"dtatw-add-lb.xsl">,
L</"dtatw-rm-lb.perl">.

=item dtatw-ensure-lb.perl

Script to ensure that all C<//text//text()> newlines in an XML
document are explicitly encoded with C<E<lt>lbE<gt>> (line-break)
elements, using optional file-, element-, and line-level heuristics.
Robust but slow, since it actually parses XML input documents.

See also L</"dtatw-lb-encode.perl">, L</"dtatw-add-lb.xsl">,
L</"dtatw-rm-lb.perl">.

=item dtatw-tt-dictapply.perl

Script to apply a type-"dictionary" in one-word-per-line (.tt) format
to a token corpus in one-word-per-line (.tt) format. Especially
useful together with standard UNIX utilities such as cut, grep, sort,
and uniq.

=item dtatw-cabtt2xml.perl

Script to convert DTA::CAB::Format::TT (one-word-per-line with
variable analysis fields identified by conventional prefixes) files
to the expanded .t.xml format used by dta-tokwrap. The expanded
format should be identical to that used by the DTA::CAB::Format::Xml
class.

See also L</"dtatw-txml2tt.xsl">.

=item file-substr.perl

Script to extract a portion of a file, specified by byte offset and
length. Useful for debugging index files created by other tools.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 GNU make build system template

The distribution directory F<make/> contains a "template" for using
GNU F<make> to organize the conversion of large corpora with the
dta-tokwrap utilities. This is useful because:

=over 4

=item *

F<make>'s intuitive, easy-to-read syntax provides a wonderful vehicle
for user-defined configuration files, obviating the need to remember
the names of all 64 (at last count)
L<dta-tokwrap.perl|/dta-tokwrap.perl> options,

=item *

F<make> is very good at tracking complex dependencies of the sort
that exist between the various temporary files generated by the
dta-tokwrap utilities,

=item *

F<make> jobs can be made "robust" simply by adding C<-k>
(C<--keep-going>) to the command line, and

=item *

last but certainly not least, F<make> has built-in support for
parallelization of complex tasks by means of the C<-j N>
(C<--jobs=N>) option, allowing us to take advantage of multiprocessor
systems.

=back

By default, the contents of the distribution F<make/> subdirectory
are installed to F</usr/local/share/dta-tokwrap/make/>. See the
comments at the top of F<make/User.mak> for instructions.

=cut

##--------------------------------------------------------------
=pod

=head2 Perl Modules

=over 4

=item L<DTA::TokWrap|DTA::TokWrap>

Top-level tokenization-wrapper module, used by
L<dta-tokwrap.perl|dta-tokwrap.perl>.

=item L<DTA::TokWrap::Document|DTA::TokWrap::Document>

Object-oriented wrapper for documents to be processed.

=item L<DTA::TokWrap::Processor|DTA::TokWrap::Processor>

Abstract base class for elementary document-processing operations.

=back
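
The following minimal sketch is intended only to illustrate how these
classes relate to one another, roughly as the
L<dta-tokwrap.perl|dta-tokwrap.perl> wrapper script uses them; the
method names shown here are assumptions for illustration purposes,
not an authoritative API reference:

 use DTA::TokWrap;

 ## Illustrative sketch only -- method names are assumed; see the
 ## DTA::TokWrap(3pm) and DTA::TokWrap::Intro(3pm) manpages for the
 ## actual calling conventions.
 my $tw  = DTA::TokWrap->new();      ## top-level tokenization wrapper
 my $doc = $tw->open("doc1.xml");    ## per-document wrapper (DTA::TokWrap::Document)
 $doc->makeAll();                    ## run the configured processing steps
 $doc->close();                      ## finish up
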

See the L<DTA::TokWrap::Intro(3pm)|DTA::TokWrap::Intro> manpage for
more details on included modules, APIs, calling conventions, etc.

=cut

##--------------------------------------------------------------
=pod

=head2 XSL stylesheets

The XSL stylesheets included with this distribution are installed by
default in F</usr/local/share/dta-tokwrap/stylesheets>.

=over 4

=item dtatw-add-lb.xsl

Replaces newlines with C<E<lt>lb/E<gt>> elements in the input document.

=item dtatw-assign-cids.xsl

Assigns missing C<//c/@xml:id> attributes using the XSL
C<generate-id()> function.

=item dtatw-rm-c.xsl

Removes C<E<lt>cE<gt>> elements from the input document. Slow but
robust.

=item dtatw-rm-lb.xsl

Replaces C<E<lt>lb/E<gt>> elements with newlines.

=item dtatw-txml2tt.xsl

Converts "master" tokenized XML output format (F<*.t.xml>) to
TAB-separated one-word-per-line format (F<*.mr.t> aka F<*.t> aka
F<*.tt> aka "tt" aka "CSV" aka DTA::CAB::Format::TT aka "TnT" aka
"TreeTagger" aka "vertical" aka "moot-native" aka ...).

See the F<mootfiles(5)> manpage for basic format details, and see the
top of the XSL script for some influential transformation parameters.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 C Programs

Several C programs are included with the distribution. These are
used by the L<dta-tokwrap.perl|dta-tokwrap.perl> script to perform
various intermediate document processing operations, and should not
need to be called by the user directly.

B<Caveat Scriptor>: The following programs are meant for internal use
by the C<DTA::TokWrap> modules only, and their names, calling
conventions, and very presence are subject to change without notice.

=over 4

=item dtatw-mkindex

Splits input document F<doc.xml> into a "character index" F<doc.cx>
(CSV), a "structural index" F<doc.sx> (XML), and a "text index"
F<doc.tx> (UTF-8 text).

=item dtatw-rm-namespaces

Removes namespaces from any XML document by renaming "C<xmlns>"
attributes to "C<xmlns_>" and "C<xmlns:*>" attributes to
"C<xmlns_*>". Useful because XSL's namespace handling is annoyingly
slow and ugly.

=item dtatw-tokenize-dummy

Dummy C<flex> tokenizer. Useful for testing.

=item dtatw-txml2sxml

Converts "master" tokenized XML output format (F<*.t.xml>) to
sentence-level stand-off XML format (F<*.s.xml>).

=item dtatw-txml2wxml

Converts "master" tokenized XML output format (F<*.t.xml>) to
token-level stand-off XML format (F<*.w.xml>).

=item dtatw-txml2axml

Converts "master" tokenized XML output format (F<*.t.xml>) to
token-analysis-level stand-off XML format (F<*.a.xml>).

=back

=cut

##======================================================================
=pod

=head1 SEE ALSO

perl(1).

=head1 AUTHOR

Bryan Jurish E<lt>moocow@cpan.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software. Redistribution and modification of the
C portions of this package are subject to the terms of version 3 or
greater of the GNU Lesser General Public License; see the files
COPYING and COPYING.LESSER which came with the distribution for
details.

Redistribution and/or modification of the Perl portions of this
package are subject to the same terms as Perl itself, either Perl
version 5.24.1 or, at your option, any later version of Perl 5 you
may have available.

=cut