Extract comments and target texts from iWork Pages

Background and rationale

Writing a moderate-length review with some 80 citations in iWork Pages went well until the decision to submit it to a journal that uses numbered citations.

Unfortunately, Zotero has no plugin for Pages, so numbered references can only be automated by exporting from Pages to an RTF file and running Zotero RTF Scan on it. Sadly, though, Zotero RTF Scan is rather flaky (especially for Turkish names) and its disambiguation is tedious. So numerical citations were typed in directly, each with a comment containing its author-year citation, on the assumption that these comments could somehow be extracted or printed and used to create the bibliography.

Great idea, but wait, there is no obvious way to extract or print comments from iWork Pages (post iWork '09).

So the following method was developed using an R script.

Procedure

The Pages document is exported in Pages '09 format. The script imports from document.09.pages, so the export should use that filename. The exported file is saved in the same folder as the script, and the script is run from that location.

The script extracts index.xml from the Pages '09 file (which is a compressed archive file).
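The extraction step might look roughly like the following. This is a minimal sketch, not the code from comments.R: it assumes the exported file is a zip-compatible archive with index.xml at its top level, and the function name extract_index_xml is made up for illustration.

```r
# Sketch only: extract_index_xml() is a hypothetical helper, not part of comments.R.
# Assumes the Pages '09 export is a zip-compatible archive containing index.xml.
extract_index_xml <- function(pages_file = "document.09.pages",
                              exdir = tempdir()) {
  if (!file.exists(pages_file)) {
    stop("Export not found: ", pages_file)
  }
  # List the archive members first so a missing index.xml fails with a clear message.
  members <- utils::unzip(pages_file, list = TRUE)$Name
  if (!"index.xml" %in% members) {
    stop("No index.xml in ", pages_file)
  }
  # unzip() returns the path(s) of the extracted file(s).
  utils::unzip(pages_file, files = "index.xml", exdir = exdir)
}

# Only runs when the export is present in the working directory.
if (file.exists("document.09.pages")) {
  xml_path <- extract_index_xml()
  xml_text <- paste(readLines(xml_path, warn = FALSE, encoding = "UTF-8"),
                    collapse = "\n")
}
```

Extracting into a temporary folder keeps the working folder clean; pass exdir = "." to place index.xml next to the script instead.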

The script uses regular expressions to collect the comments and the target texts, writing them to comments.csv in the same folder. The CSV file also contains the tags used in index.xml, but these are only there so that the extraction can be checked visually.
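The general regex approach can be illustrated on a toy string. The tag names below (mark, note) and the sample text are placeholders invented for this sketch; they are not the tags Pages actually writes to index.xml, and the pattern is not the one used in comments.R.

```r
# Toy input with HYPOTHETICAL tags: <mark> wraps the target text and
# <note> holds the comment. Real index.xml uses different markup.
xml_text <- paste0(
  '<p>See <mark id="c1">[1]</mark><note ref="c1">Riley 2018</note> and ',
  '<mark id="c2">[2]</mark><note ref="c2">Smith 2001</note>.</p>'
)

# Capture groups: 1 = id, 2 = target text, 3 = comment text.
# \\1 requires the note to refer back to the same id as the mark.
pattern <- '<mark id="([^"]+)">(.*?)</mark><note ref="\\1">(.*?)</note>'

# All full matches, then the capture groups of each match.
hits  <- regmatches(xml_text, gregexpr(pattern, xml_text, perl = TRUE))[[1]]
parts <- regmatches(hits, regexec(pattern, hits, perl = TRUE))

comments <- data.frame(
  target  = vapply(parts, `[`, character(1), 3),  # element 1 is the full match
  comment = vapply(parts, `[`, character(1), 4),
  tag     = hits,                                 # raw markup, kept for visual checking
  stringsAsFactors = FALSE
)

# Written to a temporary folder here so the sketch does not overwrite a real comments.csv.
out_file <- file.path(tempdir(), "comments.csv")
write.csv(comments, out_file, row.names = FALSE)
```

Keeping the raw matched markup as a column mirrors the idea of including the tags in the CSV so the result can be checked by eye.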

The extraction of the comments and target texts has been tested for comments on unformatted text, some formatted text (e.g. text set in italics), text with tracked changes, text in footnotes and text spanning paragraph boundaries (in the latter case inserting "<br>" as an indicator of a new paragraph). However, it has not been widely tested in other contexts, so further provisions for exceptional cases might need to be coded. Also, it will not work on files saved directly from Pages '09.
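As an illustration of the paragraph-boundary handling, a target text that spans two paragraphs could be cleaned along these lines. This is a sketch under assumptions: the p and i tags stand in for whatever paragraph and formatting markup index.xml really uses, and clean_target is a hypothetical name, not a function from comments.R.

```r
# Sketch only: clean_target() is hypothetical, and <p>/<i> are stand-in tags.
clean_target <- function(x) {
  # A closing paragraph tag followed by an opening one becomes the "<br>" marker.
  x <- gsub("</p>\\s*<p[^>]*>", "<br>", x, perl = TRUE)
  # Drop any other tag (e.g. italics), leaving the "<br>" markers in place.
  x <- gsub("<(?!br>)[^>]+>", "", x, perl = TRUE)
  x
}

cleaned <- clean_target("end of <i>one</i> paragraph</p><p>start of the next")
print(cleaned)
```

The negative lookahead in the second pattern needs perl = TRUE; without it the "<br>" markers would be stripped along with the other tags.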

Examples

R code

R code file

@copyright: 2018 Ian Riley <ian@riley.asia>

License

GNU GPL, see COPYING for details.

Known issues and limitations



comments.R (last edited 2019-04-10 17:52:17 by IanRiley)