Preamble

I often need to convert from HTML to t2t.

There is already https://txt2tags.googlecode.com/svn/trunk/extras/unhtml.vim but the result is not always clean and it's not scriptable.

Pandoc (http://johnmacfarlane.net/pandoc/) can also convert from HTML to several other formats, but I didn't manage to adapt it to txt2tags (it seems complicated to compile, so I didn't go further than a simple installation).

But I've just discovered this handy piece of software (in Perl): HTML-WikiConverter.

It converts from HTML to some wiki formats, such as DokuWiki or MediaWiki. The good news is that it's very easy to adapt to a new syntax. For example, part of the definition file looks like this:

    b => { start => '**', end => '**' },
    strong => { alias => 'b' },
    i => { start => '//', end => '//' },
    em => { alias => 'i' },
    u => { start => '__', end => '__' },

I've created a txt2tags export for it; installation and use are described below.

Installation

  • General installation:
    • Get and install HTML-WikiConverter first: http://search.cpan.org/dist/HTML-WikiConverter/

    • Get a released version of the txt2tags importer there: Attach:HTML-WikiConverter-Txt2tags-0.03.zip

    • Unzip and install this module (see README if needed):
      • perl Makefile.PL
      • make
      • # make test (is not working at the moment)
      • make install

  • Alternative installation: on a Debian-like Linux distribution, you can use these commands:
    • sudo apt-get install libhtml-wikiconverter-perl
    • cpan -f install HTML::WikiConverter::Txt2tags

Use

Then you can invoke it this way:

html2wiki --dialect Txt2tags file.html

You can even get remote files and convert them like this:

curl --silent http://theody.net/elements.html | html2wiki --dialect Txt2tags > elements.t2t
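
To convert a whole directory of local files in one go, the same command can be wrapped in a small shell loop. This is only a minimal sketch, assuming html2wiki is on the PATH; the function name convert_dir and the HTML2WIKI override variable are hypothetical, introduced here for illustration.

```shell
# Convert every *.html file in a directory into a .t2t file next to it.
# HTML2WIKI is a hypothetical override variable (not part of html2wiki):
# when unset, the real html2wiki command is used.
convert_dir() {
    dir=${1:-.}
    for src in "$dir"/*.html; do
        [ -e "$src" ] || continue      # nothing matched: the glob stays literal
        dst=${src%.html}.t2t           # foo.html -> foo.t2t
        ${HTML2WIKI:-html2wiki} --dialect Txt2tags "$src" > "$dst" || return 1
    done
}
```

Called as convert_dir ./pages, it would write pages/foo.t2t for each pages/foo.html and stop at the first file that fails to convert.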

Improve it!

sample.html is the txt2tags output of sample_org.t2t. You can compare sample_org.t2t with sample_test.t2t, the file converted back from that HTML, and see the differences. The result is quite close to the original!

What is not working very well:

  • Lists get extra whitespace at the beginning of each line. This can be changed with this line in Txt2tags.pm:
    ul => { line_format => 'multi', block => 1, line_prefix => ' ' },
    But if I remove the extra space from line_prefix, every list then starts at the beginning of the line, which is worse.

  • I'd like to make br behave like p, because after conversion from HTML a single line break is not shown in the final (output) document. This part is responsible for that:
    br => { start => "\n", trim => 'leading'  },
    I tried to use start => "\n\n", but the extra \n has no effect.

    On the other hand, users can see the problem, so they can correct it manually. Or we could add ''<br/>'' in the code.

    I still don't know which option is best.

  • Headers end up too close to the previous line, so some of them won't be processed by txt2tags later. I tried to add an extra \n before the =, to be sure, but again this \n has no effect:

    for( 1..5 ) {
        my $str = ( '=' ) x ( $_ );
        $rules{"h$_"} = { start => "\n$str ", end => " $str", block => 1, trim => 'both', line_format => 'single' };
    }

  • For pre, in the sample document it adds extra line breaks around the ```, because of this code:

    pre => { start => "```\n", end => "\n```", block => 1, line_format => 'blocks', trim => 'none' },

    It looks ugly, but please don't remove the \n for now, because with some HTML the line breaks would go missing:

    ```like this


    end of pre```

  • Badly formed HTML is a problem, for example when people use a WYSIWYG tool and select extra spaces around the words they want to underline or italicize. This results in a // bad export// that won't display correctly.

  • Some parts specific to the original DokuWiki.pm are still there, so the code should be cleaned up. I've kept DokuWiki.pm as a reference; we could remove it later.

  • The credits could probably be improved.

  • There is a Perl CGI for html2wiki: we could add txt2tags to it and run it on our own server.
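
Until the header rule is fixed in Txt2tags.pm, the header problem listed above could be worked around by post-processing the converted text. The following is only a sketch under assumptions: it presumes each header already sits on its own line in the "= Title =" form produced by the rule above, and that a missing empty line is the only problem. The function name t2t_space_headers is hypothetical.

```shell
# Insert an empty line before every txt2tags title line ("= Title =",
# "== Title ==", ...) unless the previous line is already empty.
# Reads the converted text on stdin and writes the result to stdout.
t2t_space_headers() {
    awk '
        /^=+ .* =+$/ && NR > 1 && prev != "" { print "" }
        { print; prev = $0 }
    '
}
```

It would be placed at the end of the pipeline, for example: html2wiki --dialect Txt2tags file.html | t2t_space_headers > file.t2t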