MediaWiki
If you want to import articles from MediaWiki, you can create an XML dump of the pages and then use a Django management command to import it. The management command is not provided as part of django-wiki, but we'll show you how to build one for your own Django app.
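How you create the dump depends on your access to the MediaWiki installation: the Special:Export page can export a chosen set of pages from the browser, and with shell access the dumpBackup.php maintenance script can dump the current revision of every page, e.g. (run from the MediaWiki installation directory; paths vary by setup):

php maintenance/dumpBackup.php --current > dump.xml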
In the management command, we are going to use the lxml library to parse the MediaWiki XML and the unidecode library to transliterate non-latin characters to ASCII (so that slugs can be generated for them). Finally, it uses pandoc to convert the MediaWiki markup to GitHub Flavored Markdown (which renders fine in django-wiki).
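As a quick illustration of the slug handling (the same unidecode + slugify combination used by the slugify2 helper defined further down), a non-ASCII title becomes a plain-ASCII slug:

from django.template.defaultfilters import slugify
import unidecode

# "Städte" is transliterated to "Stadte", then slugified.
print(slugify(unidecode.unidecode("Städte und Gemeinden")))  # stadte-und-gemeinden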
lxml and unidecode can be installed with pip install lxml unidecode, and pandoc can be downloaded from https://pandoc.org/installing.html (make sure the pandoc binary is in your PATH).
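To check that pandoc is installed and handles the conversion, you can pipe a small piece of MediaWiki markup through it:

echo "''italic'' and '''bold'''" | pandoc -f mediawiki -t gfm

which should print *italic* and **bold**.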
The following snippet of code should be placed in <your-app>/management/commands/import_mediawiki_dump.py:
from django.core.management.base import BaseCommand
from wiki.models.article import ArticleRevision, Article
from wiki.models.urlpath import URLPath
from django.contrib.sites.models import Site
from django.template.defaultfilters import slugify
import unidecode
from django.contrib.auth import get_user_model
import datetime
import pytz
from django.db import transaction
import subprocess
from lxml import etree


def slugify2(s):
    # Transliterate non-latin characters to ASCII before building the slug.
    return slugify(unidecode.unidecode(s))


def convert_to_markdown(text):
    # Pipe the MediaWiki markup through pandoc to get GitHub Flavored
    # Markdown. communicate() avoids pipe-buffer deadlocks on large pages.
    proc = subprocess.Popen(
        ["pandoc", "-f", "mediawiki", "-t", "gfm"],
        stdout=subprocess.PIPE,
        stdin=subprocess.PIPE,
    )
    stdout, _ = proc.communicate(text.encode("utf-8"))
    return stdout.decode("utf-8")


def create_article(title, text, timestamp, user):
    # Remove MediaWiki behavior switches that have no Markdown equivalent.
    text_ok = (
        text.replace("__NOEDITSECTION__", "")
        .replace("__NOTOC__", "")
        .replace("__TOC__", "")
    )
    text_ok = convert_to_markdown(text_ok)
    article = Article()
    article_revision = ArticleRevision()
    article_revision.content = text_ok
    article_revision.title = title
    article_revision.user = user
    article_revision.owner = user
    article_revision.created = timestamp
    article.add_revision(article_revision, save=True)
    article_revision.save()
    article.save()
    return article


def create_article_url(article, slug, current_site, url_root):
    # Mount the article directly below the wiki root.
    upath = URLPath.objects.create(
        site=current_site, parent=url_root, slug=slug, article=article
    )
    article.add_object_relation(upath)


def import_page(current_site, url_root, text, title, timestamp, replace_existing, user):
    slug = slugify2(title)
    try:
        urlp = URLPath.objects.get(slug=slug)
        if not replace_existing:
            print("\tAlready existing, skipping...")
            return
        print("\tDestroying old version of the article")
        urlp.article.delete()
    except URLPath.DoesNotExist:
        pass
    article = create_article(title, text, timestamp, user)
    create_article_url(article, slug, current_site, url_root)


class Command(BaseCommand):
    help = "Import everything from a MediaWiki XML dump file. Only the latest version of each page is imported."

    def add_arguments(self, parser):
        parser.add_argument("file", type=str)

    @transaction.atomic()
    def handle(self, *args, **options):
        # Imported pages are owned by the hard-coded "root" user;
        # adjust to taste (see the notes below).
        user = get_user_model().objects.get(username="root")
        current_site = Site.objects.get_current()
        url_root = URLPath.root()
        tree = etree.parse(options["file"])
        # local-name() makes the queries independent of the dump's
        # versioned XML namespace.
        pages = tree.xpath('//*[local-name()="page"]')
        for p in pages:
            title = p.xpath('*[local-name()="title"]')[0].text
            print(title)
            revision = p.xpath('*[local-name()="revision"]')[0]
            text = revision.xpath('*[local-name()="text"]')[-1].text
            timestamp = revision.xpath('*[local-name()="timestamp"]')[0].text
            timestamp = datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
            timestamp_with_timezone = pytz.utc.localize(timestamp)
            import_page(
                current_site,
                url_root,
                text,
                title,
                timestamp_with_timezone,
                True,
                user,
            )
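For reference, the XPath expressions above assume the usual layout of a MediaWiki export file, reduced here to a hypothetical minimal example. Real dumps declare a versioned namespace (e.g. http://www.mediawiki.org/xml/export-0.10/), which is why the queries match on local-name() instead of on prefixed tag names:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example page</title>
    <revision>
      <timestamp>2020-01-01T12:00:00Z</timestamp>
      <text>'''Bold''' MediaWiki markup</text>
    </revision>
  </page>
</mediawiki>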
Usage
Once the management command is provided by your Django application, you can invoke it from the command-line:
python manage.py import_mediawiki_dump <mediawiki-xml-dump-file>
Further work and customizing
Please note the following:

- The script looks up a user named root to assign as the owner of the imported pages. You can set that to None or use a user of your own; a sketch of making this configurable follows this list.
- Multiple revisions per page have not been implemented. Instead, the script picks the text of the last revision element (text = revision.xpath('*[local-name()="text"]')[-1].text). Because of this, it is recommended to include only the latest revision of each article in your MediaWiki dump.
- You can pass True or False to import_page() in order to replace or skip existing pages.
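If you need a different owner, one possible refinement (a sketch only; the --owner flag is hypothetical, not part of django-wiki) is to expose the user as a command-line option by changing the relevant parts of the Command class above:

    def add_arguments(self, parser):
        parser.add_argument("file", type=str)
        # Hypothetical option: which existing user should own the imported pages.
        parser.add_argument("--owner", type=str, default="root")

    def handle(self, *args, **options):
        user = get_user_model().objects.get(username=options["owner"])
        # ... rest of handle() unchanged ...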