Wikipedia / Wikivoyage articles as POI in Navigator
  • Hi,

    I wrote a python script that writes an "xml like" txt file, a GPX file, and/or a .CSV file and/or an .OSM file (and I plan to add sqlite db files) directly from the bz2 compressed dumps from http://dumps.wikimedia.org/. I plan to release it as GPL-3 somewhat later.
    I only extract pages that contain coordinates; without coordinates they are useless as POIs.
    However, I run into some limitations of the MCA file structure and indexing (I think).

    Note: This is not a question about how to create mca files. I know very well how to do that and also wrote a howto a year or so ago. I can't find it myself though :)

    For example: for the Dutch wikipedia I have 183,966 articles with coordinates out of 2,626,774 pages. I already limit the text section to 750 characters, otherwise the POI info/select screen after a search can't handle it (but I could easily make it even shorter).
    I have "NAME, LATITUDE, LONGITUDE, URL, CONTENT", where content is of course the text section limited to 750 characters. URL is the website of the object in the article (only available for a limited number: 16,984), not the wikipedia article url itself. I will build that in as well, so that people with a data bundle or wifi can easily access the full wiki page if they want to.
    I want the content to be displayed as a "Note" of course, but to how many characters is this limited? Both physically and "nicely working" (see my remark about the info/select screen).
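
    For illustration, here is a rough Python 3 sketch of how such a row could be written. This is not the actual script (which I haven't released yet); the function name and sample data are made up:
    import csv

    MAX_CONTENT = 750  # current limit for the "Note" text

    def write_poi_rows(articles, path):
        # articles: iterable of dicts with name, lat, lon, url and content keys
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh, delimiter=";", quotechar='"', quoting=csv.QUOTE_ALL)
            for art in articles:
                writer.writerow([
                    art["name"],
                    art["lat"],
                    art["lon"],
                    art.get("url", ""),            # website of the object, often missing
                    art["content"][:MAX_CONTENT],  # keep the note short for the info screen
                ])

    # made-up example row
    write_poi_rows([{"name": "Domtoren", "lat": 52.0907, "lon": 5.1214,
                     "url": "", "content": "De Domtoren is ..."}], "nlwiki_poi.csv")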

    I can do this relatively easily in OsmAnd (hence the OSM file as basis for OsmAndMapCreator and other apps), but I prefer to use it in MNF.

    I already did more or less the same for the Dutch nlwikivoyage articles (2,615 in total) and the English enwikivoyage (228,500 in total). The Dutch one works fine; the English one searches very slowly and crashed MNF twice (but maybe that was due to incorrect "things" inside the mca).

    Another serious and really annoying limitation of digger is its CSV (Comma Separated Values) import: it can only handle CSV files that are really separated by a ",", even though the standard has allowed other separators for decades. You can't imagine the number of typos in these more than 127,255,775 lines of Dutch articles resulting in the occurrence of "," or the like. I really need to use ";" as separator. Anyway: I import the ";"-separated csv into sqlite and immediately export it to a ","-separated csv file for digger, but this shouldn't be necessary.
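
    (Just to illustrate what I mean: a minimal Python 3 sketch that converts a ";"-separated file into the ","-separated form digger expects, letting the csv module re-quote any commas inside the fields. The file names are only examples.)
    import csv

    with open("wiki_semicolon.csv", newline="", encoding="utf-8") as src, \
         open("wiki_comma.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=";", quotechar='"')
        writer = csv.writer(dst, delimiter=",", quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(row)   # commas inside fields end up safely quoted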

    So please @mdx, @tomas and @lubos, give me some answers. I am still building out the functionality, but if MNF stops/limits me I will stop the MNF part.
  • 27 Comments
  • I do not know the answers (only mdx knows).
    Can you share your mca(s) with us?
    I really like your idea, I would like to see it.
  • Hi,

    Please find attached one for the Czech language (wikipedia articles are by language, not by country), which I created over lunch today (7306 entries). I had to reduce the content length to 600 characters, otherwise it does not work correctly on my phone screen (1280x720) when you select the POI and want to "navigate, set as destination, show on map, etc.". A feature option here would be the ability to scroll the top half where the text is, just like the bottom half where the navigation options are.
    It mentions "Wikipedia page" and "Article url" on top and I translated that (using google) as "Wikipedia webová stránka" and "článek url".
    These names (and translations) need some work though.

    It still needs some "fine-tuning". Links do work, but some parts of the text are highlighted as links although they are not links.
  • And about the digger csv import issue: I already had the "xml like" text file. It was 5 minutes of work to change that into a GPX waypoint format file. That one is easily processed by digger.
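
    Roughly like this (a Python 3 sketch with xml.etree; the exact waypoint layout is just an example, not necessarily what my script writes):
    import xml.etree.ElementTree as ET

    def build_gpx(articles):
        gpx = ET.Element("gpx", version="1.1", creator="wikiscripts")
        for art in articles:
            # one <wpt> per article, with the article text as description
            wpt = ET.SubElement(gpx, "wpt", lat=str(art["lat"]), lon=str(art["lon"]))
            ET.SubElement(wpt, "name").text = art["name"]
            ET.SubElement(wpt, "desc").text = art["content"]
        return ET.ElementTree(gpx)

    build_gpx([{"name": "Domtoren", "lat": 52.0907, "lon": 5.1214,
                "content": "De Domtoren is ..."}]).write(
        "nlwiki.gpx", encoding="utf-8", xml_declaration=True)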
  • Hi hvdwolf,
    do you have your python script visible somewhere? I suppose that if you have
    >>> import csv
    >>> f = csv.writer(open("tmp.csv", "wb"))
    >>> text = "info, with comma"
    >>> note = 'potential "problem"'
    >>> f.writerow( [text, note] )

    then the csv file looks like
    cat tmp.csv
    "info, with comma","potential ""problem"""

    and digger should accept it. If you have any counterexample, where digger fails for a valid CSV file, please let me know
    thanks
       Martin

  • Hi Martin,

    The , (comma) as such is not a problem. The "info, with comma","potential ""problem""" of your example works fine. It is not a problem of building the csv, it is the text in the wikipedia articles.

    The problem is with constructions (typos) like "info "," with comma". Every now and then these constructions turn up where editors did not pay enough attention to spaces. The double quotes are used regularly. It does not happen very often, but when parsing over 100,000,000 lines it occurs more than once (twice, thrice, etc.).

    My script is currently extremely "rough cut". I want to polish it at least a little.

    Edit: 
    Anyway. I use this line to open my csv file for writing.
    csv_file = csv.writer(open(write_to_csv_file, 'wb'), delimiter=';',  quotechar='"', quoting=csv.QUOTE_ALL)

    If I use csv_file = csv.writer(open(write_to_csv_file, 'wb'), delimiter=',',  quotechar='"', quoting=csv.QUOTE_ALL) I get the issues with texts that themselves contain ",", as described above.
  • Hi hvdwolf,
    I still do not understand what the problem with csv is - in your case
    >>> from csv import writer
    >>> a = '"info ","'
    >>> a
    '"info ","'
    >>> f = writer(open("tmp.csv", "wb"))
    >>> f.writerow( [a] )

    cat tmp.csv
    """info "","""

    So is the problem that digger then does not open such a file?
  • Hi Martin,

    You are right and I was wrong. I use various "cleaning" search & replace operations to make the text strings a bit cleaner. I had an error in there. Maybe it was not even an error, but at least it generated incorrect csv strings.
    I changed it, and then my gpx strings did not work correctly anymore. So for the time being I removed it.
    Now both my gpx file and my comma-separated (",") csv file work in diggerQT.




  • @hvdwolf
    You are what we call in German "positiv verrückt"; I don't know if it can really be translated into English, perhaps as "positively crazy".

    Chapeau to what you are doing here. It is indeed very very interesting and certainly far above my own capabilities.
  • @mdx or @lubos,

    Are there "forbidden, unnallowed" characters in the mca files?
    Until now I created a NL, a CS and a DE wikipedia. The NL and CS are working fine. The DE crashes MNF. Of course the structure is the same and the mca is also created correctly without errors in the Digger.log.
    So I assume that some German characters are not allowed, even though it should be fully utf-8.
  • There could be a problem with characters of 3 or more bytes (Chinese, Japanese, ...). When did it crash? Just during map display, or when you searched?
  • MNF crashes when I try to search. If I don't "touch" the POI search, MNF remains stable.
    The DE wikipedia icons are not displayed on the map either. Some of the wiki articles are the same in Czech, Dutch and German, but only the Czech and Dutch ones are displayed.

    I think I will start the English wikipedia tonight, but that one is 12GB compressed and about 5 times as much uncompressed. I also think that MNF can't handle it.
  • I'm still working on it but it's tedious. I have written many python scripts already, but have now chosen a new strategy.

    Still breaking on some languages.

    @mdx: You mention characters of 3 or more bytes (Chinese, Japanese, ...).
    Do you mean utf-16 characters? Because all the dumps are utf-8 encoded.
  • No, I really meant utf-8 encoding, where a character is stored in more than 2 bytes, for example:
    http://www.utf8-chartable.de/unicode-utf8-table.pl?start=64512
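    A quick way to check this in Python 3 (the sample characters are just illustrations):
    # byte length of a few characters in utf-8
    for ch in ("a", "é", "€", "北"):
        print(repr(ch), len(ch.encode("utf-8")), "bytes")
    # 'a' -> 1, 'é' -> 2, '€' -> 3, '北' -> 3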
  • I see what you mean, but it happened for the Polish and German languages. I had not even touched the Arabic, Chinese or Japanese languages yet. Maybe I will get more problems there.
  • Hi all, almost there.

    I completely rewrote the scripts using a new approach. I also switched to "in memory" processing, as processing the wikipedia dumps can literally take days, up to even a week! The downside of this "in memory" approach is that if anything goes wrong, you need to redo everything.

    I created a github repository for the scripts at https://github.com/hvdwolf/wikiscripts
    (I already had a repository there).
    You can also find a wikivoyage_mca.zip there containing mca files for the languages DE, EN, ES, FR, IT, NL, PL, PT, RO, VI, SV. (wikipedia/wikivoyage is per language, not per country.)
    Wikivoyage is a spin-off of wikipedia and only contains articles like "where to go / what to see / where to eat, etc.". It is a (mini) subset of wikipedia articles aimed at tourists. Also: wikipedia is currently in 43 languages, wikivoyage in 11 (or so). Another problem is that my script currently can't handle multi-byte languages like Greek, Chinese, Korean, Japanese and a few more. I use utf-8, but python (and other programming languages) need special approaches to string handling for these languages.

    Note: Copy only one mca to your Navigator data folder. Multiple files will slow down MNF.

    Currently I have 2 laptops and 1 mini bananapi server running on some of the wikipedia dumps. I will deliver those later. My third laptop suddenly has a non-functioning screen.

    Please also have a look at the HowTo pdf.

    Note also that I'm constantly making changes to the repository.

    Edit: I also changed the title of this thread as it is also about wikivoyage now.
  • In case anyone wants to try the scripts: don't do it yet. I just rewrote one of the scripts again (partly, that is). I finally got some intelligence back and changed the parsing approach itself. The scripts are now way faster, but I have to propagate some of the changes to the other scripts as well and update the HowTo.


  • By now I have created a great number of wikipedia files. Still no languages with "real" multi-byte characters though, as I still have issues parsing those wikipedia dumps (Russian, Greek, Hebrew, Chinese, Japanese, Farsi, Hindi, Arabic, etc.).
    For the time being you can find the created mca files on my website, or directly from the wikipedia or wikivoyage pages on my website. I hope that at some point they will be hosted on the Mapfactor side or become downloadable from within MNF.

    Note: Searching can be slow as the wikipedia mca's can contain many items/articles (EN: 189,928; DE: 101,557; FR: 89,031; NL: 97,642). Some of the other, currently "unparsable" ones are that big as well.

    Note2: Don't use multiple wikipedia.mca's together. It will slow down your MNF.

    Note3: The wikipedias are per language. The DE wikipedia contains German articles about places all over the world.

    Note4: The articles are currently limited to 600 characters. The info pane is not scrollable right now, and more characters block other functions. I tested this on my own 1280x720 phone; phones with a lower resolution might be worse. Around 1000 characters would be optimal (I think) with regard to readability, conciseness and completeness.

  • I solved the issue with the Russian, Greek, Hebrew, Chinese, etc. languages. It turned out that the titles were "percent encoded" (I had never heard of that). So I can now create mca files in all languages.
    So far I have created mca files for the 47 languages having more than 100,000 articles in wikipedia (and 17 for all available wikivoyage languages).
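
    (For anyone running into the same thing: in Python 3 a percent-encoded title can be decoded with urllib.parse.unquote; the title below is just an example, not one from my dumps.)
    from urllib.parse import unquote

    title = "%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0"   # percent-encoded utf-8
    print(unquote(title))   # -> Москва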

    I now encounter a new problem. Some of the wikipedia mca files make MNF crash, like the Russian, Chinese, Japanese, Caucasian and a few others. The strange thing is that Arabic and Hebrew, for example, do work (and they are "right-to-left" languages, where I expected more issues).

    This is in the experimental 0.38.78-Android_gles_renderer version.
    I set debug level to high. Please find the log files here
    Edit: I also added a zip with all csv files here (156MB)
  • I have added both DE files (wikipedia and wikivoyage) to my MNF installation (on my tablet it is the gles version). But maybe I am missing some point. When searching the POIs, I see a lot of places for which I have no map on my phone, starting with Acapulco, Mexico; but with very few exceptions I only have maps of Germany's neighbouring countries on my phone (nl, cz, pl, lu, fr, etc.). Why should I have info on Acapulco on my phone when there is no map for it?

    When I tap the place, I do get a lot of information about it, but I have no option to navigate, show on map etc.
  • Hi,

    That's what I mentioned: wikipedia/wikivoyage is per language and can contain articles from all over the world. It is not limited to the maps you have downloaded. The drawback is that you also get a lot of articles that are not covered by your downloaded maps. I already have the country/region as part of the metadata, but I can't do anything with that at the moment in an mca file as created by digger.
    I do think that Mapfactor has the option to use the country as a discriminator for searching.

    I need to discuss with Mapfactor whether they think it is worthwhile to do something together.
    And actually: I would prefer them to take over. I'm only a hobby programmer, they have professionals.

    I was about to post a request for feedback from both users and Mapfactor.

    With regard to your last remark ("When I tap the place, I do get a lot of information about it, but I have no option to navigate, show on map etc."):
    This is a bug/omission in MNF. The application is not written for POIs containing this much information, so the bottom half of the popup is not visible. The GLES version has solved this now: it only shows the first 3 lines of the article.
  • I updated the scripts.
    The country and region are now also written as dedicated data fields to the csv, osm and sql files.

    I did not update the mca files, as that information is not used there anyway.
  • When navigating recently I passed an icon from those two files, which I found more interesting than just searching for POIs in the menu.
    The info there was general information about a suburb of the city, which was not that interesting. This of course depends on what is published in wikipedia and wikivoyage. I found that there is only very limited information regarding the city of Chemnitz when browsing wikivoyage. Now I wonder whether it is worthwhile for me to take the time and work on that. It seems to me that not many people would read it ...
  • It will be a kind of "niche" area anyway. Of course there are people interested in wikipedia and wikivoyage, just like there are people interested in e.g. "all hotels in the United Kingdom", "all camper places in Europe", or "all liquid gas fuel stations in Germany".
    But these are all relatively small groups of interested people.
    Until now I only got responses from you and from Lubos and mdx. And nobody upvoted (the blue button on the left) the first post of this thread. So it looks as if people are not interested.
    That's also a reason why I slowed down. If nobody is interested, why invest the time?
    Another reason is that MNF needs a few adaptations/enhancements to properly handle these huge mca POI files. The development versions already have some modifications, but my assumption is that those are still far from reaching the Google Play versions.

    I still think it is a valuable addition, but it needs some more time.
  • At least you got an upvote from me now. I never bothered to find out what this blue field means :-o
  • Some of you might ask why I did not update this further after having put quite some time and effort into it.
    There were still some optimisations I wanted to make. Currently I take, for example, the German wikipedia and collect all articles with coordinates from all over the world. One improvement would be to only include articles about locations inside Germany and, if an article about something in Germany is only available in the English wikipedia, add that one to the German mca.

    The fact that I stopped is simply due to the reduced functionality in MNF. I started with articles of max. 1000 characters. They could not be displayed correctly by MNF. Then I switched to 600 characters (less optimal in my view, but OK), which could be displayed both in search mode and on the map when you selected a POI and clicked the info button (at the right of the screen).
    Then the displayed information was reduced to 3 lines of text in one of the newer versions of MNF, which is OK when searching, or for a short overview from the map when you "run into" a wikipedia icon. However, it was restricted to 3 lines of text and there was no way to access the rest of the text anymore. When you put 600 characters in an article and you can only see the first 120-150 (or so), these files immediately become useless.
    Especially as I structured the data to show the title on the first line, some "metadata" on the second line and the article starting on the third line, effectively reducing the visible text to max. 80 characters.

    If some extra info button or option becomes available to display the complete text again, I will continue with my efforts.

    Another reason was that there was not so much interest, but I also blamed that on the fact that this wikipedia/wikivoyage addition was at such an early stage.

    Too bad :(
  • I disabled the reduction to 3 lines when you click on info from the map (in search the reduction is still on) in version 2.0.17+.
  • Thanks. I will await the release then. I just tried but it was still at 2.0.16.
