Feed Block example with mandatory options

[Feed VAOS Dokuwiki]
url = "http://dokuwiki.virtual-access.org/feed.php"
title = "Dokuwiki"
vapath = "VAOS/Dokuwiki"
version = "rss1"
version_date = "rss1"
charset = "utf-8"
fetch_links = 1
disabled = 0

Main options in detail (mandatory)

[Feed VAOS Dokuwiki]

The name of the feed. It has to start with "Feed" and should contain only ASCII characters and numbers. This line marks the start of a new feed block.

url = "http://dokuwiki.virtual-access.org/feed.php"

The URL of the feed as given by the website. Usually you would click on the RSS symbol on the website and then copy the URL from the address bar of your browser.

title = "Dokuwiki"

The title of the feed as you want it to appear in VA. It can be anything you want and doesn't have to match the name the site uses.

vapath = "VAOS/Dokuwiki"

VA will import feeds to the service Feeds in the node and folder given by vapath.
In this example "VAOS" is the node that in normal services might be named "Mail" or "Comp" (first part of the newsgroup name) and "Dokuwiki" is the folder name. You can use anything you like, but we need a full "node/folder" path and not just "node". You should only use ascii characters and numbers.
Attention: the path is not allowed to have spaces! If you want spaces use "_" (underscore) instead, this will be translated to a space in VA. You do not have to create any folders, the importer will do this for you! The only thing that needs to exist is the Feeds service as created according to the install instructions!
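
The vapath rules above can be checked before a feed block goes into feeds.ini. The following is only a sketch in Python, not part of the Feeder: check_vapath is a hypothetical helper, and it assumes exactly one "/" and only ASCII letters, digits and underscores, which is stricter than the text strictly requires.

```python
import re

# Hypothetical helper, not part of the Feeder. Assumption: exactly
# "node/folder", built from ASCII letters, digits and underscores.
VAPATH_RE = re.compile(r"^[A-Za-z0-9_]+/[A-Za-z0-9_]+$")


def check_vapath(vapath):
    """Return (node, folder) as they would be shown in VA, or raise ValueError."""
    if " " in vapath:
        raise ValueError("vapath must not contain spaces; use '_' instead")
    if not VAPATH_RE.match(vapath):
        raise ValueError("vapath must be a full 'node/folder' path")
    node, folder = vapath.split("/")
    # Underscores are translated to spaces in VA.
    return node.replace("_", " "), folder.replace("_", " ")


print(check_vapath("VAOS/Dokuwiki"))
print(check_vapath("My_Feeds/Daily_News"))
```

A value like "VAOS" (node only) or "VAOS/My Folder" (with a space) raises ValueError in this sketch.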

version = "rss1"

Known version strings are: rss1, rss2, atom and atom_2005. (RSS 0.9 should be given as "rss1".)

This is the feed version as given by the website. If the website doesn't tell you the version, use "View Source" on the feed page in your browser and look in the first few lines for the version. This is typically something like version="2.0".
Unfortunately, feed specifications are vague at some points and some feeds do not follow them at all, although they announce a specific version. So, the announced version string may not always work. Sometimes you will need to use a different version string than the one the feed announces, or the Feeder may need support for yet another format. There is a short list of the tags used by the different feed versions at the top of feeds.ini. If you want to be sure that the feed version will be okay for the Feeder, check whether the feed contains the tags listed there for that feed version.
Attention: if a site provides more than one feed format I suggest using the RSS 2.0 feed.

version_date = "rss1"

Known version_date strings are: rss1, rss2, atom, order, none.

This option is a result of the loose specification problem. Some feeds announce as RSS 2.0 but use date tags that are from RSS 1.0 or give no date at all. This option caters for this.
You can determine the date version tag of the feed the same way as the version: look with "View Source". Here are the tags that match the formats:

  • rss1 = dc:date
  • rss2 = pubdate
  • atom = issued, published, updated
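
Looking for these tags by hand can be replaced by a quick scan of the feed source. This is a minimal Python sketch, not something the Feeder provides: guess_version_date is a hypothetical name, and it simply reports the first tag family from the list above that it finds (it does not check that every article carries a date).

```python
import re

# Tag names are taken from the list above; the helper is hypothetical.
DATE_TAGS = {
    "rss1": ["dc:date"],
    "rss2": ["pubdate"],
    "atom": ["issued", "published", "updated"],
}


def guess_version_date(feed_source):
    """Return 'rss1', 'rss2' or 'atom', or None if no date tag is found."""
    lowered = feed_source.lower()  # RSS 2.0 usually writes <pubDate>
    for version_date, tags in DATE_TAGS.items():
        for tag in tags:
            # Match an opening tag such as <pubdate> or <updated ...>
            if re.search("<" + re.escape(tag) + r"[\s>]", lowered):
                return version_date
    return None


sample = "<item><title>x</title><pubDate>Mon, 01 Jan 2007 10:00:00 GMT</pubDate></item>"
print(guess_version_date(sample))
```

If the function returns None, the feed is probably undated.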

If you don't see any of these tags, the feed doesn't offer dates for its articles. (If you use Internet Explorer 7 you can easily see this without viewing the source by looking at the line following the green arrow. If that shows no date, then it's an undated feed.) In that case use one of these two:

  • order = the feed is reliably ordered by publishing date, but doesn't announce dates (use article_order = "asc" if the order is not descending, which is the standard). (This option is new as of version 0.92 and behaves like "none" in earlier versions)
  • none = the "newness" is determined by comparing the hashes of the last 50 article headers, use this if the feed order isn't reliable. (This option is new with version 0.92)

none will work with all feeds as it doesn't rely on dates or order but on the article titles. I recommend always using none for undated feeds. order may get removed from the Feeder in the future as it has offered no advantage over none since 0.92.
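
To make the idea behind none more concrete, here is a rough Python sketch of hash-based "newness" detection. It is not the Feeder's code: the function name, the use of MD5 and hashing the title alone are assumptions for illustration. Only the limit of 50 remembered headers comes from the description above.

```python
import hashlib

MAX_REMEMBERED = 50  # the description above mentions the last 50 article headers


def find_new_articles(titles, seen_hashes):
    """Return (new_titles, updated_hashes) for one check of a feed.

    Assumption: an article counts as new if the hash of its title is not
    among the remembered hashes. Feed order and dates play no role.
    """
    hashes = list(seen_hashes)
    new_titles = []
    for title in titles:
        digest = hashlib.md5(title.encode("utf-8")).hexdigest()
        if digest not in hashes:
            new_titles.append(title)
            hashes.append(digest)
    # Keep only the most recent hashes.
    return new_titles, hashes[-MAX_REMEMBERED:]


first, remembered = find_new_articles(["Article A", "Article B"], [])
second, remembered = find_new_articles(["Article C", "Article A", "Article B"], remembered)
print(first)
print(second)
```

In this sketch the second check reports only "Article C", wherever it appears in the feed.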

charset = "utf-8"

This is the character-set encoding that the articles (web pages) use, not the character-set used by the feed page. So, go to the feed page, click one of the article links and then determine the character-set, e.g. with IE you right-click on the page and look under "Encoding". Warning: some browsers (for instance IE) show a "beautified" character-encoding name and not the "real code". In that case you have to "View Source" and look in the head of the document for a tag like this
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
and use the code that is given after the charset attribute. Typical character-sets in Western Europe are either "iso-8859-1" or "utf-8", using the lower-cased version is preferred. ("iso-8859-1" is the equivalent of "Latin-1" or "Western European (ISO)".)
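
If you prefer not to search the page source by eye, the charset can be pulled out of the meta tag with a small script. This is a Python sketch only; find_charset is a hypothetical helper and the pattern is a simplification that covers meta tags like the one shown above.

```python
import re

# Simplified pattern: a <meta ...> tag that contains charset=<name>.
META_CHARSET_RE = re.compile(
    r"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def find_charset(html):
    """Return the lower-cased charset named in a meta tag, or None."""
    match = META_CHARSET_RE.search(html)
    return match.group(1).lower() if match else None


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /></head>'
print(find_charset(page))
```

The result is lower-cased because the lower-cased spelling is the preferred one for the charset option.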

fetch_links = 1

This tells the Feeder if it should fetch only the feed list or the articles (links) as well. Normally you will want to get the articles. However, if the feed contains direct links to binary files (e.g. our VA Wiki Download feed) this would download the whole file and try to parse it. You have to disable fetch_links for such a feed. You may also disable this option for feeds that don't work well, e.g. our VA Wiki Articles feed. See use_inline_content option below for other usage (get content instead of linked articles).
So, if this option is set to 0 the Feeder will check the feed but won't create anything (unless you use use_inline_content). But it will update the internal dates, when it was last checked etc. So, this is not the right option if you want to temporarily disable the feed, unless you want to disable and skip the articles that come in the meantime.

disabled = 0

If this is set to 1 the feed is completely disabled. It will not be checked for new articles and no internal data will be written. So, this is the right option if you want to stop reading a feed for a while and later want to continue where you stopped.
Attention: if you wait too long the list of articles in the feed may have "scrolled over", so you will nevertheless lose some articles! There is nothing we can do about that. Typically feeds contain a limited number of articles, e.g. 5, 10 or 15.
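
The interplay of disabled, fetch_links and use_inline_content, as described here and under the additional options, can be summarised in a few lines. This Python sketch only restates the text; feed_actions and its return keys are made up for illustration and say nothing about how the Feeder is implemented.

```python
def feed_actions(disabled, fetch_links, use_inline_content=0):
    """Summarise what one Feeder run does for a feed, per the descriptions."""
    if disabled:
        # Completely disabled: not checked, no internal data written.
        return {"check_feed": False, "update_internal_data": False, "creates": "nothing"}
    if fetch_links:
        creates = "linked articles"
    elif use_inline_content:
        creates = "inline content"
    else:
        creates = "nothing"
    # An enabled feed is always checked and its internal dates are updated.
    return {"check_feed": True, "update_internal_data": True, "creates": creates}


print(feed_actions(disabled=0, fetch_links=0))
print(feed_actions(disabled=1, fetch_links=1))
```

The first call shows why fetch_links = 0 is not a pause button: the feed is still checked and the internal dates move on, so the articles that arrive in the meantime are skipped.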

Remember, all options above are mandatory!
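
A quick way to catch a forgotten option is to scan the file for feed blocks and compare them against the list of mandatory options. The sketch below uses Python's configparser, which is an assumption for illustration only; the Feeder itself may read the file differently. missing_options is a hypothetical helper.

```python
import configparser

# The mandatory options described above.
MANDATORY = ["url", "title", "vapath", "version", "version_date",
             "charset", "fetch_links", "disabled"]


def missing_options(ini_text):
    """Return {feed block name: [missing option, ...]} for incomplete blocks."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = {}
    for section in parser.sections():
        if not section.startswith("Feed"):
            continue  # only feed blocks are checked
        missing = [key for key in MANDATORY if key not in parser[section]]
        if missing:
            problems[section] = missing
    return problems


SAMPLE = """[Feed VAOS Dokuwiki]
url = "http://dokuwiki.virtual-access.org/feed.php"
title = "Dokuwiki"
vapath = "VAOS/Dokuwiki"
version = "rss1"
version_date = "rss1"
charset = "utf-8"
fetch_links = 1
disabled = 0
"""

print(missing_options(SAMPLE))
```

For the complete block above the result is an empty dictionary; removing a line such as charset makes the block show up with that option listed.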

Additional options (non-mandatory)

article_order = "asc"

Known article_order strings: asc.

This option indicates that the feed is in non-standard order. Feeds are normally ordered descending by date, so that the latest article is on top. The order shouldn't matter for feeds that show a date for each article, but nevertheless, if you encounter a "dated" feed that is in ascending order, better use this option. Just look at the feed's page and check if the given publishing dates are ascending or descending (= standard). (Use Internet Explorer 7 for that, it shows dates in the line with the green arrow.)
It's the un-dated feeds with version_date = "order" that this option is important for. If you set version_date = "order" then you must determine the order in which the articles appear. If new articles appear at the top of the page then the feed is in standard order and you don't set this option. If new articles appear at the bottom of the page then the feed is in non-standard order and you have to set article_order = "asc". With non-dated feeds you have to click several article links and determine by yourself if the articles at the bottom are newer or older. This option is not relevant for version_date = "none" feeds.

create_summary = 1

This does nothing at the moment. It is meant to be used in conjunction with fetch_links = 0. In the future the Feeder will create a message ("summary") that shows the new articles in a clickable list, maybe with a short description, a date and such. So, basically something very similar to what you see as a feed list in your browser. You will then be able to use this list to click on the links and read the articles in your browser, download the files etc.

use_inline_content = 1

Some feeds provide the complete article in the feed list itself, in some sort of content tag. This option gets that content instead of fetching the link. It works only if fetch_links = 0. Using this option only makes sense if the feed provides well-formatted (laid-out) content in a content tag. If there is no content we fall back to description (which is usually not formatted). Normally you won't need any filters if you use this option.

id = 10

The id is used only for the testing modes at the moment because it was added to the code later. Nevertheless you should give each feed its own id. Ids should be numbers (integers). All testing modes specify the feed to be tested by id, which is much easier than typing a name.

Print Version (non-mandatory)

Many sites provide a print version of their single pages. The print version usually doesn't contain ads, navigation and other stuff we don't want. So, if there is a print version available, that's usually what we want to get instead of the normal web page. However, sites do not provide print version feeds. So, what we do is get the feed (= the article list), get the article URL and then translate this URL into the print version URL. Of course, this only works if there is some recognizable translation scheme. The feeds I'm fetching for myself all have very recognizable URL schemes for the print version. Using the print version is optional, but recommended if available. Example of a print version block:

print_version = 1
url_sample = "http://www.spiegel.de/wissenschaft/weltall/0,1518,466045,00.html"
url_pattern = "(http:\/\/www.spiegel.de\/\w+\/\w+\/0,\d+,)(\d{6,})(,00.html)"
url_replace_pattern = "$1druck-$2$3"

All options of this block (except for the sample) are mandatory and together form a print version translation block. The url_pattern is a Perl compatible regular expression that uses grouping, and the url_replace_pattern is a Perl compatible replace pattern using the usual group substitution variables $1 and so on.

  • print_version = 1: enables using the print version
  • url_sample = "http://www.spiegel.de/wissenschaft/weltall/0,1518,466045,00.html": only for reference, as it helps to see the original URL while constructing the regexp for it. It's not mandatory!
  • url_pattern = "(http:\/\/www.spiegel.de\/\w+\/\w+\/0,\d+,)(\d{6,})(,00.html)": regexp pattern that matches url_sample
  • url_replace_pattern = "$1druck-$2$3": regexp pattern used for replacement when translating to the print version URL

In this case it translates

http://www.spiegel.de/wissenschaft/weltall/0,1518,466045,00.html

to:

http://www.spiegel.de/wissenschaft/weltall/0,1518,druck-466045,00.html

$1, $2 and $3 are the groups matched by the three bracketed groups in the url_pattern. So, what we do is insert the string "druck-" into the URL.
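
You can try a print version pattern outside the Feeder before putting it into feeds.ini. The sketch below does this in Python, which is an assumption for illustration only. Python writes group references as \g<1> rather than $1, so the hypothetical helper to_print_url converts the replace pattern first.

```python
import re

url_sample = "http://www.spiegel.de/wissenschaft/weltall/0,1518,466045,00.html"
url_pattern = r"(http:\/\/www.spiegel.de\/\w+\/\w+\/0,\d+,)(\d{6,})(,00.html)"
url_replace_pattern = "$1druck-$2$3"


def to_print_url(url, pattern, replace_pattern):
    """Apply a Perl-style ($1, $2, ...) replace pattern to a URL."""
    # Convert $1 to \g<1> and so on, the group syntax Python's re module uses.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replace_pattern)
    return re.sub(pattern, python_replacement, url)


print(to_print_url(url_sample, url_pattern, url_replace_pattern))
```

This prints http://www.spiegel.de/wissenschaft/weltall/0,1518,druck-466045,00.html. A URL that does not match the pattern comes back unchanged, which is a useful sign that the pattern needs more work.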

Filters (non-mandatory)

Many article pages contain advertisements and other content that is not wanted or just looks bad if viewed as a message. Filters are there to filter in what you want and filter out what you don't want. Filters are optional, but I strongly recommend using at least the simple_include filter if there is no print version! With print versions you usually don't need a filter, but even then sometimes it may be necessary to strip some parts out. There are two types of filters: simple and regexp. Simple and regexp filters cannot be combined, you have to use either the one or the other. Print version and one of the filters can be combined.

simple_include and simple_exclude are strings that have to match literally in the source code of the web page. First the simple_include is applied, then the simple_exclude (if there is one). A simple_exclude without a simple_include is not possible. You can use the simple_include to cut out the part of the page that you want to get (thus removing navigation, header, footer etc.) and then the simple_exclude to remove unwanted content from that excerpt, for instance an advertisement that gets embedded in the middle of an article.

filter_regexp also has an include and an exclude variant and can be used the same way as the simple filter. The important difference is that it uses a Perl compatible regular expression. filter_regexp is the recommended way of filtering for experts as it provides more fine-grained matching.

It's not easy to define correct filters, so you may get unwanted content with your first tries before you get it right. That is why I provide test_feed.php and test_regexp.php. Check the Testing page for details. I strongly recommend using test_feed.php for each new feed you add!

Apart from these filters the Feeder also does some automatic filtering that cannot be switched off. In particular, it removes any scripting, image tags, iframes and some other unwanted or undisplayable content.

simple filter

There are two types of simple filters that complement each other: simple_include and simple_exclude.

Example of a simple_include filter block (all three lines have to be present if the filter is used):

simple_include = 1
simple_include_start = "<!-- Text -->"
simple_include_end = "<!-- news-steuerung ende -->"

  • simple_include = 1: enables the filter
  • simple_include_start = "<!-- Text -->": the start string/sequence
  • simple_include_end = "<!-- news-steuerung ende -->": the stop string/sequence

No wildcards or the like are allowed; you have to use the literal string as it appears in the source of the article! That's why it's called simple.
Attention: a " (double quote) inside the string has to be written as a "QUOTE" sequence instead (including the double quotes before and after the word QUOTE)!
In other words, each double quote becomes: double quote, QUOTE, double quote. Single quotes are probably not allowed either, so don't use them!

simple_include_start = "<div id="taskdetails">" has to be translated to:
simple_include_start = "<div id="QUOTE"taskdetails"QUOTE">"
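
Doing this translation by hand is error-prone for longer strings, so here is a small Python sketch that does it for you. to_ini_value is a hypothetical helper, not part of the Feeder; it replaces every double quote with the "QUOTE" sequence and then wraps the result in double quotes.

```python
def to_ini_value(literal):
    """Turn a literal string from the page source into a feeds.ini value."""
    # Each double quote becomes: double quote, QUOTE, double quote.
    return '"' + literal.replace('"', '"QUOTE"') + '"'


print("simple_include_start = " + to_ini_value('<div id="taskdetails">'))
```

For the example above this prints the translated line exactly as shown.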

The Feeder will search the fetched article for the start and end sequence and use only what's in between. You have to look in the source of the article again and find unique identifier strings for the start and end. The Feeder will look for the first start, then for the first end, and will use what lies between them. If it cannot find them it will use the whole article. Sites often use HTML comments (e.g. "<!-- headline starts here -->") to identify parts of the document, so look for comments that are very near to the start and end of the real article. If you can't use comments, try to identify other unique things before and after the article, for instance "<div id="content">" before an article and "<div id="footer">" after an article and before the page footer. It should be clear that you cannot use something like "</div>" as an end because there may be dozens of these in the document!
So, simple_include should give you a first approximation of what you really want to see. If that satisfies you, you are done here and don't need an exclude filter.
But: sites love to embed advertisements in the middle of articles so they can be sure you see them. The simple_exclude filter is just for that. It will only be used (if I recall correctly :-) if you first use the simple_include filter to grab the base article. Now you define another pair of start and end strings *within* the base article, and the Feeder will look in the excerpt we got from the include filter and try to cut out the unwanted embedded part. It's used in the same way as the simple_include filter and with the same kind of strings.

simple_exclude = 1
simple_exclude_start = "cadv>"
simple_exclude_end = "cadv>"
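
The way the two simple filters work together can be sketched in a few lines of Python. This is an illustration of the description above, not the Feeder's code: the function names are made up, and whether the start and end strings themselves are kept or dropped is an assumption (here they are dropped).

```python
def simple_include(page, start, end):
    """Keep only what lies between the first start and the first end after it."""
    s = page.find(start)
    if s == -1:
        return page  # nothing found: use the whole article
    e = page.find(end, s + len(start))
    if e == -1:
        return page
    return page[s + len(start):e]


def simple_exclude(excerpt, start, end):
    """Cut the part from start to end (inclusive) out of the excerpt."""
    s = excerpt.find(start)
    if s == -1:
        return excerpt
    e = excerpt.find(end, s + len(start))
    if e == -1:
        return excerpt
    return excerpt[:s] + excerpt[e + len(end):]


page = ("<html>navigation<!-- Text -->Hello "
        "<div class=ad>buy now</div>world<!-- end -->footer</html>")
article = simple_include(page, "<!-- Text -->", "<!-- end -->")
article = simple_exclude(article, "<div class=ad>", "</div>")
print(article)
```

The include step strips navigation and footer, and the exclude step then removes the embedded advertisement from what is left, so the example prints "Hello world".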

regexp filter

The regexp filter uses Perl compatible regular expressions for matching, so it doesn't need separate start and stop sequences. You can use ( ... ) brackets (grouping) to determine the exact part of your pattern that should be excerpted. Besides using regular expressions there is an important difference to the simple filter: the exclude filter is applied even if there is no include filter. So, in this respect both regexp filters are optional, you can either use one of them, both or none. Example of a regexp filter block:

filter_regexp = 1
filter_regexp_include = "(<HEISETEXT>.*)<!-- IVW-Pixel -->"
filter_regexp_exclude = ""

  • filter_regexp = 1: enables using the regexp filter
  • filter_regexp_include = "(<HEISETEXT>.*)<!-- IVW-Pixel -->": the regular expression for getting the excerpt ( ... )
  • filter_regexp_exclude = "": the regular expression for removing unwanted stuff (in this case: none)

As with simple filters, you have to write each " (double quote) as "QUOTE" instead! Look at feeds.ini for examples!
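
To get a feeling for how an include and exclude pattern behave before using test_regexp.php, you can experiment with them in any regexp engine. The Python sketch below is only an approximation: regexp_filter is a hypothetical function, and both the "dot matches newlines" flag and the use of the first bracketed group as the excerpt are assumptions, not confirmed Feeder behaviour.

```python
import re


def regexp_filter(page, include="", exclude=""):
    """Apply an optional include pattern, then an optional exclude pattern."""
    text = page
    if include:
        match = re.search(include, text, re.DOTALL)
        if match:
            # Use the first ( ... ) group as the excerpt if there is one.
            text = match.group(1) if match.groups() else match.group(0)
    if exclude:
        # The exclude filter is applied even without an include filter.
        text = re.sub(exclude, "", text, flags=re.DOTALL)
    return text


page = "header<HEISETEXT>Line one\nLine two<!-- IVW-Pixel -->footer"
print(regexp_filter(page, include="(<HEISETEXT>.*)<!-- IVW-Pixel -->"))
```

With the example pattern the excerpt starts at <HEISETEXT> and stops right before the IVW-Pixel comment.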


More (working) examples for feed configuration can be found in the feeds.examples.ini file. You can share your own feeds if you like.

vaosfeeds/feeds.ini.txt · Last modified: 01.07.2007 18:27 by kai