Tuesday, March 26, 2013

Much ado about backfilling and importing

First, just a couple of clarifications on terminology here. Backfill refers to manually downloading older header information to generate releases, whereas import refers to creating releases from a batch of existing NZB files.

Post processing:

The post processing stage of NN is where almost all useful information about a particular release is generated; however, for the sake of simplicity I will only be discussing the "Additional PostProcessing" stage, which by default will say "PostPrc : Performing additional post processing on last x releases". Lookups and NFO processing are handled in a different part of post processing.

By default, this stage is limited to 100 releases processed per run of update_releases, and it handles deep RAR inspection (including password checking and RAR contents) as well as the generation of mediainfo and ffmpeg previews.

This limit can be increased, but doing so will increase the time that update_releases takes to run, and may delay the grabbing of new releases more than you'd like.
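
If you want to see where that limit actually lives, something like this will find it (a minimal sketch, assuming newznab is installed in /var/www/newznab and the batch size is still controlled by $numtoprocess in www/lib/postprocess.php, which I'll come back to in the import section):

    # Show the current "additional post processing" batch size.
    # The install path is an assumption; adjust it to your setup.
    grep -in 'numtoprocess' /var/www/newznab/www/lib/postprocess.php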


So why am I talking about this?

It's important to know this in order to understand why it's a bad idea to import 200,000 NZBs at once, or to backfill 100 days all at once. Apart from the brutal database bloat and the potential for MyISAM crashes, you will be left with a huge backlog of post processing.

How does backfill work?

Backfill can be done in two different ways: either by backfilling a certain number of target days, or by backfilling to a specific date. In your database, each group has a first post and a last post value which is unique to your particular Usenet Service Provider (USP).

The backfill process will first determine the target post number for your specified target day number (x days back), or for the target date specified if you're using backfill_date. Then it will download the headers for all articles between that target post number and your "first post" for each group.

For example, if your first post on a.b.somegroup is 123456 and you have a backfill target of 10 days, it will first look up which post number corresponds to 10 days ago (let's say it's 100000 in this case), then proceed to download headers for posts 100000 through 123456.
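
If you're curious what those values look like for your own groups, a quick query will show them (a rough illustration, assuming a stock newznab database where the groups table stores these as first_record, last_record and backfill_target; credentials, database name and column names may differ on your setup):

    # Inspect one group's post numbers and current backfill target.
    mysql -u nnuser -p newznab -e \
      "SELECT name, first_record, last_record, backfill_target FROM groups WHERE name = 'alt.binaries.somegroup';"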

How should I backfill?

There are a couple of decent ways to backfill, but first let's discuss what you probably shouldn't do.

Don't set the backfill target on a bunch of groups to your final goal all at once, especially if it's a couple hundred days. You will wind up with an absolutely giant, bloated database, and if you're running MyISAM, it will almost certainly crash. The same goes for using backfill_date: if you specify a date that's 100 days ago, you're gonna have a bad time.

In order to do this properly, there are a couple of good options. First, if your server has the horsepower and you want to use it, check out jonnyboy's tmux suite, which will properly handle backfilling (among other things) without completely bloating your database, and which parallelizes processing of releases to speed things up. However, if you are like me and running on some older spare hardware, you may want to use my safebackfill.sh script. For both scripts, make sure to fully read the README on their respective GitHub pages.

Both of these scripts will backfill incrementally until they hit your final target, with the tmux suite offering two different methods for backfilling, and safebackfill relying on the existing backfill.php.
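
To give you an idea of what "incrementally" means here, the loop below shows the general shape of the approach. This is only a sketch, not either of the actual scripts; it assumes newznab lives in /var/www/newznab, that your groups table has active and backfill_target columns, and that you fill in your own MySQL credentials.

    #!/bin/sh
    # Sketch of an incremental backfill: bump the target one day at a time,
    # then let update_releases catch up before going further back.
    NNPATH=/var/www/newznab/misc/update_scripts
    FINAL_TARGET=100    # the total number of days you eventually want

    CURRENT=$(mysql -N -u nnuser -ppass newznab -e \
      "SELECT MIN(backfill_target) FROM groups WHERE active = 1;")

    while [ "$CURRENT" -lt "$FINAL_TARGET" ]; do
        CURRENT=$((CURRENT + 1))
        # Push every active group's target back one more day...
        mysql -u nnuser -ppass newznab -e \
          "UPDATE groups SET backfill_target = $CURRENT WHERE active = 1;"
        # ...grab the headers for that extra day...
        php "$NNPATH/backfill.php"
        # ...and post process what was just created before going any deeper.
        php "$NNPATH/update_releases.php"
    done

The real scripts do considerably more than this (sanity checks, locking, per-group handling), which is why I recommend using them instead of rolling your own.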


How should I import?

As already mentioned, I don't really recommend importing a gigantic batch all at once.

Again, jonnyboy's suite as well as my own import script will allow you to import a desired number of NZBs per loop, and will properly process them before importing any more.

The main idea here is to keep the number of NZBs you import per loop equal to the $numtoprocess value in www/lib/postprocess.php, so you don't fall behind on post processing.
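
Again, just to illustrate the idea rather than replace the scripts, a batched import loop looks roughly like this. It's a sketch built on a few assumptions: your NZBs are sitting in /path/to/nzb_dump, the stock importer is at misc/testing/nzb-import.php in your install (its location and arguments can differ between newznab versions, so check the script first), and 100 matches your $numtoprocess value.

    #!/bin/sh
    # Sketch of importing NZBs in post-processing-sized batches.
    NNPATH=/var/www/newznab
    DUMP=/path/to/nzb_dump       # where all your NZBs currently sit
    STAGE=/path/to/nzb_batch     # staging directory for one batch at a time
    BATCHSIZE=100                # keep this equal to $numtoprocess

    while [ "$(ls "$DUMP" | wc -l)" -gt 0 ]; do
        # Move the next batch into the staging directory...
        ls "$DUMP" | head -n "$BATCHSIZE" | while read -r f; do
            mv "$DUMP/$f" "$STAGE/"
        done
        # ...import just that batch...
        php "$NNPATH/misc/testing/nzb-import.php" "$STAGE"
        # ...then run update_releases so post processing keeps pace with imports.
        php "$NNPATH/misc/update_scripts/update_releases.php"
    done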


Overall, as long as you're patient and aren't expecting to have millions of releases overnight, you should have no problems populating your database.

-Thracky

3 comments:

  1. These are great posts. I hope you keep them coming. I've got it bookmarked.

  2. Hi Thracky, great script! I have one issue which I just cannot seem to fix.

    The script runs perfectly in Screen up to the point where it needs to update the MySQL database to increase the backfill days by 1; then I just get an error:

    safebackfill.sh: line 82: mysql: not found

    I have triple checked the login details, and the database is on the same machine, so using localhost works for everything else that communicates with the database.

    This is on a Synology 412+. Would you have any ideas as to the problem?

    Thanks

    Replies
    1. Hi Michael,

      I'm not sure how the path setup is on the Synology boxes, but it would indicate to me that mysql is not in your path.

      First, from a terminal, try "echo $PATH", and if the mysql binary is not directly in one of those directories, you need to either add the appropriate directory to your path or give the full path when the script calls the mysql command line.

      So on lines 50 and 81 of the safebackfill script, you would change "mysql" to /full/path/to/mysql.
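
      For example, something along these lines should sort it out (the path here is just a guess for DSM, so locate the binary first):

          # Find where the mysql client binary actually lives on the box...
          find / -name mysql -type f 2>/dev/null
          # ...then either add that directory to PATH before running the script:
          export PATH="$PATH:/usr/syno/mysql/bin"
          # ...or call it by its full path inside the script instead of plain "mysql".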

      Hopefully that helps!
