Cron job on Mac OS X that saves files to Dropbox

My work machine is on all the time, and I figured it might work well for simple tasks like text file processing:
– Take input file
– Do something with it (add information, transform text, add columns, etc)
– Save file (either overwrite old one, or create a new output file)

I used Dropbox's shared folder functionality to let multiple people submit their files for processing. Your Mac will see those shared folders as local, and if you set up a cron job, you're pretty much done.

Here are some steps that can help:

  • Write your script
  • Create a folder on Dropbox. Share it with people who will need access
  • Edit your crontab. Here's a great detailed blog post on how to do this
  • Confirm it's working
  • Sit back and relax, as your work is done :)

Note: remember to use absolute paths for your scripts, and if you're passing file/folder paths as parameters, make sure those paths are also absolute.

Here's my example:

0,30 8-19 * * 1-5 python /Users/[name]/Dropbox/ 

This runs the script every half hour between 8am and 7pm, Monday through Friday, and processes all .csv files in the designated Dropbox folder.
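Since the crontab entry hands the script a folder, the script needs to find the waiting files itself. Here's a minimal sketch of that lookup (the function name is mine, not from the original script):

```python
import glob
import os


def csv_files(folder):
    """Return the paths of all .csv files in the folder,
    sorted so cron runs are deterministic."""
    return sorted(glob.glob(os.path.join(folder, '*.csv')))
```

Passing the folder instead of a shell glob also avoids surprises when no files are waiting.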

And here's a simplified Python script that some may find helpful.

import csv
import os.path
import sys

for filename in sys.argv[1:]:
    # Skip my example file and anything with _out in the name,
    # since this script creates those output files itself
    if 'example.csv' in filename or '_out' in filename:
        continue

    out_filename = filename.replace('.csv', '_out.csv')
    # Also check if the script already ran and an _out.csv
    # file is saved in the folder
    if os.path.isfile(out_filename):
        continue

    with open(filename, 'rU') as infile, open(out_filename, 'ab') as outfile:
        reader = csv.reader(infile, delimiter=',', quotechar='"')
        writer = csv.writer(outfile, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for rowcount, row in enumerate(reader):
            if rowcount == 0:
                continue  # skip the header row
            # do your file processing here
            writer.writerow(row)
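The skip rules are easy to get backwards, so here's a small sketch that isolates them for testing (function names are mine, not from the original script):

```python
def output_name(filename):
    """Derive the output name the script writes: foo.csv -> foo_out.csv."""
    return filename.replace('.csv', '_out.csv')


def should_process(filename, existing_files):
    """Mirror the script's rules: skip the example file, previous
    output files, and any file whose output already exists."""
    if 'example.csv' in filename or '_out' in filename:
        return False
    return output_name(filename) not in existing_files
```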

Sometimes the solution is easier than you think

I’m lazy. But in a good way. The way that makes you simplify, automate, and get rid of unnecessary work.

Here’s a quick example. Our insights team had a task at hand that could not be done manually. Writing a script to do it took half an hour. They were happy. Once in a while they would send me a file to be processed, I’d run a simple command, and the script would spit out a file with the results, which I’d send back.

But that’s too much work. How do I remove myself from the picture and let them handle it all? I figured I’d use Dropbox: users drop their files into a shared folder, the script checks every so often whether anything new was placed there, then processes the files and saves the results.

Somehow in the beginning I got too entangled in details, thinking about a server where I’d put the script (probably Heroku), then having to add Dropbox API integration, then making sure all the dependencies were installed… Then it occurred to me: my work machine already sees the Dropbox directories as local folders. Why not just run the cron job on my work machine and be done with it?

So with a little script tweaking, instruction writing and cron job testing, this is done, and I’ve just removed a task from my list (however simple it might be).

MongoDB explain command to show all execution plans

So the MongoDB courses are almost finished, and both the MongoDB for Developers and MongoDB for DBAs tracks are in the finals phase.

One of the questions is about indexes: you’re given the list of indexes that exist for a collection and an example query, and you have to identify which indexes could potentially be used to optimize that query.

I had my ideas for the answer, but to be sure I wanted to create a sample collection, add indexes and view all execution plans. It is in fact doable, if you pass a parameter to the “explain()” command:

//this also worked in version 2.2 (pass true for verbose output):
db.collection.find(your_query).explain(true)

The output will include a field called “allPlans”, which is an array of all possible combinations of indexes that could be used. Neat.
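If you capture the explain output in a variable, you can also pull the candidate plans out programmatically. A Python sketch, assuming you have already fetched the explain document (the sample document below is made up, but mirrors the 2.x explain shape with its "cursor" and "allPlans" fields):

```python
def candidate_indexes(explain_doc):
    """List the cursor used by each plan the optimizer considered."""
    return [plan.get('cursor') for plan in explain_doc.get('allPlans', [])]


# A made-up explain(true) result for a query on fields a and b
explain_doc = {
    'cursor': 'BtreeCursor a_1_b_1',
    'allPlans': [
        {'cursor': 'BtreeCursor a_1_b_1'},
        {'cursor': 'BtreeCursor a_1'},
        {'cursor': 'BasicCursor'},  # the no-index fallback plan
    ],
}
```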

Customized Wufoo forms inside a WordPress site

I’m a huge fan of Wufoo. They made the annoying parts of form building (validation, confirmation emails, reports) super easy. Anyone can build a form in a matter of minutes and run with it! Their system also supports payment integration with Paypal, and even Stripe. So it’s a no-brainer: use Wufoo and save yourself a ton of time.

I had to build a form for a friend’s website (running hosted WordPress), where users can order items without actual shopping cart functionality. So I went ahead, set up the Wufoo forms, customized the fields and design, and embedded the form via Javascript. All seemed good.

However, the order process turned out to be very custom (and some of the requirements changed), and as much as I tried to bend the form within Wufoo to do what I needed, it just wasn’t enough: there were a lot of custom calculations in Javascript, and that required re-building the form using the API. So I wanted to share some tips on customizing a Wufoo form within WordPress.

First, set up a new page template in WordPress as you normally do.

Then, create your form in Wufoo, so all the form and field information is captured and saved within the Wufoo system. You will be using the API to pull field names and such.

Next, in the Wufoo dashboard, click the “Code” link, which will take you to the code manager screen. In the top right corner you’ll see the “API information” button; clicking it will give you the API key. Copy it.

In your WordPress template, include links to the Wufoo CSS and Javascript files, which should look like this:

<link href="https://[your_account_name].wufoo.com/forms/css/index.XXX.css" rel="stylesheet">
<script src="https://[your_account_name].wufoo.com/..."></script>

The Wufoo API will give you form/field data as either XML or JSON. I prefer JSON, and conveniently enough, PHP has json_encode() and json_decode() functions, which will basically do the parsing for you.

Here’s a snippet of code that will get you form data, such as form name, ID, description:

// Get form data from the Wufoo API
$url = 'https://[your_account_name].wufoo.com/api/v3/forms.json';

$ch = curl_init($url);

curl_setopt($ch, CURLOPT_USERPWD, 'YOUR_API_KEY:footastic');
// keep 'footastic' - it's actually what the API expects, hehe

// make curl_exec() return the response instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$json = curl_exec($ch);
curl_close($ch);

// Setting the 2nd parameter to 'true' will create an assoc. array.
// If you omit it, the result will be returned as an object
$result = json_decode($json, true);

// You can use print_r($result) to check what results you're getting

// now you're ready to move through your array and get values
$form_name = $result["Forms"][0]["Name"];
$form_desc = $result["Forms"][0]["Description"];

To get fields, you will perform a similar operation:

// Get field data from the Wufoo API
$url = 'https://[your_account_name].wufoo.com/api/v3/forms/[form_id]/fields.json';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERPWD, 'YOUR_API_KEY:footastic');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);
$result = json_decode($json, true);
$fields = $result['Fields'];

Now that you have fields, you can create your page output as you desire, add custom functionality, and don’t forget to test.

This still seems like coding a good chunk of functionality, right? Well, yes, but you don’t actually need to do validation, reporting, payment processing and such, so it’s still less than you would normally do if you coded everything from scratch. Also, admins will make updates right in the Wufoo dashboard, so they won’t be touching the code or asking you to change field copy. As long as they save their changes, those will be dynamically pulled into the form on WordPress.

WordPress: Exclude directory from URL rewrite with .htaccess

As you know, WordPress uses .htaccess to rewrite URLs so they become nice and SEO-friendly.

Sometimes though, you might want to use a directory within your web root that can be accessed directly.

To do this, open your .htaccess file and add a line right under the first RewriteRule instruction:

 RewriteCond %{REQUEST_URI} !^/(mydir|mydir/.*)$

Replace “mydir” with your actual directory name.

So your full .htaccess instruction set will look like this:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_URI} !^/(mydir|mydir/.*)$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

Now you should be able to access that directory directly.
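You can sanity-check the exclusion pattern without touching Apache; here's a quick check of the same regex using Python's re module (the URIs are examples):

```python
import re

# The pattern from the RewriteCond above; URIs that match it are
# excluded from the WordPress rewrite and served directly.
excluded = re.compile(r'^/(mydir|mydir/.*)$')


def skips_wordpress(uri):
    """True if the URI bypasses the WordPress front controller."""
    return excluded.match(uri) is not None
```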

10gen MongoDB online courses for DBAs: Replica sets

I’ve already written about the excellent MongoDB online courses that launched a few weeks ago. I signed up for both the developer and DBA tracks, and even though I knew they would be super useful, let me tell you now that I’m in week 4 and the value is immense, with all the detailed video lectures, quizzes and homework, especially considering that the courses are absolutely free!

Week 4 in the DBA track is about replica sets, and even though I did some projects with Mongo before, they weren’t of a scale that required replica sets. That’s why this week’s content was especially interesting to me. There’s one quick tip that I’ve learned and wanted to share.

When initiating and configuring a replica set, run the initiation from a single member only. I thought that I had to connect to each instance and initiate the set on each, but that’s not the case, and I ended up with two primaries instead of one.

So the steps would be (example assumes all 3 instances on localhost):

  1. Create data directories for each instance:
    mkdir -p /db/data/1
    mkdir -p /db/data/2
    mkdir -p /db/data/3
  2. Start 3 instances of mongod – they will be the members of your set:
    mongod --port 27001 --dbpath /db/data/1 --replSet rs0
    mongod --port 27002 --dbpath /db/data/2 --replSet rs0
    mongod --port 27003 --dbpath /db/data/3 --replSet rs0
  3. Connect to the first instance only:
    mongo localhost:27001
  4. Initiate your replica set:
    rs.initiate()
  5. Optional: check your configuration with rs.conf()
  6. Add the two other members to your replica set:
    rs.add("localhost:27002")
    rs.add("localhost:27003")
  7. Done! You can check your set by running rs.status()
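Instead of adding members one by one, rs.initiate() also accepts a full config document. Here's a Python sketch that just builds a document of that shape for the three local instances above (built in Python purely for illustration; in the mongo shell you'd type the equivalent JSON):

```python
def replset_config(name, ports, host='localhost'):
    """Build a replica set config document of the shape rs.initiate()
    accepts: an _id matching the --replSet name, plus one entry per member."""
    return {
        '_id': name,
        'members': [{'_id': i, 'host': '%s:%d' % (host, port)}
                    for i, port in enumerate(ports)],
    }
```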

Man, it’s so cool that a concept as crucial as database replication is implemented so gracefully in Mongo, and you can learn it and do it yourself!

Meetup recap: scaling AppNexus

Just got back home from a meetup hosted at AppNexus, by AppNexus.

The topic of today’s talk was scaling the complex ad serving and bidding system that supports billions of ad impressions and stores terabytes of data. I have to admit, it was one of the most technical meetup presentations that I’ve attended, and it was excellent!

Mike, the CTO of AppNexus, is a fast talker. He started with the story of how the company was born, and covered everything from building the company up and adding functionality (the ad bidding system was added when the ad-selling business demanded it), to hard drive performance, to the build vs. buy vs. outsource debate.

I especially liked the overview of the data warehousing and crunching tools: Netezza, Vertica and Hadoop, and how changing requirements dictated the choice of tools. Netezza worked great as a single instance but did not scale with clustering; Hadoop was very hard to learn and configure from scratch, but meets most of the needs now and will suffice for the next 2 years.

There was a good question from the audience about ideas for startups: what are the technological pain points and gaps that need to be filled? Interestingly enough, Mike named monitoring as one of the areas that lacks a great tool. Someone next to me mentioned New Relic, but I was wondering if that’s enough to monitor thousands of servers.

So for me, this talk was full of information on areas that I had little knowledge in, and it’s always great to see who develops breakthrough solutions in technology and how they solve problems (big data problems are very very interesting).

Another bit that got me wondering, since I’m into MongoDB lately, was when to use Mongo vs Hadoop. And sure enough, I found a really good deck from 10gen with not only the answer, but also great practical demos. Yay for the internet and sharp minds in NY tech community! I feel proud to live in this huge tech hub, and humbled because there are just so many things yet to learn.

Fixing encoding issues on WordPress (comments display as question marks)

One of my dear friends in Russia subscribes to and reads my blog :) How cool is that?

Well, we also discovered that the comments she posts in Russian (using Cyrillic characters) don’t get saved properly into the database, and are displayed as a bunch of question marks: ‘??????’

As I suspected, the character encoding was not set right. Older versions of WordPress had latin1-swedish collation as the default (weirdly enough). The latest version has this corrected: the default is utf8, and you can also specify it in settings in the wp-config.php file (as described in this Codex article):

define('DB_COLLATE', 'utf8_general_ci');

To fix this issue on existing installations, open your MySQL console (command line, MySQL Admin or PhpMyAdmin), backup your database (just in case), and follow these steps:

  1. Starting from the top, the database level, let’s make sure that the database encoding is correct:

    ALTER DATABASE my_wp_database CHARACTER SET utf8;
  2. While this applies to all new tables within this database, it does not change the encoding of existing tables. That’s why we’ll need to go a bit deeper, to the table level.

  3. Table level: make sure your table collation is set correctly:

    ALTER TABLE wp_comments CHARACTER SET utf8;

  4. Again – this change will be effective for all new columns within this table. For existing columns, we’ll need to go down one more level.

  5. Column level: set the collation on individual columns within your table.

    ALTER TABLE wp_comments CHANGE comment_content comment_content LONGTEXT CHARACTER SET utf8;

    ALTER TABLE wp_comments CHANGE comment_author comment_author LONGTEXT CHARACTER SET utf8;

  6. This will not change your existing data, but going forward, your comments will be saved and displayed with correct encoding.
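The column-level statements all follow one pattern, so if you have more columns to convert, you can generate them. A small sketch (the table and column names are the ones from the steps above):

```python
def alter_column_charset(table, column, coltype, charset='utf8'):
    """Build the ALTER TABLE ... CHANGE statement used in step 5."""
    return ('ALTER TABLE %s CHANGE %s %s %s CHARACTER SET %s;'
            % (table, column, column, coltype, charset))
```

For example, alter_column_charset('wp_comments', 'comment_author', 'LONGTEXT') produces the second statement from step 5.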

Source: Codex article on converting database character sets

Free MongoDB online courses – developer and DBA tracks

I tweeted about the new MongoDB online education portal when it was announced, and signed up for both the DBA and developer tracks (mostly because I’m curious; even if I don’t become a Mongo DBA, it’s still free knowledge!).

The courses are available online and you can still sign up:

As a reminder, your friends can still join the course. We will be dropping the lowest week’s homework grade in the calculation of your score so students who join can still receive a certificate of completion.

I hope that my NYC peeps are all safe and sound after Sandy hit, and that those still with power and internet can take this chance to learn something new.

Startup infrastructure talk by Paul Hammond

Earlier this week Etsy hosted an excellent talk by Paul Hammond on scaling Typekit.

If you don’t know, Typekit is a service that lets you use beautiful fonts on your website. They provide the fonts and hosting, and were acquired by Adobe in late 2011.

Here’s a summary of the main points from the talk:

Running a startup looks like this: Wallace trying to lay tracks in front of the moving train. Things change quickly, and you have one task at hand: make sure your site is up and running.

*I love Wallace and Gromit so I had to borrow and include this pic as well :) Hope Paul doesn’t mind

The life of startup has 3 types of endings:
– Become profitable
– Get acquired (for technology, talent or product)
– Fail

Scaling a startup can be summed up in 3 simple steps:
– Find your biggest problem
– Fix it
– Repeat

There’s one main rule when you’re running a startup: don’t run out of money.
And one thing that requires the most money is human talent and time – don’t waste it. Outsource/buy everything that saves you time.

In the beginning, it doesn’t really matter what you use, as long as you get the job done. Paul calls this the “minimum viable infrastructure”.

Minimum viable infrastructure looks like this:
– Source control = Git
– Configuration management = Rsync and bash
– Servers = EC2
– Backups = via command line with S3cmd
– Monitoring = Pingdom

If you get a chance, definitely go see Paul’s presentation in person; he’s great. The full deck from his talk is available on his site.