Topic: Nokogiri/Sanitize scraping advice


I'm trying to extract only "human language" text from pages. I thought Nokogiri's inner_text or Sanitize.clean() might work, but my results still contain lots of words like var, document, timer, window, and function (i.e. "code" words, not human-language words).

Am I going about this the wrong way, or just using the wrong Nokogiri/Sanitize functions?

I have looked at tf-idf, but I thought there might be an easier way to remove these words, which appear to be coming from inline code or something.

Last edited by asfarley (2011-09-09 12:44:58)