Topic: Scraping blog posts

Hi!

On a site that I am building, users can write blog posts. Since some of them already have blogs elsewhere (Blogspot, WordPress, ...), they obviously would not want to write everything twice just to have it displayed on my site. I would like to let the user either write the post or provide a link to an existing one, which would then be scraped to populate the right fields (title, content and publish date). I thought I could do this fairly simply with Nokogiri, but as soon as a blogger has customized their theme, the markup is completely different. For example, I checked two WordPress blogs: on the first, the post title was a link with a class of 'entry-title'; on the second, it was just an h2 tag. So I cannot simply take the contents of predefined CSS selectors.
Is there something that can be done here?

Thank you in advance

Re: Scraping blog posts

http://web-harvest.sourceforge.net/

Re: Scraping blog posts

Well, this is a tool of some kind, but it doesn't really answer what I was asking.
Anyway, I have now decided to do it differently. When someone provides a link, I will extract the RSS feed link from the page and then look for the entry with the matching post title using Feedzirra. This raises the following questions:

-Can I somehow fetch just the entry with a certain title, or do I have to fetch all entries every time and then loop through them to check the title?

-I would like the user to enter only the post's URL, so I was thinking of extracting the title from the URL (since these blogs usually use a format like blogger.domain.com/year/month/day/formatted-blog-title.html). If I parse the URL and take the path, I can extract the title with a regex and replace the dashes with spaces, but dashes are not the only thing that gets changed. For example, if the title of a blog entry is 'About the weather...', the link will be 'about-weather.html', so the three dots and the word 'the' were omitted. How do these links get generated, so that I could reverse the process?

-is there a better way of doing this?
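
To make the second question concrete, here is a minimal Ruby sketch of the URL-to-title idea. It uses only the standard library. The helper names (slug_from_url, normalize, match_title), the STOP_WORDS list, and the example.com URL are all hypothetical and chosen for illustration. Rather than trying to reverse the slug, it normalizes both the slug and each candidate title the same way and compares the results. It assumes the platform lowercases the title, drops punctuation, and omits a few short words; the real rules differ between blog engines and are not known here.

```ruby
require "uri"

# Hypothetical stop-word list. Real blog platforms may drop other words,
# or none at all, so treat this as an assumption to tune per platform.
STOP_WORDS = %w[a an the].freeze

# Take the last path segment of the post URL and strip its extension,
# e.g. ".../2010/05/14/about-weather.html" becomes "about-weather".
def slug_from_url(url)
  path = URI.parse(url).path
  File.basename(path, ".*")
end

# Reduce a title or a slug to a comparable list of lowercase words:
# anything that is not an ASCII letter or digit becomes a separator,
# and stop words are removed.
def normalize(text)
  text.downcase
      .gsub(/[^a-z0-9]+/, " ")
      .split
      .reject { |word| STOP_WORDS.include?(word) }
end

# Return the first title whose normalized words equal those of the
# URL slug, or nil if none matches.
def match_title(url, titles)
  wanted = normalize(slug_from_url(url))
  titles.find { |title| normalize(title) == wanted }
end

# The titles would come from the feed entries; they are hard-coded here.
titles = ["Hello world", "About the weather..."]
url = "http://blogger.example.com/2010/05/14/about-weather.html"
puts match_title(url, titles)
```

Comparing in this direction sidesteps the reversal problem, since the dropped dots and words cannot be recovered from the slug alone. The sketch would still miss slugs that were truncated or edited by hand, and titles with non-ASCII characters, so matching on each feed entry's own URL may turn out to be more reliable where the feed provides it.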

Thanks for the answers