Tuesday, October 02, 2007

Use GREP to find URLs

In a previous blog entry I wrote about the GREP search capability in InDesign CS3. GREP searches are great for making a computer recognize complex patterns. I recently developed the following GREP string for a client, and thought I'd share it here, hoping you might find it useful or educational.

The procedure below will allow you to quickly search for all URLs (Web addresses) in your text, and format them as non-breaking, or blue, or whatever formatting you wish.

1. Start InDesign, and choose Edit > Find/Change

2. Type the string below into the Find what: field (all on one line)

(?i)(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)

3. Copy the string below into the Change to: field

$0

4. Click on the More Options button, and then click on the magnifying glass icon next to the Change Format area. Enter whatever formatting you would like here, then click the OK button.

5. Click on the small disc icon at the top of the Find/Change dialog box, and give the Query a name. Now, any time that you want to find and change URLs, you can just choose this name from the Query drop-down list, indicate the scope of the search, and click on the Find button. Another GREP example is here.
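For readers who want to try the pattern outside InDesign, here is a minimal Python sketch of the same "Find what" string. It assumes the string is entered with no space before the domain-suffix group, and it uses Python's `re` module, whose behavior may differ from InDesign's GREP engine in edge cases. The names `URL_PATTERN` and `find_urls` are illustrative only.

```python
import re

# Python translation of the post's "Find what" string. The (?i) flag is
# expressed as re.IGNORECASE. Assumption: no space inside the pattern.
URL_PATTERN = re.compile(
    r"(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring the pattern matches, in document order."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = "Visit www.adobe.com or http://example.org/page for details"
print(find_urls(sample))
```

The "Change to" value $0 in the dialog corresponds to `m.group(0)` here: the entire matched text, left unchanged so that only the formatting is applied.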

23 comments:

Eric said...

This is great. Just what I needed to add links to a PDF I'm creating.

Do you have a similar GREP for email addresses?

Thanks.

Keith Gilbert said...

Eric, here is a "Find What" string for email addresses that I picked up somewhere. Hopefully this will work for you.

[\w-]+(?:\.[\w-]+)*@(?:[\w-]+\.)+[a-zA-Z]{2,7}
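The email string above can be tried in Python as well. This is only a sketch: `EMAIL_PATTERN` and `find_emails` are made-up names, and Python's `re` engine is assumed to treat this particular pattern the same way InDesign does.

```python
import re

# The email "Find what" string, copied unchanged into a Python raw string.
EMAIL_PATTERN = re.compile(r"[\w-]+(?:\.[\w-]+)*@(?:[\w-]+\.)+[a-zA-Z]{2,7}")


def find_emails(text):
    """Return all email-like substrings found in the text."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return EMAIL_PATTERN.findall(text)


sample = "Write to john.doe@adobe.com. Or try jdoe@adobe.com"
print(find_emails(sample))
```

Because the pattern must end in two to seven letters, a sentence-ending period after the address is left out of the match.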

Unknown said...

I found the following GREP string to catch more address variations:

[\S]+[.][\S]+

It will even pick up formats that do not have a standard prefix, for example:

w3.adobe.com
adobe.com

However, the problem is that it will also pick up any string that includes periods, with the obvious culprits being email addresses, for example:

john.doe@adobe.com
jdoe@adobe.com

I still haven't been able to figure out the negative/positive lookahead/lookbehind metacharacters to omit instances that include the "@" symbol.

As a workaround, I usually set up a character style (sometimes only defining it and not applying any overrides to it) and utilize it to tag the email addresses using the following GREP string:

[\S]+[@][\S]+

I perform the "Email Address" find/change to apply the character style first. I follow this up with the "Web Address" find/change which only picks up strings that have NOT had any character styles applied, hence leaving the formatted addresses alone.

Any ideas on how to make this easier?
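One possible single-pass approach, sketched in Python rather than InDesign: use a negative lookbehind and a negative lookahead so that the whole whitespace-delimited token must contain a period and no "@". The pattern below is a hypothetical refinement of [\S]+[.][\S]+, and it has not been verified in InDesign's GREP engine.

```python
import re

# (?<!\S)  : the match must start at the beginning of a token
# [^\s@]+  : non-whitespace characters other than "@"
# (?!\S)   : the match must run to the end of the token
WEB_ONLY = re.compile(r"(?<!\S)[^\s@]+\.[^\s@]+(?!\S)")


def find_web_addresses(text):
    """Return tokens that contain a period but no "@" sign."""
    return WEB_ONLY.findall(text)


sample = "See w3.adobe.com and adobe.com or mail john.doe@adobe.com today"
print(find_web_addresses(sample))
```

Like the original string, this still treats any token with an interior period as an address, and it keeps a trailing period when the address ends a sentence.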

Eric said...

Thanks. The email GREP works great.

I'm finding that the URL GREP sometimes grabs too much, like a period at the end of a URL at the end of a sentence.

Keith Gilbert said...

This comment thread illustrates how tricky it can be to set up "bulletproof" GREP searches. Oftentimes, a GREP string can be hammered out quickly that will solve 95% of one's needs. But to get it bulletproof so that each and every exception is considered can take a LOT more development time. This also illustrates that there is more than one way to skin a cat when it comes to GREP.

Sometime when I have a chance, I'll revisit the URL GREP search and see if I can't fine-tune it a bit.

Anonymous said...

Try this modification:

(?i)(http|ftp|www)(\S+)(\.\l{2,4})|(\S+)(\.\l{2,4})

Anonymous said...

yay for Anon, that's the one!

cheers fella

Colleen said...

WOW -- both the Emails and URL GREP were perfect!! THANK YOU THANK YOU!!

claidheamdanns said...

Anon, your expression misses some URLs that were caught by Keith's, for instance: http://www.website.com/links
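A rough way to see the difference is to run both strings in Python. This sketch assumes that InDesign's \l (any lowercase letter) can be approximated by [a-z], and that Python's `re` engine backtracks the same way here; `ANON_PATTERN` and `KEITH_PATTERN` are illustrative names.

```python
import re

# Anon's modification, with \l{2,4} approximated as [a-z]{2,4}.
ANON_PATTERN = re.compile(
    r"(http|ftp|www)(\S+)(\.[a-z]{2,4})|(\S+)(\.[a-z]{2,4})",
    re.IGNORECASE,
)

# The original string from the post (assumed to contain no space).
KEITH_PATTERN = re.compile(
    r"(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)",
    re.IGNORECASE,
)

url = "http://www.website.com/links"
anon_match = ANON_PATTERN.search(url).group(0)
keith_match = KEITH_PATTERN.search(url).group(0)
print(anon_match)
print(keith_match)
```

In this sketch the modified string must end in a period followed by two to four letters, so it stops at ".com" and leaves "/links" unmatched, while the original string takes the whole address.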

claidheamdanns said...

Interestingly, Keith's email address search does end up (correctly) missing the period at the end of a sentence, even if the last thing in the sentence is an email address.

claidheamdanns said...

BTW, thanks Keith. I've been puzzling over this URL one for about a week now, trying to find the right combo of code to get it all.

Another awesome way to use this, rather than as a find and replace, use it as a GREP style in the paragraph style(s) used in the document. For instance, a Character style to make the font blue, and underlined, can be automatically applied via GREP within the Paragraph style, so that emails and URLs are automatically formatted as they are typed or added to the document.

Keith Gilbert said...

@Claidheamdanns: I'm glad you found the GREP examples useful. Yes indeed, GREP styles are wonderful! The original blog post was written "way back in the old days" before GREP styles existed (they made their debut in CS4).

claidheamdanns said...

Yes, that's true. And that alone is a great reason to upgrade. I'm spoiled at work. At home I am still on CS3, and I miss all these good features. Work that used to literally take 2 hours before can now be done in 5 minutes or less (no exaggeration).

Have you figured out a way to get the URLs to exclude the period when they fall at the end of a sentence? That would be very helpful for a piece we are working on.

Thanks for your tips on here. Very helpful!

Keith Gilbert said...

@Claidheamdanns: I've put the period at the end of the sentence issue on my list of things to look into when I have a minute. If I find a solution I'll create a new blog post about it. Thanks.

Halvor said...

Hi, thank you so much for this article, and all the useful comments!
I'm a complete novice, I found out about GREP yesterday, so bear with me if my modifications are crude, but these seem to work:

url-GREP (finish with a letter or digit)
(?i)(http|ftp|www)\S+[\l\u|\d]

e-mail-GREP (finish with a letter)
[\S]+[@][\S]+[\l]
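These two strings can be checked in Python with a small sketch. It assumes \l and \u can be approximated by ASCII letter ranges; the "|" inside the brackets is a literal pipe character and is kept for fidelity. `URL_TRIMMED`, `EMAIL_TRIMMED`, and `first_match` are made-up names.

```python
import re

# URL string: must finish with a letter, a digit, or a literal "|".
URL_TRIMMED = re.compile(r"(http|ftp|www)\S+[a-z|\d]", re.IGNORECASE)

# Email string: must finish with a lowercase letter.
EMAIL_TRIMMED = re.compile(r"\S+@\S+[a-z]")


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(URL_TRIMMED, "Go to http://www.website.com/links."))
print(first_match(EMAIL_TRIMMED, "Mail john.doe@adobe.com."))
```

Forcing the final character to be a letter or digit drops a sentence-ending period, but it would also drop a trailing slash from an address such as http://www.website.com/.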

sampletheworld said...

I've been working at it for a while and don't understand all of it, but this one works great and leaves out the periods at the end. The only hitch I've found is that it also picks up email addresses since it's essentially looking for ".com" or ".org" etc and catching anything attached.

\S*?(\.org|\.com|\.net|\.edu|\.us|\.gov|\.biz)+\S*\>/?

Anonymous said...

Don't know much about grapes but how come this grep also finds URLs like www.testsite.nl? So ending in '.nl'?

(?i)(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)

It looks like it should only find URLs ending with .gov, .us, .net, etc. But I'm probably reading it wrong? ;-)

Keith Gilbert said...

@Anonymous: The reason that this GREP string finds "www.testsite.nl" even though .nl isn't referenced in the GREP is this:

The GREP is saying "look for 1 or more consecutive non-whitespace characters preceded by http, ftp, or www

OR

1 or more consecutive non-whitespace characters followed by .gov, .us, .net, .com, .edu, .org, or .biz"

It's written this way to try to accommodate a wide variety of URLs.
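The two halves of the OR can be seen at work in a short Python sketch. It assumes the string has no space in it and that Python's `re` engine evaluates the alternation the same way; `matching_branch` is a hypothetical helper that reports which half produced the match.

```python
import re

URL_PATTERN = re.compile(
    r"(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)",
    re.IGNORECASE,
)


def matching_branch(text):
    """Report which side of the OR matched: 'prefix', 'suffix', or None."""
    m = URL_PATTERN.search(text)
    if m is None:
        return None
    # Group 1 is only set when the http/ftp/www half matched.
    return "prefix" if m.group(1) is not None else "suffix"


print(matching_branch("www.testsite.nl"))  # found through the www prefix
print(matching_branch("adobe.com"))        # found through the .com suffix
print(matching_branch("testsite.nl"))      # neither half applies
```

So "www.testsite.nl" is caught by the first half alone; the list of suffixes only matters for addresses that lack an http, ftp, or www prefix.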

Anonymous said...

What about this one ?:

(http(s)?.{3})?(www\.\S*\w*\d*\.\S*)

claidheamdanns said...

Close. That caught http://www.sterlingstudiosinc.com and www.sterlingstudiosinc.com, but it also caught the periods after them, if they fell at the end of a sentence. And it didn't catch web addresses that did not include a www.

Makk1000 said...

This should work on all but the craziest web addresses

[-\u\l\d._/:]+[.][-\u\l\d_]+[-\u\l\d._%]+\.[\u\l\d/%:]{2,100}

If it doesn't work on any address let me know and I'll see if I can fix it :)

Anonymous said...

(?i)(http|ftp|www)(\S+)|(\S+)(\.gov|\.us|\.net|\.com|\.edu|\.org|\.biz)

This very first string saved my day. I could find and remove 500 nested URLs in an instant.

THANK YOU!

jaceks said...

(http|ftp|www)+(:|.|)\S+