Java code to get URL from a string

15 Aug

This little function extracts URLs from a string in Java. I found the basic regex for doing it here, and wrapped it in a Java function.

I extended the basic regex with the alternative “|www[.]” so that it also catches links that do not start with “http://”.

Enough talk (it is cheap), here’s the code:

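import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<>();

    // An optional "(", then "http://" or "www.", then URL characters.
    // The last character may not be trailing punctuation such as "." or ",".
    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip parentheses that wrap the whole link, e.g. "(www.example.com)"
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}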
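
To see what the function returns in practice, here is a minimal, self-contained sketch that runs the same regex over a sample sentence. The class name PullLinksDemo, the static wrapper and the sample text are only assumptions for illustration; they are not part of the original snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and the loop come from the post.
class PullLinksDemo {
    // Same pattern as in the post, compiled once and reused.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    static List<String> pullLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            String urlStr = m.group();
            // Drop parentheses that wrap the whole link.
            if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
                urlStr = urlStr.substring(1, urlStr.length() - 1);
            }
            links.add(urlStr);
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input: one http:// link followed by a comma,
        // and one www. link wrapped in parentheses.
        String text = "See http://example.com/page?id=1, or (www.example.org) for details.";
        System.out.println(pullLinks(text));
        // prints [http://example.com/page?id=1, www.example.org]
    }
}
```

The trailing comma is left out because the last character class of the regex excludes punctuation such as “,” and “.”. Note that the pattern as written does not recognise an “https://” prefix (only a “www.” part after it, if present, would be picked up); changing “http://” to “https?://” in the regex is one way to cover that case.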
 
4 Comments

Posted in Code

 

  1. scb

    November 5, 2010 at 09:23

    Nice article.

    As I’m new to regex, please help me find obfuscated URLs like the following. Thanks in advance.

    www(dot)example(dot)com

     
    • Houen

      November 8, 2010 at 17:47

      You’ll want something like this:
      www[(]dot[)][a-zA-Z][a-zA-Z0-9]+[(]dot[)][a-zA-Z]+

       
      • scb

        November 9, 2010 at 01:51

        Thanks Houen, for your prompt reply.

        My earlier post was probably not so clear. We are filtering URLs like “www.xyz.com”, but some intelligent users :) are using obfuscated URLs like the following:

        www(dot)xyz(dot)com
        www (dot) xyz (dot) com
        www[dot]xyz[dot]com
        www [dot] xyz [dot] com
        www{dot}xyz{dot}com
        www {dot} xyz {dot} com

        So my question is how to find the above patterns in a string. Thanks in advance.

         
      • scb

        November 9, 2010 at 02:47

        Hi Houen

        I resolved this issue as follows. Please let me know your comments. Thanks

        import java.util.regex.*;

        public class Replacement {
            public static void main(String[] args) throws Exception {

                // Create a pattern to match "dot" wrapped in (), [] or {}
                Pattern p = Pattern.compile("\\(dot\\)|\\[dot\\]|\\{dot\\}");

                // Create a matcher with an input string
                Matcher m = p.matcher("www(dot)example(dot)com www[dot]example[dot]com www{dot}example{dot}com");

                // Loop through and create a new String with the replacements
                boolean result = m.find();
                StringBuffer sb = new StringBuffer();
                while (result) {
                    m.appendReplacement(sb, ".");
                    result = m.find();
                }

                // Add the last segment of input to the new String
                m.appendTail(sb);
                System.out.println(sb.toString());
            }
        }
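
The replacement above covers the compact forms, but not the spaced variants from the earlier list, such as “www (dot) xyz (dot) com”. A possible extension is sketched below, assuming it is acceptable to swallow any whitespace around the bracketed “dot”. The class name DeobfuscateDemo and the method name deobfuscate are hypothetical, and the case-insensitive flag is an extra assumption that was not part of the original question.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a sketch, not a complete obfuscation filter.
class DeobfuscateDemo {
    // Optional whitespace, then "dot" inside a matching (), [] or {} pair,
    // then optional whitespace. CASE_INSENSITIVE also accepts "(DOT)".
    private static final Pattern OBFUSCATED_DOT = Pattern.compile(
            "\\s*(?:\\(dot\\)|\\[dot\\]|\\{dot\\})\\s*", Pattern.CASE_INSENSITIVE);

    static String deobfuscate(String text) {
        // "." has no special meaning in a replacement string,
        // so it is inserted literally.
        return OBFUSCATED_DOT.matcher(text).replaceAll(".");
    }

    public static void main(String[] args) {
        System.out.println(deobfuscate("www(dot)xyz(dot)com"));     // prints www.xyz.com
        System.out.println(deobfuscate("www [dot] xyz [dot] com")); // prints www.xyz.com
        System.out.println(deobfuscate("www {dot} xyz {dot} com")); // prints www.xyz.com
    }
}
```

The normalised text can then be passed to the link extraction function from the post. The trade-off is that ordinary prose containing a bracketed “dot” gets rewritten too, for example “see (dot) here” becomes “see.here”, so this is best applied only for detection and not to text that is shown back to users.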