This little function extracts URL strings from a string in Java. I found the basic regex for doing it here, and used it in a Java function.
I expanded on the basic regex a bit with the part “|www[.]” in order to also catch links that start with “www.” rather than “http://”.
Enough talk (it is cheap), here’s the code:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip the surrounding parentheses from links written as "(http://...)"
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}
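To see what the “|www[.]” alternative adds, here is a small sketch that runs the same regex with and without it. The class name `PrefixComparison` and the helper `extract` are made up for this illustration and are not part of the function above; the sketch also leaves out the parenthesis stripping to stay short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixComparison {
    // Tail of the URL regex from the article, unchanged.
    private static final String TAIL =
            "[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";

    // Hypothetical helper: builds the regex around the given prefix alternatives
    // and returns every match found in the text.
    static List<String> extract(String prefix, String text) {
        Pattern p = Pattern.compile("\\(?\\b(" + prefix + ")" + TAIL);
        Matcher m = p.matcher(text);
        List<String> links = new ArrayList<>();
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://example.com/docs and www.example.org for details.";
        // Basic prefix only: the www link is missed.
        System.out.println(extract("http://", text));
        // prints [http://example.com/docs]
        // Expanded prefix: both links are found.
        System.out.println(extract("http://|www[.]", text));
        // prints [http://example.com/docs, www.example.org]
    }
}
```

Note that neither variant matches “https://” links, since the prefix only lists “http://” and “www.”.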
scb
November 5, 2010 at 09:23
Nice article.
As I’m new to regex, please help me find obfuscated URLs such as the following. Thanks in advance.
www(dot)example(dot)com
Houen
November 8, 2010 at 17:47
You’ll want something like this:
www[(][.][)][a-zA-Z][a-zA-Z0-9]+[(][.][)][a-zA-Z]+
scb
November 9, 2010 at 01:51
Thanks Houen, for your prompt reply.
My earlier post was probably not clear. We are filtering URLs like “www.xyz.com”, but some clever users
are using obfuscated URLs like the following:
www(dot)xyz(dot)com
www (dot) xyz (dot) com
www[dot]xyz[dot]com
www [dot] xyz [dot] com
www{dot}xyz{dot}com
www {dot} xyz {dot} com
So my question is how to find the above patterns in a string. Thanks in advance.
scb
November 9, 2010 at 02:47
Hi Houen
I resolved this issue as follows. Please let me know your comments. Thanks.
import java.util.regex.*;
public class Replacement {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match (dot), [dot] or {dot}
        Pattern p = Pattern.compile("\\(dot\\)|\\[dot\\]|\\{dot\\}");
        // Create a matcher with an input string
        Matcher m = p.matcher("www(dot)example(dot)com www[dot]example[dot]com www{dot}example{dot}com");
        // Loop through and create a new String with the replacements
        boolean result = m.find();
        StringBuffer sb = new StringBuffer();
        while (result) {
            m.appendReplacement(sb, ".");
            result = m.find();
        }
        // Add the last segment of input to the new String
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
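The spaced variants from the list above, such as “www (dot) xyz (dot) com”, are not covered by a pattern that only matches the bracketed word itself. A minimal sketch of one way to handle them, assuming it is acceptable to swallow whitespace on either side of the marker, follows. The class name `DotNormalizer` and the method `normalize` are hypothetical, and the case-insensitive flag is an extra assumption.

```java
import java.util.regex.Pattern;

public class DotNormalizer {
    // Optional whitespace, an opening bracket, "dot", a closing bracket,
    // optional whitespace. Mismatched pairs such as "(dot]" also match.
    private static final Pattern OBFUSCATED_DOT =
            Pattern.compile("\\s*[(\\[{]dot[)\\]}]\\s*", Pattern.CASE_INSENSITIVE);

    // Replaces every obfuscated dot marker with a plain "."
    static String normalize(String text) {
        return OBFUSCATED_DOT.matcher(text).replaceAll(".");
    }

    public static void main(String[] args) {
        System.out.println(normalize("www(dot)xyz(dot)com"));     // prints www.xyz.com
        System.out.println(normalize("www [dot] xyz [dot] com")); // prints www.xyz.com
        System.out.println(normalize("www {dot} xyz {dot} com")); // prints www.xyz.com
    }
}
```

The normalized text can then be passed to the usual URL filter. The trade-off is that ordinary prose containing “(dot)” is rewritten as well, so this is probably best applied to a copy of the text used only for filtering.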
John Ortiz
April 17, 2012 at 21:31
Thanks for this. It is working correctly. See you later.
lmn4971
August 28, 2012 at 13:04
Another way of doing this is to split the string with parameters that define where the URL is:
String extractedurl = in.readLine().split("='|'")[1];
where the URL appears in the form ='URL'
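A rough, self-contained sketch of that split approach is below. It assumes the line holds exactly one single-quoted value introduced by “='”; the class name `SplitExtract`, the method `extractQuoted` and the sample HTML line are invented for the example. Returning an `Optional` instead of indexing `[1]` directly avoids an `ArrayIndexOutOfBoundsException` when the line has no quoted value.

```java
import java.util.Optional;

public class SplitExtract {
    // Splits on either "='" or a lone "'" and returns the second piece,
    // which is the quoted value when the line looks like ...='URL'...
    static Optional<String> extractQuoted(String line) {
        String[] parts = line.split("='|'");
        return parts.length > 1 ? Optional.of(parts[1]) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "<a href='http://example.com/page'>link</a>";
        System.out.println(extractQuoted(line).orElse("(none)"));
        // prints http://example.com/page
        System.out.println(extractQuoted("no quoted value").orElse("(none)"));
        // prints (none)
    }
}
```

An apostrophe earlier in the line (as in “it's”) would shift the pieces, so this only suits input with a known, fixed layout.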
voji
April 21, 2013 at 22:40
Great article. I tweaked the first part of the regexp to:
(https?://|www[.]|ftp://)
Now it accepts https and ftp too.
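A quick way to check which link openings that tweaked prefix accepts is to test it on its own, as in the sketch below. The class name `SchemePrefixCheck` and the method `hasAcceptedPrefix` are hypothetical; only the prefix group is taken from the comment above.

```java
import java.util.regex.Pattern;

public class SchemePrefixCheck {
    // The tweaked first part of the regex, tested in isolation.
    private static final Pattern PREFIX =
            Pattern.compile("(https?://|www[.]|ftp://)");

    // True when the candidate string starts with one of the accepted prefixes.
    static boolean hasAcceptedPrefix(String candidate) {
        return PREFIX.matcher(candidate).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(hasAcceptedPrefix("http://example.com"));      // prints true
        System.out.println(hasAcceptedPrefix("https://example.com"));     // prints true
        System.out.println(hasAcceptedPrefix("ftp://files.example.org")); // prints true
        System.out.println(hasAcceptedPrefix("www.example.net"));         // prints true
        System.out.println(hasAcceptedPrefix("mailto:me@example.com"));   // prints false
    }
}
```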