2013-07-02

HtmlUnit: lots of "Obsolete content type" and "can't create object" warnings on getPage(), then fails with "Exception invoking setOuterHTML" on getByXPath()

I'm trying out HtmlUnit to automate pulling data from a web application. However, I get a whole mess of warnings on getPage() (most of which seem to concern linked scripts that I don't even need), followed by a fatal com.gargoylesoftware.htmlunit.ScriptException: Exception invoking setOuterHTML when I try to run getByXPath to retrieve the data I'm looking for. Because of all the errors, I can't quite figure out what is going on. Any ideas?

Here is my code:

import java.util.List; 

import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.BrowserVersion; 
import com.gargoylesoftware.htmlunit.html.HtmlAnchor; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 

public class ScrapperApp { 

    private static void go() throws Exception {
        HtmlPage nextPage;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";

        final WebClient webclient = new WebClient();
        final HtmlPage page = webclient.getPage(url);

        System.out.println("PULLING LINKS:");

        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']");

        /*for(int x=0; x<articles.size(); x++) {
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getBody());
        }*/
    }

    public static void main(String[] args) throws Exception {
        go();
        System.out.println("COMPLETE");
    }

} 

And here is my console output:

Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[443] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[448] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[456] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:52 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:53 PM com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument execCommand 
WARNING: Nothing done for execCommand(BackgroundImageCache, ...) (feature not implemented) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/search/theethics.css' [1621:72] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/search/theethics.css' [1621:72] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/search/theethics.css' [1722:1] Error in style sheet. (Invalid token ".123". Was expecting one of: <EOF>, <S>, <IDENT>, "<!--", "-->", <HASH>, <IMPORT_SYM>, <PAGE_SYM>, <MEDIA_SYM>, ".", ":", "*", "[", <ATKEYWORD>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [4:1] Error in style rule. (Invalid token ".". Was expecting one of: <S>, <LBRACE>, <COMMA>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [4:1] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [538:16] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [538:16] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [6:1] Error in style rule. (Invalid token ".". Was expecting one of: <S>, <LBRACE>, <COMMA>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [6:1] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [105:17] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [105:17] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [160:16] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [160:16] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://media.ethics.ga.gov/Search/Telerik.Web.UI.WebResource.axd?_TSM_HiddenField_=ctl00_ContentPlaceHolder1_RadScriptManager1_TSM&compress=1&_TSM_CombinedScripts_=%3b%3bSystem.Web.Extensions%2c+Version%3d3.5.0.0%2c+Culture%3dneutral%2c+PublicKeyToken%3d31bf3856ad364e35%3aen-US%3a7263e9c6-5962-41bc-b839-88b704bfcf0d%3aea597d4b%3ab25378d2%3bTelerik.Web.UI%2c+Version%3d2011.2.915.35%2c+Culture%3dneutral%2c+PublicKeyToken%3d121fae78165ba3d4%3aen-US%3a168ec6eb-791b-4159-8a0f-6c601196f873%3a16e4e7cd%3af7645509%3a24ee1bba%3af46195d3%3a19620875%3a874f8ea2%3a490a9d4e%3abd8f85e4%3bAjaxControlToolkit%2c+Version%3d3.0.20820.16598%2c+Culture%3dneutral%2c+PublicKeyToken%3d28f01b0e84b6d53e%3aen-US%3a707835dd-fa4b-41d1-89e7-6df5d518ffb5%3ab14bb7d5%3a13f47f54%3a369ef9d0%3a1d056c78%3adc2d6e36%3a5acd2e8e%3af8a45328] line=[997] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:56 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
PULLING LINKS: 
Jul 2, 2013 6:19:56 PM com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl runSingleJob 
SEVERE: Job run failed with unexpected RuntimeException: Exception invoking setOuterHTML 
======= EXCEPTION START ======== 
Exception class=[java.lang.RuntimeException] 
com.gargoylesoftware.htmlunit.ScriptException: Exception invoking setOuterHTML 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:663) 
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:53) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:101) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:328) 
    at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:161) 
    at java.lang.Thread.run(Thread.java:680) 
Caused by: java.lang.RuntimeException: Exception invoking setOuterHTML 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$GetterSlot.setValue(ScriptableObject.java:287) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$RelinkedSlot.setValue(ScriptableObject.java:359) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putImpl(ScriptableObject.java:2659) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.put(ScriptableObject.java:509) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putProperty(ScriptableObject.java:2364) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1601) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1595) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1248) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415) 
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651) 
    ... 10 more 
Caused by: java.lang.IllegalStateException: Previous sibling for HtmlDivision[<div style="height: 0px; overflow: hidden; border-top: solid black; border-top-width: thick;">] is null. 
    at com.gargoylesoftware.htmlunit.html.DomNode.insertBefore(DomNode.java:1023) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement$ProxyDomNode.appendChild(HTMLElement.java:1091) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.handleCharacters(HTMLParser.java:710) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endDocument(HTMLParser.java:718) 
    at org.apache.xerces.parsers.AbstractSAXParser.endDocument(Unknown Source) 
    at org.cyberneko.html.HTMLTagBalancer.endDocument(HTMLTagBalancer.java:510) 
    at org.cyberneko.html.filters.DefaultFilter.endDocument(DefaultFilter.java:213) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2116) 
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) 
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:818) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:162) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:121) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.parseHtmlSnippet(HTMLElement.java:1048) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.setOuterHTML(HTMLElement.java:1035) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137) 
    ... 26 more 
Enclosed exception: 
java.lang.RuntimeException: Exception invoking setOuterHTML 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$GetterSlot.setValue(ScriptableObject.java:287) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$RelinkedSlot.setValue(ScriptableObject.java:359) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putImpl(ScriptableObject.java:2659) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.put(ScriptableObject.java:509) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putProperty(ScriptableObject.java:2364) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1601) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1595) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1248) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415) 
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651) 
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:53) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:101) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:328) 
    at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:161) 
    at java.lang.Thread.run(Thread.java:680) 
Caused by: java.lang.IllegalStateException: Previous sibling for HtmlDivision[<div style="height: 0px; overflow: hidden; border-top: solid black; border-top-width: thick;">] is null. 
    at com.gargoylesoftware.htmlunit.html.DomNode.insertBefore(DomNode.java:1023) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement$ProxyDomNode.appendChild(HTMLElement.java:1091) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.handleCharacters(HTMLParser.java:710) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endDocument(HTMLParser.java:718) 
    at org.apache.xerces.parsers.AbstractSAXParser.endDocument(Unknown Source) 
    at org.cyberneko.html.HTMLTagBalancer.endDocument(HTMLTagBalancer.java:510) 
    at org.cyberneko.html.filters.DefaultFilter.endDocument(DefaultFilter.java:213) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2116) 
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) 
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:818) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:162) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:121) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.parseHtmlSnippet(HTMLElement.java:1048) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.setOuterHTML(HTMLElement.java:1035) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137) 
    ... 26 more 
== CALLING JAVASCRIPT == 

    function() { 
     return b.apply(a, arguments); 
    } 

======= EXCEPTION END ======== 
COMPLETE 

Answer


The error comes from the MicrosoftAjax.js file. Try emulating Chrome:

final WebClient webclient = new WebClient(BrowserVersion.CHROME); 

I also added a line to suppress the HtmlUnit warnings.
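The timestamped `WARNING:` and `SEVERE:` lines in the question's console output have the shape of java.util.logging's default formatter, which suggests HtmlUnit's log messages end up in java.util.logging in this setup. Under that assumption, a level set on the parent logger `com.gargoylesoftware` also applies to every logger below it that has no level of its own. The sketch below checks that inheritance using only the standard library; the class name `QuietHtmlUnitLogs` and both helper methods are made up for illustration and are not part of HtmlUnit.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogs {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger that nothing references may be garbage collected and
    // its level setting lost.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off the com.gargoylesoftware logger; child loggers inherit the level. */
    public static void silenceHtmlUnit() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Reports whether the named logger would currently log a message at the given level. */
    public static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        // One of the logger names that appears in the console output.
        String child = "com.gargoylesoftware.htmlunit.DefaultCssErrorHandler";

        System.out.println("WARNING loggable before: " + wouldLog(child, Level.WARNING));
        silenceHtmlUnit();
        System.out.println("WARNING loggable after:  " + wouldLog(child, Level.WARNING));
    }
}
```

Note that `Level.OFF` hides the `SEVERE` entries as well, including the script exception, so it is probably best applied only once the scraper works. If another logging backend such as log4j is on the classpath, this call may have no effect and that backend would need to be configured instead.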

Your XPath also doesn't find anything (I tested it in Chrome). I used a different one as an example:

import java.util.List; 

import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.BrowserVersion; 
import com.gargoylesoftware.htmlunit.html.HtmlAnchor; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 

public class ScrapperApp { 

    private static void go() throws Exception {
        /* turn off annoying htmlunit warnings */
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        HtmlPage nextPage;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";

        // Emulate Chrome instead of the default browser version
        final WebClient webclient = new WebClient(BrowserVersion.CHROME);
        final HtmlPage page = webclient.getPage(url);

        System.out.println("PULLING LINKS:");

        // Match every anchor with the class, wherever it sits in the page
        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//a[@class='lblentrylink']");
        //List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']");

        for(int x=0; x<articles.size(); x++) {
            System.out.println("Clicking "+articles.get(x).asText());
            //nextPage = articles.get(x).click();
            //System.out.println(nextPage.getBody());
        }
    }

    public static void main(String[] args) throws Exception {
        go();
        System.out.println("COMPLETE");
    }
} 
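The two XPath expressions in the code above differ in more than length: `div[...]/a[...]` selects only anchors that are direct children of the div, while `//a[...]` selects matching anchors anywhere in the document. One possible reason the original expression finds nothing is that the links are nested one level deeper than expected, though the real page's markup is not shown in this thread. The sketch below illustrates the difference with the JDK's own XPath engine on a small hypothetical snippet; `XPathAxisDemo`, `SAMPLE` and `countMatches` are illustrative names, and the markup is invented.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAxisDemo {

    // Hypothetical markup, NOT copied from the real page: the first link sits
    // inside a <span>, so it is a descendant of the div but not a direct child.
    static final String SAMPLE =
        "<html><body>"
      + "<div class='hform1'><span><a class='lblentrylink' href='#'>Report 1</a></span></div>"
      + "<p><a class='lblentrylink' href='#'>Report 2</a></p>"
      + "</body></html>";

    /** Counts the nodes in SAMPLE selected by the given XPath expression. */
    public static int countMatches(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//div[@class='hform1']/a[@class='lblentrylink']",   // direct children only
            "//div[@class='hform1']//a[@class='lblentrylink']",  // any descendant of the div
            "//a[@class='lblentrylink']"                         // anywhere in the document
        };
        for (String expression : expressions) {
            System.out.println(countMatches(expression) + " match(es): " + expression);
        }
    }
}
```

If the links should still be limited to one container, the middle form (`//div[...]//a[...]`) keeps that restriction without requiring a direct parent-child relationship.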

That works. Thanks a lot! – Jeff