Make the multi-threading interface a bit more flexible #44
Conversation
…ding interface, deprecate BFSMT scraper
Oh, and I should also add that obviously the main speed decrease is going to be the web driver pool. Even using BFSMTStrategy with a large number (24+) of threads, most of the threads are eventually going to end up stuck waiting for another thread to release a driver back into the pool. And for BFSStrategy, since tasks hold on to their web drivers while loading, other threads will also block waiting for them.
    }
}
PositionWriter writer = new PositionHibernateWriter();
We should write as we scrape, so if the scraper crashes we don't lose all progress.
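A hedged sketch of what that could look like, assuming the callback-based fetch from this PR and that the writer exposes a save method (the exact wiring is an assumption):

```java
// Hypothetical: persist each batch of results as the callback delivers it,
// instead of writing everything once at the end. PositionWriter and
// PositionHibernateWriter come from the diff; save(List<Position>) is assumed.
PositionWriter writer = new PositionHibernateWriter();
positionScraper.fetch(company, results -> writer.save(results));
```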
import java.util.List;

public interface PositionCallback {
maybe just use a runnable here, it's so similar anyways
also we are using the I- prefix convention for interfaces. Look into anonymous classes too; I think this is unnecessary
I was thinking about using a Runnable, but it would still require us to extend the class. The idea behind the callback class was just for it to run with the finished results upon completion, so we don't have to worry about resolving the function stack with return arguments. Will look more into this.
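For reference, a minimal sketch of the callback shape being discussed (the method name is illustrative; the real interface is in this PR):

```java
import java.util.List;

// Position is the project's entity class. Marking the interface as a
// @FunctionalInterface lets callers pass a lambda, which also covers the
// anonymous-class suggestion above without extending anything.
@FunctionalInterface
public interface PositionCallback {
    void onComplete(List<Position> positions);
}
```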
}

public void setup(Company company) {
    mLinkCache.put(company.getName(), mInitialStrategy.fetchInitialLinks(company));
what's this second call doing here? Does it have some side effect you are relying on? If it does, then it's bad practice, because it doesn't make sense while reading it
Yeah, that was mainly used so that we can treat it as a separate task that the scheduled executor service can process concurrently along with the actual scraping. See Main::scrapePositions() to see how it's used. I admit though, it's not the best design.
@@ -10,5 +10,5 @@

public interface IPositionScraperStrategy {
    Logger logger = LoggerFactory.getLogger(PositionScraper.class);
-   List<Position> fetch(Company company, List<String> initialLinks);
+   void fetch(Company company, List<String> initialLinks, PositionCallback callback);
consider overloading the method so one doesn't require the callback
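For example, a default-method overload along these lines (the no-op callback is an assumption):

```java
// Convenience overload: callers that don't care about the results can omit
// the callback; this just delegates with a no-op.
default void fetch(Company company, List<String> initialLinks) {
    fetch(company, initialLinks, positions -> { /* discard results */ });
}
```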
-if(googled.isEmpty()) {
-    logger.warn("Link strategy could not find any initial links.");
+if (googled.isEmpty()) {
+    logger.warn(String.format("Link strategy could not find any initial links for %s.", company.getName()));
maybe add some quotes around the company name
try {
    Thread.sleep(PAGE_LOAD_DELAY_MS);
} catch (InterruptedException e) {
    logger.error("Could not wait for page to load.", e);
Could not wait for JavaScript to load*
List<Callable<List<Position>>> tasks = Lists.newArrayList();
try (MyWebDriverPool pool = new MyWebDriverPool()) {
    InitialLinkStrategy linkStrategy = new GoogleInitialLinkStrategy();
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(24);
is 24 a safe value?
Eventually, the executor will be throttled by the number of web drivers available, so it doesn't really matter what that value is. As long as it isn't an insanely high number, I think it's ok.
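One way to make that bound explicit would be to derive the thread count from the driver pool, e.g. (MAX_DRIVERS is a hypothetical constant):

```java
// Hypothetical: cap the executor at the number of drivers the pool can hand
// out, since extra threads would only block on the pool's semaphore anyway.
int threads = Math.min(24, MyWebDriverPool.MAX_DRIVERS);
ScheduledExecutorService executor = Executors.newScheduledThreadPool(threads);
```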
-try {
-    futures = executor.invokeAll(tasks);
+latch.await();
shut down the executor too
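Something like this, as a sketch (assuming the latch and executor from the diff above; the timeout is arbitrary):

```java
latch.await();               // wait for all scrape tasks to complete
executor.shutdown();         // stop accepting new tasks
if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
    executor.shutdownNow();  // force-cancel anything still running
}
```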
executor.execute(() -> {
    positionScraper.setup(company);
    executor.execute(() -> {
        positionScraper.fetch(company, (intermediate) -> {
rename intermediate -> results
@@ -114,15 +103,17 @@ private static void scrapePositions(String name) {
}

private static void scrapePositions(List<Company> companies) {
make it clearer that this method is for single threading. Why are we still maintaining single threading as well?
Not sure. It should probably be transitioned over to the multi-threading variant.
How will this occur? How can a task hold on to a driver from the pool after it finishes executing?
We discussed this before and I thought you said the performance increase would be negligible. Was the performance increase good enough to warrant these changes?
If you don't mind, can you provide some benchmarks between master and this branch? Just a depth of 20-25 should be enough to demonstrate noticeable differences.
Right now, you're acquiring a driver briefly, fetching the page source as fast as possible, releasing the driver back into the pool, and then waiting. The actual behavior should be to acquire the driver, call fetch, wait 2 seconds, get the page source, and then release the driver. Because it has to hold on to the driver for those 2 seconds, it can't release it. Then, when you put the task back into the scheduled executor service, it gets put last in the task queue. The tasks ahead of it were previously added and are waiting on the semaphore to unlock so they can also acquire a driver. So essentially what happens is the tasks ahead wait for the semaphore, preventing the waiting task from finishing and releasing the driver, causing deadlock.
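To illustrate the failure mode with a hedged sketch (the pool's acquire/release API is an assumed name):

```java
// A task holds a driver across a scheduled delay. Its continuation is queued
// behind tasks that are blocked on the pool's semaphore; if every worker
// thread is occupied by one of those blocked tasks, the continuation never
// runs and the driver is never released: deadlock.
executor.execute(() -> {
    WebDriver driver = pool.acquire();         // takes a semaphore permit
    driver.get(url);                           // kick off the page load
    executor.schedule(() -> {                  // continuation joins the back
        String html = driver.getPageSource();  //   of the task queue
        pool.release(driver);                  // may never be reached
    }, 2, TimeUnit.SECONDS);
});
```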
It's not necessarily just a performance thing. I didn't like how each scraper was using its own executor service, dumping all its tasks on that single thread, and then just blocking its main thread. You're effectively doubling the number of threads you're using unnecessarily, because the scraper's main thread (in the original work stealing pool) is just stuck waiting for the executor service thread to finish everything. Not to mention the hacks for shutting down the service on completion. It's easier just to keep a single executor service open for all tasks and then let it do the prioritizing.
… the list callback
Benchmarks will be updated as they finish. Companies used:
Constants:
Commands used:
Note that Citadel LLC has a space; the command is not split up there. Results (best of two trials):
Conclusion:
@@ -9,6 +9,7 @@

# Log files
*.log*
+/logs/
Do you need a starting forward slash here?
As in logs/ instead of /logs/
@@ -11,6 +11,7 @@
import com.internhub.data.positions.scrapers.IPositionScraper;
import com.internhub.data.positions.scrapers.ScheduledPositionScraper;
import com.internhub.data.positions.scrapers.strategies.impl.GoogleInitialLinkStrategy;
+import com.internhub.data.positions.scrapers.strategies.impl.PositionBFSMTStrategy;
does this import get used?
Tackles #43.
Much of the earlier multi-threading code had cruft that wasn't really doing much. For example: the unnecessary spawning of an entire scheduled executor service per scraper per company, when that service had only one thread itself and the main thread per company was just blocking on the individual service.
The new proposal is this: keep a massive scheduled executor service available to all scrapers. Now, treat each scraper process as simply a chain of Runnables (maybe with delays) to be executed in sequence, ending in a callback that returns a completed list of positions. The naive BFSStrategy will simply be a giant Runnable that does the entire BFS process + scraping + waiting without deferring to the service. The faster BFSMTStrategy will be better broken up, as sketched below:
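Roughly, one link in that chain might look like this hedged sketch (startPageLoad and parseLoadedPage are assumed helpers; PAGE_LOAD_DELAY_MS appears in the diff):

```java
// Instead of Thread.sleep(), the continuation is re-submitted with a delay,
// so the worker thread is free to run other tasks in the meantime.
executor.execute(() -> {
    startPageLoad(url);                               // step 1: begin load
    executor.schedule(() -> {
        List<Position> found = parseLoadedPage(url);  // step 2: scrape
        callback.onComplete(found);                   // final: deliver results
    }, PAGE_LOAD_DELAY_MS, TimeUnit.MILLISECONDS);
});
```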
This seemingly simple change yields much faster results. Why? BFSStrategy runs all in one thread and to completion. This means that if you add 3N BFSStrategies to the executor service with N threads, the executor will complete the first N strategies added, then the second N strategies added, then the third N strategies added. When the N threads are stuck waiting, they are unable to do anything else. However, since BFSMTStrategy is broken up into separate tasks, this means that rather than waiting for a page to load, a thread can simply drop its task back into the pool and go try to take another available task. In this manner, the BFSMTStrategy runnables are continuously passed between threads, leading to a performance increase. (Note that there is still a performance increase using BFSStrategy + the scheduled executor service, which is what is enabled right now - more on this below).
However, BFSMTStrategy is 'broken' at the moment and has been deprecated. This is because it waits after it releases control of the driver rather than during. Waiting during acquisition of a driver from the driver pool must be handled entirely differently! In fact, it might be impossible to use the idea of a Runnable chain with how the driver pool is currently set up. This is because if a task holds on to a driver from the pool and is then put back into the service with a delay, it will get stuck behind other tasks waiting for the semaphore to be released. Essentially, it leads to deadlock. A solution might be to initialize the executor service with fewer threads than open drivers - this will be looked into later. Another problem is, of course, that you can't use a try-with-resources block when transferring control of a driver acquisition to another task, which could lead to deadlock if either task fails and doesn't release the driver.
It should be pretty easy to switch out strategies, since all you have to do is change the one line in Main::scrapePositions() where they are initialized.
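For instance (the strategy class names here are assumptions based on the imports above):

```java
// In Main::scrapePositions(), swap which strategy gets wired in:
IPositionScraperStrategy strategy = new PositionBFSStrategy(pool);
// IPositionScraperStrategy strategy = new PositionBFSMTStrategy(pool); // deprecated
```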